WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers
Abstract
WhisTLE, a text-only adaptation method using a variational autoencoder, enhances pretrained ASR models with text-to-latent encoding and optional TTS adaptation, reducing word error rates across multiple datasets.
Pretrained automatic speech recognition (ASR) models such as Whisper perform well but still need domain adaptation to handle unseen vocabulary and parlance. In many real-world settings, collecting speech data is impractical, necessitating text-only adaptation. We propose WhisTLE, a deeply supervised, text-only adaptation method for pretrained encoder-decoder ASR models. WhisTLE trains a variational autoencoder (VAE) to model encoder outputs from text and fine-tunes the decoder using the learned text-to-latent encoder, optionally combined with text-to-speech (TTS) adaptation. At inference, the original encoder is restored, incurring no extra runtime cost. Across four out-of-domain datasets and four ASR models, WhisTLE with TTS reduces word error rate (WER) by 12.3% relative to TTS-only adaptation and outperforms all non-WhisTLE baselines in 27 of 32 scenarios.
Community
Small pretrained automatic speech recognition (ASR) models such as Whisper perform well but still need domain adaptation to handle unseen parlance. In many real-world settings, collecting speech data is impractical, necessitating text-only adaptation. We propose WhisTLE, a deeply supervised, text-only adaptation method for pretrained encoder-decoder ASR models. WhisTLE trains a variational autoencoder (VAE) to model encoder outputs from text and fine-tunes the decoder using the learned text-to-latent encoder, optionally combined with text-to-speech (TTS) adaptation. At inference, the original encoder is restored, incurring no extra runtime cost. Across four datasets and four ASR models, WhisTLE with TTS reduces word error rate (WER) by 52.8% and outperforms all non-WhisTLE baselines in 27 of 32 scenarios.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- DeCRED: Decoder-Centric Regularization for Encoder-Decoder Based Speech Recognition (2025)
- A Survey on Non-Intrusive ASR Refinement: From Output-Level Correction to Full-Model Distillation (2025)
- TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling (2025)
- Large Language Model Data Generation for Enhanced Intent Recognition in German Speech (2025)
- VOX-KRIKRI: Unifying Speech and Language through Continuous Fusion (2025)
- MELA-TTS: Joint transformer-diffusion model with representation alignment for speech synthesis (2025)
- Analysis of Domain Shift across ASR Architectures via TTS-Enabled Separation of Target Domain and Acoustic Conditions (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper