nexaml/parakeet-tdt-0.6b-v2-MLX
Quickstart
Run the model directly with nexa-sdk installed. In the nexa-sdk CLI:
nexaml/parakeet-tdt-0.6b-v2-MLX
Overview
parakeet-tdt-0.6b-v2 is a 600-million-parameter automatic speech recognition (ASR) model designed for high-quality English transcription, featuring support for punctuation, capitalization, and accurate timestamp prediction. Try the demo here: https://huggingface.co/spaces/nvidia/parakeet-tdt-0.6b-v2
This XL variant of the FastConformer architecture integrates the TDT (Token-and-Duration Transducer) decoder and is trained with full attention, enabling efficient transcription of audio segments up to 24 minutes long in a single pass. The model achieves an RTFx (inverse real-time factor) of 3380 on the Hugging Face Open ASR Leaderboard with a batch size of 128. Note: RTFx performance may vary depending on dataset audio duration and batch size.
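As a rough illustration of what an RTFx figure means, the sketch below computes the inverse real-time factor as total audio duration divided by total processing time. The function name `rtfx` and the example durations are hypothetical and are not measurements of this model.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio transcribed per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# Hypothetical example: one hour of audio transcribed in 1.2 s of wall-clock time.
print(rtfx(3600.0, 1.2))

# Amortized compute time for a 24-minute (1440 s) segment at an RTFx of 3380,
# assuming the same batched conditions under which the RTFx was measured.
print(1440 / 3380)
```

A higher RTFx means faster transcription; a value of 1 would correspond to processing audio exactly in real time.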
Key Features
- Accurate word-level timestamp predictions
- Automatic punctuation and capitalization
- Robust performance on spoken numbers and song lyrics
For more information, refer to the Model Architecture section of the original model card and the NeMo documentation.
This model is ready for commercial/non-commercial use.
Benchmark Results
Hugging Face Open ASR Leaderboard Performance
The performance of Automatic Speech Recognition (ASR) models is measured using Word Error Rate (WER), where lower is better. Because this model is trained on a large and diverse dataset spanning multiple domains, it is generally robust and accurate across various types of audio.
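For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch; the function name `word_error_rate` is illustrative, and it splits on whitespace without the text normalization (casing, punctuation) that evaluation pipelines typically apply before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

The tables in this section report WER as a percentage, so a value returned by this sketch would be multiplied by 100 to compare.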
Base Performance
The table below summarizes the WER (%) using a Transducer decoder with greedy decoding (without an external language model):
Model | Avg WER | AMI | Earnings-22 | GigaSpeech | LS test-clean | LS test-other | SPGI Speech | TEDLIUM-v3 | VoxPopuli |
---|---|---|---|---|---|---|---|---|---|
parakeet-tdt-0.6b-v2 | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 |
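The Avg WER column appears to be an unweighted mean of the eight per-dataset scores. The short check below, with a hypothetical variable name `wer_by_dataset`, reproduces the reported average from the row above.

```python
from statistics import mean

# Per-dataset WER (%) copied from the base performance table.
wer_by_dataset = {
    "AMI": 11.16,
    "Earnings-22": 11.15,
    "GigaSpeech": 9.74,
    "LS test-clean": 1.69,
    "LS test-other": 3.19,
    "SPGI Speech": 2.17,
    "TEDLIUM-v3": 3.38,
    "VoxPopuli": 5.95,
}

# Unweighted (macro) average: each dataset counts equally regardless of size.
avg_wer = mean(list(wer_by_dataset.values()))
print(round(avg_wer, 2))  # → 6.05
```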
Noise Robustness
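Performance across different signal-to-noise ratios (SNR) using MUSAN music and noise samples. In the Relative Change column, negative values indicate a higher (worse) average WER than the clean baseline: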
SNR Level | Avg WER | AMI | Earnings-22 | GigaSpeech | LS test-clean | LS test-other | SPGI Speech | TEDLIUM-v3 | VoxPopuli | Relative Change |
---|---|---|---|---|---|---|---|---|---|---|
Clean | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 | - |
SNR 50 | 6.04 | 11.11 | 11.12 | 9.74 | 1.70 | 3.18 | 2.18 | 3.34 | 5.98 | +0.25% |
SNR 25 | 6.50 | 12.76 | 11.50 | 9.98 | 1.78 | 3.63 | 2.54 | 3.46 | 6.34 | -7.04% |
SNR 5 | 8.39 | 19.33 | 13.83 | 11.28 | 2.36 | 5.50 | 3.91 | 3.91 | 6.96 | -38.11% |
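The source does not describe how noise was mixed into the evaluation audio. As a sketch of the usual approach, the code below scales a noise signal so that the speech-to-noise power ratio matches a target SNR, assumed here to be in decibels. The function name `mix_at_snr` and the synthetic signals are hypothetical stand-ins for real speech and MUSAN samples.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so that 10*log10(P_speech / P_noise) == snr_db."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_noise == 0:
        raise ValueError("noise must not be silent")
    # Noise power required to hit the target ratio, then the amplitude gain to reach it.
    target_noise_power = p_speech / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / p_noise)
    return [s + gain * n for s, n in zip(speech, noise)]


# Synthetic stand-ins: a 440 Hz tone as "speech" and uniform random noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(16000)]
noisy = mix_at_snr(speech, noise, snr_db=5.0)
```

Lower SNR values mean proportionally louder noise, which matches the rising WER from SNR 50 down to SNR 5 in the table.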
Telephony Audio Performance
Performance comparison between standard 16 kHz audio and telephony-style audio (using μ-law encoding with 16 kHz → 8 kHz → 16 kHz conversion):
Audio Format | Avg WER | AMI | Earnings-22 | GigaSpeech | LS test-clean | LS test-other | SPGI Speech | TEDLIUM-v3 | VoxPopuli | Relative Change |
---|---|---|---|---|---|---|---|---|---|---|
Standard 16kHz | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 | - |
μ-law 8kHz | 6.32 | 11.98 | 11.16 | 10.02 | 1.78 | 3.52 | 2.20 | 3.38 | 6.52 | -4.10% |
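The exact telephony simulation pipeline is not given in the source. The sketch below illustrates the general idea under stated assumptions: a crude pairwise-average downsample to 8 kHz, continuous μ-law companding with μ = 255 quantized to 8 bits (not bit-exact G.711), and linear-interpolation upsampling back to 16 kHz. All function names are hypothetical, and a real pipeline would use a proper resampler.

```python
import math

MU = 255.0  # companding constant assumed here (the value used in 8-bit μ-law telephony)


def mulaw_encode(x: float) -> int:
    """Compand a sample in [-1, 1] and quantize it to an 8-bit code (0..255)."""
    x = max(-1.0, min(1.0, x))
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round((y + 1.0) / 2.0 * 255.0))


def mulaw_decode(code: int) -> float:
    """Invert mulaw_encode: map an 8-bit code back to a sample in [-1, 1]."""
    y = code / 255.0 * 2.0 - 1.0
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def telephony_round_trip(samples_16k):
    """16 kHz -> 8 kHz -> μ-law encode/decode -> 16 kHz (illustrative only)."""
    # Downsample by 2, averaging each pair as a very rough low-pass filter.
    narrow = [
        (samples_16k[i] + samples_16k[i + 1]) / 2.0
        for i in range(0, len(samples_16k) - 1, 2)
    ]
    decoded = [mulaw_decode(mulaw_encode(s)) for s in narrow]
    # Upsample by 2 with linear interpolation between neighboring samples.
    out = []
    for i, s in enumerate(decoded):
        nxt = decoded[i + 1] if i + 1 < len(decoded) else s
        out.append(s)
        out.append((s + nxt) / 2.0)
    return out


# A 300 Hz tone at 16 kHz, passed through the simulated telephone channel.
tone = [0.5 * math.sin(2 * math.pi * 300 * t / 16000) for t in range(1600)]
degraded = telephony_round_trip(tone)
```

The round trip discards content above 4 kHz and adds quantization noise, which is the kind of degradation the μ-law 8kHz row measures.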
These WER scores were obtained using greedy decoding without an external language model. Additional evaluation details are available on the Hugging Face ASR Leaderboard.
Reference
- Original model card: nvidia/parakeet-tdt-0.6b-v2
- Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition
- Efficient Sequence Transduction by Jointly Predicting Tokens and Durations
- NVIDIA NeMo Toolkit
- Youtube-commons: A massive open corpus for conversational and multimodal data
- Yodas: Youtube-oriented dataset for audio and speech
- Hugging Face ASR Leaderboard
- MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages
- Granary: Speech Recognition and Translation Dataset in 25 European Languages