nexaml/parakeet-tdt-0.6b-v2-MLX

Quickstart

With nexa-sdk installed, run the model directly from the nexa-sdk CLI:

nexaml/parakeet-tdt-0.6b-v2-MLX
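A minimal shell sketch of the quickstart above. The package name and the `infer` subcommand are assumptions, not confirmed by this card — check `nexa --help` for the exact invocation in your installed version of nexa-sdk.

```shell
# Install nexa-sdk (package name assumed; see the Nexa docs for your platform)
pip install nexa-sdk

# Pull and run the model from the CLI (subcommand name is an assumption)
nexa infer nexaml/parakeet-tdt-0.6b-v2-MLX
```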

Overview

parakeet-tdt-0.6b-v2 is a 600-million-parameter automatic speech recognition (ASR) model designed for high-quality English transcription, with support for punctuation, capitalization, and accurate timestamp prediction. Try the demo here: https://huggingface.co/spaces/nvidia/parakeet-tdt-0.6b-v2

This XL variant of the FastConformer architecture integrates the TDT decoder and is trained with full attention, enabling efficient transcription of audio segments up to 24 minutes in a single pass. The model achieves an RTFx of 3380 on the HF-Open-ASR leaderboard with a batch size of 128. Note: RTFx performance may vary depending on dataset audio duration and batch size.
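RTFx is the inverse real-time factor: seconds of audio transcribed per second of wall-clock compute, so higher is faster. A small sketch of what an RTFx of 3380 implies for the 24-minute segments mentioned above (the helper function is illustrative, not part of any SDK):

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: audio duration / wall-clock processing time."""
    return audio_seconds / processing_seconds

# At an RTFx of 3380, a 24-minute clip (1440 s of audio) would take roughly
# 1440 / 3380 ≈ 0.43 s of compute under the leaderboard's batched setup.
print(round(1440 / 3380, 2))
```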

Key Features

  • Accurate word-level timestamp predictions
  • Automatic punctuation and capitalization
  • Robust performance on spoken numbers and song-lyrics transcription

For more information, refer to the Model Architecture section and the NeMo documentation.

This model is ready for commercial/non-commercial use.

Benchmark Results

Huggingface Open-ASR-Leaderboard Performance

The performance of Automatic Speech Recognition (ASR) models is measured using Word Error Rate (WER). Given that this model is trained on a large and diverse dataset spanning multiple domains, it is generally more robust and accurate across various types of audio.
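WER is the word-level edit distance (substitutions + deletions + insertions) between the model's hypothesis and the reference transcript, divided by the number of reference words. A minimal sketch of the standard dynamic-programming computation (evaluation toolkits such as jiwer do the same with extra text normalization):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: Levenshtein distance over word tokens,
    normalized by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion over 6 words
```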

Base Performance

The table below summarizes the WER (%) using a Transducer decoder with greedy decoding (without an external language model):

| Model | Avg WER | AMI | Earnings-22 | GigaSpeech | LS test-clean | LS test-other | SPGI Speech | TEDLIUM-v3 | VoxPopuli |
|---|---|---|---|---|---|---|---|---|---|
| parakeet-tdt-0.6b-v2 | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 |

Noise Robustness

Performance across different Signal-to-Noise Ratios (SNR) using MUSAN music and noise samples:

| SNR Level | Avg WER | AMI | Earnings | GigaSpeech | LS test-clean | LS test-other | SPGI | Tedlium | VoxPopuli | Relative Change |
|---|---|---|---|---|---|---|---|---|---|---|
| Clean | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 | - |
| SNR 50 | 6.04 | 11.11 | 11.12 | 9.74 | 1.70 | 3.18 | 2.18 | 3.34 | 5.98 | +0.25% |
| SNR 25 | 6.50 | 12.76 | 11.50 | 9.98 | 1.78 | 3.63 | 2.54 | 3.46 | 6.34 | -7.04% |
| SNR 5 | 8.39 | 19.33 | 13.83 | 11.28 | 2.36 | 5.50 | 3.91 | 3.91 | 6.96 | -38.11% |

Telephony Audio Performance

Performance comparison between standard 16kHz audio and telephony-style audio (using μ-law encoding with 16kHz→8kHz→16kHz conversion):

| Audio Format | Avg WER | AMI | Earnings | GigaSpeech | LS test-clean | LS test-other | SPGI | Tedlium | VoxPopuli | Relative Change |
|---|---|---|---|---|---|---|---|---|---|---|
| Standard 16kHz | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 | - |
| μ-law 8kHz | 6.32 | 11.98 | 11.16 | 10.02 | 1.78 | 3.52 | 2.20 | 3.38 | 6.52 | -4.10% |
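A minimal sketch of the μ-law companding used in telephony audio, to illustrate what the degraded condition above simulates. Note that the actual evaluation presumably also applied 8-bit quantization and the 16kHz→8kHz→16kHz resampling (as in ITU-T G.711 telephony), which this continuous-valued sketch omits; the constant `MU = 255` is the standard G.711 value:

```python
import math

MU = 255  # μ-law companding constant (ITU-T G.711 uses μ = 255)

def mu_law_encode(x: float) -> float:
    """Compress a linear sample in [-1, 1] with μ-law companding."""
    sign = 1.0 if x >= 0 else -1.0
    return sign * math.log1p(MU * abs(x)) / math.log1p(MU)

def mu_law_decode(y: float) -> float:
    """Invert μ-law companding back to a linear sample."""
    sign = 1.0 if y >= 0 else -1.0
    return sign * ((1 + MU) ** abs(y) - 1) / MU

# The continuous round-trip is lossless; quantizing y to 8 bits (as
# telephony does) is what introduces the distortion measured above.
x = 0.5
print(abs(mu_law_decode(mu_law_encode(x)) - x) < 1e-9)
```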

These WER scores were obtained using greedy decoding without an external language model. Additional evaluation details are available on the Hugging Face ASR Leaderboard.
