Higgs Audio Tokenizer
Check our open-source repository https://github.com/boson-ai/higgs-audio for more details!
We introduce a new discretized audio tokenizer that runs at just 25 frames per second while matching, or even improving on, the audio quality of tokenizers with twice the bitrate. Our model is the first trained on 24 kHz data covering speech, music, and sound events in one unified system, and it uses a simple non-diffusion encoder/decoder for fast batch inference.
Basics of Audio Quantization
An audio signal sampled at $f_s$ Hz is first split into frames by an encoder with hop size $h$, giving a frame rate $f_r = f_s / h$ (frames per second). Two common quantizers are:
- Residual Vector Quantization (RVQ): $N_q$ cascaded vector‑quantizer layers, each with codebook size $K$. When $N_q = 1$, it degenerates to ordinary vector quantization.
- Finite Scalar Quantization (FSQ): A single-layer scalar quantizer in which each of the $d$ scalar coefficients is independently mapped to one of $L$ discrete levels.
If every combination of codewords is a token, the vocabulary size is $V = K^{N_q}$ for RVQ and $V = L^{d}$ for FSQ, and each token needs $\log_2 V$ bits. The overall bitrate (bits/s, BPS) is simply $\mathrm{BPS} = f_r \cdot \log_2 V$.
We aim to push this bitrate as low as possible without hurting audio fidelity.
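For intuition, the bitrate formula can be evaluated directly. Below is a minimal Python sketch using illustrative RVQ and FSQ configurations (not our model's actual settings):

```python
import math

def bitrate_bps(frame_rate: float, vocab_size: int) -> float:
    """Bitrate in bits/s: frame rate times bits per token (log2 of vocab size)."""
    return frame_rate * math.log2(vocab_size)

# RVQ: N_q codebooks of size K -> vocabulary size K ** N_q.
# Illustrative config (not ours): 8 codebooks of 1024 entries at 75 fps.
rvq = bitrate_bps(frame_rate=75, vocab_size=1024 ** 8)   # 6000.0 bps

# FSQ: d scalar coefficients with L levels each -> vocabulary size L ** d.
# Illustrative config (not ours): 8 dims with 5 levels at 25 fps.
fsq = bitrate_bps(frame_rate=25, vocab_size=5 ** 8)      # ~464.4 bps

print(f"RVQ: {rvq / 1000:.1f} kbps, FSQ: {fsq / 1000:.2f} kbps")
```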
What Makes Ours Better
- Low Frame Rate: Runs at just 25 fps, halving the frame rate of many baselines while preserving high audio quality.
- Unified 24 kHz Training: A single model jointly trained on speech, music, and sound‑event data, capturing both semantic and acoustic nuances and greatly simplifying downstream audio‑language‑model training.
- Fast Inference: A non‑diffusion encoder/decoder that processes batches quickly, making it practical for real-time or large-scale tasks.
Evaluation Data and Metrics
We test on four subsets:
- Speech, Music, and Sound Event: 1,000 clips per category, each 10 seconds long, randomly sampled from DAPS (Speech), MUSDB (Music), and AudioSet (Sound Event).
- Audiophile: 150 clips, each 30 seconds long, curated from eleven high-fidelity test discs designed for perceptual listening tests. The clips feature both high-quality music and sound events.
We measure:
- Acoustic Quality: Reconstruction error (STFT distance) between the original and reconstructed audio.
- Semantic Integrity: Degree of semantic preservation, evaluated on the English and Chinese subsets of SeedTTS[15].
- Aesthetics: Model-based aesthetic quality scores computed with Meta's Audiobox Aesthetics[8], a state-of-the-art unified quality predictor.
We compare our tokenizer with a wide range of baselines: from tokenizers built mainly for acoustic reconstruction and compression rate, to those focused on semantic integrity, to tokenizers used in existing large audio language models. We also compare with tokenizers pretrained specifically on speech or on music.
The tables below summarize the tokenizers evaluated. As shown, our tokenizer achieves a well-rounded balance of efficiency, semantic fidelity, and acoustic quality.
Acoustic Evaluation
This table reports the Short‑Time Fourier Transform (STFT) distance between the original and reconstructed audio (lower is better). Baselines are listed chronologically and grouped by whether semantic distillation (SD) is applied; 💬, 🎵, and 🥁 mark training data covering speech, music, and sound events, and SR/FPS give the sample rate and frame rate. DAC attains the best acoustic quality, but at 12× our bitrate; our tokenizer leads all other baselines.
Tokenizer | 💬 | 🎵 | 🥁 | SD | SR (kHz) | FPS | BPS* (k) ↓ | Speech ↓ | Sound Event ↓ | Music ↓ | Audiophile ↓
---|---|---|---|---|---|---|---|---|---|---|---
Encodec[3] | ✓ | ✓ | ✓ | | 24 | 75 | 24 | 1.96 | 2.65 | 2.52 | 2.30
DAC[2] | ✓ | ✓ | ✓ | | 24 | 75 | 24 | 1.13 | 1.45 | 1.34 | 1.62
SNAC-24k[6] | ✓ | | | | 24 | (12, 23, 47) | 0.98 | 1.92 | 2.69 | 2.54 | 2.52
SNAC-44k[6] | | ✓ | ✓ | | 44.1 | (14, 29, 57, 115) | 2.6 | 1.83 | 2.25 | 2.05 | 2.00
WavTokenizer[7] | | ✓ | ✓ | | 24 | 75 | 0.9 | 1.93 | 2.44 | 2.17 | 2.15
WavTokenizer (Speech)[7] | ✓ | | | | 24 | 75 | 0.9 | 1.78 | 2.47 | 2.42 | 2.47
MuCodec[11] | | ✓ | | | 48 | 25 | 0.35 | 2.87 | 3.69 | 3.36 | 2.97
FlowDec-75m[12] | ✓ | ✓ | ✓ | | 48 | 75 | 7.5 | 1.73 | 2.14 | 2.01 | 2.03
FlowDec-25s[12] | ✓ | ✓ | ✓ | | 48 | 25 | 4 | 1.94 | 2.42 | 2.25 | 2.33
SpeechTokenizer[14] | ✓ | | | ✓ | 16 | 50 | 4 | 3.21 | 3.58 | 3.65 | 3.69
SemantiCodec[5] | ✓ | ✓ | ✓ | ✓ | 16 | 100 | 1.35 | 3.05 | 3.28 | 3.24 | 3.18
Mimi[13] | ✓ | | | ✓ | 24 | 12.5 | 4.4 | 1.77 | 2.40 | 2.30 | 2.15
XCodec[1] | ✓ | ✓ | ✓ | ✓ | 16 | 50 | 4 | 2.95 | 3.16 | 3.00 | 3.03
CosyVoice 2[13] | ✓ | | | ✓ | 16 | 25 | -** | 2.30 | 3.30 | 3.14 | 3.25
XCodec2[9] | ✓ | | | ✓ | 16 | 50 | 0.8 | 3.06 | 3.72 | 3.62 | 3.64
XY-MOSS-TTSD[10] | ✓ | | | ✓ | 24 | 12.5 | 1 | 1.89 | 2.51 | 2.40 | 2.26
Ours | ✓ | ✓ | ✓ | ✓ | 24 | 25 | 2 | 1.62 | 2.03 | 1.85 | 1.80
* Bits per second is calculated from the checkpoints provided by the respective authors.
** CosyVoice 2 uses continuous features as conditioning, so its bitrate is not directly comparable; we include it for completeness.
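For reference, the sketch below shows one common way to compute an STFT distance between the original and reconstructed waveforms. The exact metric used in our evaluation may differ in details (e.g., multi-resolution settings), and the `n_fft`/`hop` values are illustrative:

```python
import torch

def stft_distance(ref: torch.Tensor, rec: torch.Tensor,
                  n_fft: int = 1024, hop: int = 256) -> torch.Tensor:
    """Mean L1 distance between log-magnitude spectrograms.

    ref, rec: 1-D tensors of equal length holding the original and the
    reconstructed audio at the same sample rate.
    """
    window = torch.hann_window(n_fft)

    def log_mag(x: torch.Tensor) -> torch.Tensor:
        spec = torch.stft(x, n_fft, hop_length=hop, window=window,
                          return_complex=True)
        return spec.abs().clamp_min(1e-5).log()

    return (log_mag(ref) - log_mag(rec)).abs().mean()
```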
Semantic Evaluation
SeedTTS[15] is a dataset of prompt/target audio pairs with accompanying texts. We reconstruct the target audio and evaluate semantic integrity with word error rate (WER) and speaker similarity (SIM). SIM is computed as the similarity between the prompt audio and the reconstructed target audio, using WavLM-large as the embedding model.
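As an illustration, speaker similarity can be computed as the cosine similarity between speaker embeddings. The sketch below uses the `microsoft/wavlm-base-plus-sv` checkpoint from `transformers` as a stand-in for the WavLM-large embedding model used in our evaluation:

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, WavLMForXVector

# Stand-in speaker-embedding model; our evaluation uses WavLM-large embeddings.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("microsoft/wavlm-base-plus-sv")
model = WavLMForXVector.from_pretrained("microsoft/wavlm-base-plus-sv").eval()

def speaker_sim(prompt_wav, target_wav, sr: int = 16000) -> float:
    """Cosine similarity between embeddings of two 1-D numpy waveforms."""
    inputs = extractor([prompt_wav, target_wav], sampling_rate=sr,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        emb = model(**inputs).embeddings
    emb = torch.nn.functional.normalize(emb, dim=-1)
    return torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=-1).item()
```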
The following table compares our tokenizer with semantic-distillation-trained baselines and shows that it delivers performance comparable to tokenizers operating at 2.2× our model’s bitrate.
Model | BPS (k) | en WER ↓ | en SIM ↑ | zh WER ↓ | zh SIM ↑ |
---|---|---|---|---|---|
SpeechTokenizer | 4 | 2.82 | 0.63 | 2.04 | 0.65 |
SemantiCodec | 1.35 | 3.46 | 0.56 | 2.18 | 0.60 |
Mimi | 4.4 | 2.35 | 0.70 | 1.48 | 0.72 |
XCodec | 4.0 | 2.68 | 0.63 | 1.66 | 0.66 |
CosyVoice 2 | - | 3.17 | 0.65 | 2.11 | 0.70 |
XCodec2 | 0.8 | 2.74 | 0.62 | 1.91 | 0.67 |
XY-MOSS-TTSD | 1.0 | 2.72 | 0.61 | 1.58 | 0.67 |
Ours | 2.0 | 2.52 | 0.67 | 1.48 | 0.71 |
Audiobox Aesthetics Evaluation
This model-based evaluation[8] further demonstrates the strength of our tokenizer. CE denotes Content Enjoyment and CU denotes Content Usefulness; both are rated on a 1-10 scale. Notably, our tokenizer performs best on the Audiophile set, demonstrating a clear advantage when the original audio quality is high.
Model | BPS (k) | Music CE ↑ | Music CU ↑ | Sound Event CE ↑ | Sound Event CU ↑ | Speech CE ↑ | Speech CU ↑ | Audiophile CE ↑ | Audiophile CU ↑ |
---|---|---|---|---|---|---|---|---|---|
Origin | - | 6.20 | 7.10 | 4.47 | 5.64 | 5.03 | 4.87 | 7.17 | 7.65 |
SpeechTokenizer | 4.0 | 3.55 | 5.22 | 3.03 | 4.50 | 4.68 | 4.58 | 3.59 | 5.07 |
SemantiCodec | 1.35 | 6.01 | 6.83 | 4.22 | 5.30 | 4.28 | 4.12 | 6.97 | 7.43 |
Mimi | 4.4 | 6.01 | 6.83 | 4.26 | 5.35 | 4.87 | 4.72 | 6.80 | 7.29 |
XCodec | 4.0 | 6.30 | 7.10 | 4.43 | 5.45 | 4.96 | 4.79 | 7.06 | 7.49 |
CosyVoice 2 | - | 5.21 | 6.14 | 4.08 | 4.73 | 4.91 | 4.75 | 5.97 | 6.56 |
XCodec2 | 0.8 | 4.38 | 5.66 | 3.43 | 4.63 | 4.93 | 4.78 | 4.56 | 5.46 |
XY-MOSS-TTSD | 1.0 | 5.77 | 6.80 | 4.23 | 5.34 | 4.88 | 4.72 | 6.95 | 7.48 |
Ours | 2.0 | 6.35 | 7.15 | 4.47 | 5.51 | 4.90 | 4.70 | 7.21 | 7.66 |
Note that since some tokenizers are trained on 16 kHz data, we upsample their audio outputs to 24 kHz before computing metrics. Different upsampling methods may cause slight variations (e.g., 4.36 vs. 4.43 for XCodec Sound Event CE). We report the best results we could obtain and highlight any results within 0.05 of the best one.
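For instance, a 16 kHz output can be brought to 24 kHz with a standard resampler; the file path below is hypothetical, and other resamplers (e.g., sox or librosa) can produce the slight metric differences noted above:

```python
import torchaudio

# Load a 16 kHz reconstruction (hypothetical path) and resample it to 24 kHz
# before computing the evaluation metrics.
wav, sr = torchaudio.load("reconstruction_16k.wav")
wav_24k = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=24000)
```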
References
[11] Xu, Yaoxun, et al. "MuCodec: Ultra Low-Bitrate Music Codec." arXiv preprint arXiv:2409.13216 (2024).