YAML Metadata Warning: empty or missing yaml metadata in repo card (https://huggingface.co/docs/hub/model-cards#model-card-metadata)

U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation

News

This paper is currently under review. We have released the checkpoint of U-Codec (5Hz), which can be directly used for inference.

To do list

Provide the full training code for the U-Codec framework.
Release the public code of the TTS models built on top of U-Codec.

If you are interested in U-Codec, feel free to contact us!

Overview

We propose U-Codec, an Ultra low frame-rate neural speech Codec that achieves high-fidelity reconstruction and fast generation via an extremely frame-rate at 5Hz (5 frames per second). Extreme compression at 5Hz typically leads to severe intelligibility and spectral detail loss, we overcome this by integrating a Transformer-based inter-frame long-term dependency module and systematically optimizing residual vector quantization (RVQ) depth and codebook size. Moreover, we apply U-Codec into a large language model (LLM)-based auto-regressive TTS model, which leverages global and local hierarchical architecture to effectively capture dependencies across multi-layer tokens.

The overview of U-Codec as following picture shows.

How to inference U-Codec

We provide an example to demonstrate how to run U-Codec (5Hz) for audio tokenization and reconstruction.

Environment Setup

First, create a Python environment following a similar setup to project page.

conda create -n ucodec python=3.8
conda init
source ~/.bashrc
conda activate ucodec

Then:

cd U-Codec
bash requirements.sh

Run Inference

If you need pretrained weights, please download them on the Checkpoint.

We provide an example script AudioTokenizer_UCodec.py for tokenizing audio into discrete codes and reconstructing audio from the codes.

cd tools/tokenizer/soundstream
python AudioTokenizer_HY.py

You can directly use the released U-Codec 5Hz checkpoint for inference. More examples (e.g., TTS pipeline integration) will be released soon.

Citation

If you find this code useful in your research, please cite our work and give us a star

@inproceedings{U-Codec,
  title     = {U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation},
  author    = {Xusheng Yang, Long Zhou, Wenfu Wang, Kai Hu, Shulin Feng, Chenxing Li, Meng Yu, Dong Yu, Yuexian Zou},
  booktitle = {arXiv},
  year      = {2025}
}

Contact us

If you have any problem about the our code, please contact Xusheng ([email protected]).

License

You can use the code under MIT license.

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support