FireRedTTS-1S: An Upgraded Streamable Foundation Text-to-Speech System

👉🏻 FireRedTTS-1S Paper 👈🏻

👉🏻 FireRedTTS-1S Demos 👈🏻

News

[2025/05/26] 🔥 We added a flow-matching decoder and updated the technical report
[2025/03/25] 🔥 We released the technical report and project page

Roadmap

2025/04
- Release the pre-trained checkpoints and inference code.

Usage

Clone and install

Clone the repo

git clone https://github.com/FireRedTeam/FireRedTTS.git
cd FireRedTTS

Create conda env

# step1.create and activate env
conda create --name redtts python=3.10
conda activate redtts

# step2.install torch (the pytorch build should match the CUDA version on your machine)
# CUDA 11.8
conda install pytorch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 pytorch-cuda=11.8 -c pytorch -c nvidia
# CUDA 12.1
conda install pytorch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 pytorch-cuda=12.1 -c pytorch -c nvidia

# step3.install fireredtts from source
cd fireredtts
pip install -e . 

# step4.install other requirements
pip install -r requirements.txt
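
To confirm that the installed PyTorch build matches the CUDA version on your machine, a quick check along these lines may help. This is a minimal sketch, not part of the FireRedTTS code base: `describe_torch_install` is a hypothetical helper name, and it only reads standard PyTorch attributes.

```python
def describe_torch_install():
    """Summarize the installed PyTorch build, or return None if torch is missing.

    Hypothetical helper for illustration; it is not part of FireRedTTS.
    """
    try:
        import torch
    except ImportError:
        return None
    return {
        "torch_version": torch.__version__,
        # CUDA version the wheel was built for (None on CPU-only builds)
        "built_for_cuda": torch.version.cuda,
        # True only if a usable GPU and driver are visible at runtime
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    print(describe_torch_install())
```

If `built_for_cuda` is `None` or `cuda_available` is `False` on a GPU machine, the conda command used in step2 probably did not match the local CUDA setup.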

Download models

Download the required model files from Model_Lists and place them in the folder pretrained_models
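
Before running inference, it can be useful to confirm that the download actually landed in the expected folder. The sketch below only lists whatever files are present; it assumes nothing about the checkpoint file names, and `list_model_files` is a hypothetical helper, not part of the FireRedTTS API.

```python
from pathlib import Path


def list_model_files(model_dir="pretrained_models"):
    """Return the sorted relative paths of all regular files under model_dir.

    Raises FileNotFoundError if the folder is missing or contains no files.
    """
    root = Path(model_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"model directory not found: {root}")
    files = sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )
    if not files:
        raise FileNotFoundError(f"model directory is empty: {root}")
    return files


if __name__ == "__main__":
    for name in list_model_files():
        print(name)
```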

Basic Usage

import os
import torchaudio
from fireredtts.fireredtts import FireRedTTS

# acoustic llm decoder
# pretrained_path is the folder that holds the downloaded model files
tts = FireRedTTS(
    config_path="configs/config_24k.json",
    pretrained_path="pretrained_models",
)


# flow matching decoder (uncomment to use it instead of the acoustic llm decoder)
# tts = FireRedTTS(
#     config_path="configs/config_24k_flow.json",
#     pretrained_path="pretrained_models",
# )

# same language: the prompt and the target text are both Chinese
# For the test-hard evaluation, we enabled the use_tn=True configuration setting.
rec_wavs = tts.synthesize(
  prompt_wav="examples/prompt_1.wav",
  prompt_text="对，所以说你现在的话，这个账单的话，你既然说能处理，那你就想办法处理掉。",
  text="小红书，是中国大陆的网络购物和社交平台，成立于二零一三年六月。",
  lang="zh",
  use_tn=True
)




# move the synthesized waveform to the CPU and save it at 24 kHz
rec_wavs = rec_wavs.detach().cpu()
out_wav_path = os.path.join("./example.wav")
torchaudio.save(out_wav_path, rec_wavs, 24000)

Tips

The reference audio should not be too long or too short; a duration of 3 to 10 seconds is recommended.
The reference audio should be smooth and natural, and the accompanying text must be accurate to enhance the stability and naturalness of the synthesized audio.
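
The duration recommendation above can be checked before synthesis. The following is a minimal sketch using only the Python standard library; the helper names and constants are hypothetical, and the `wave` module only reads uncompressed PCM WAV files, so other formats would need a different loader.

```python
import wave

# Recommended prompt length from the tips above, in seconds
MIN_SECONDS = 3.0
MAX_SECONDS = 10.0


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_recommended_prompt_length(path, min_seconds=MIN_SECONDS, max_seconds=MAX_SECONDS):
    """Return True if the file's duration lies within the recommended range."""
    return min_seconds <= wav_duration_seconds(path) <= max_seconds


if __name__ == "__main__":
    prompt = "examples/prompt_1.wav"
    print(prompt, wav_duration_seconds(prompt), is_recommended_prompt_length(prompt))
```

A prompt that falls outside the range can be trimmed or replaced before it is passed as `prompt_wav`.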

⚠️ Usage Disclaimer ❗️❗️❗️❗️❗️❗️

The project incorporates zero-shot voice cloning functionality. Please note that this capability is intended solely for academic research purposes.
DO NOT use this model for ANY illegal activities❗️❗️❗️❗️❗️❗️
The developers assume no liability for any misuse of this model.
If you identify any instances of abuse, misuse, or fraudulent activities related to this project, please report them to our team immediately.
