EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting

Overview

EmoVoice is a emotion-controllable TTS model that exploits large language models (LLMs) to enable fine-grained freestyle natural language emotion control. EmoVoice achieves SOTA performance on English EmoVoice-DB and Chinese Secap test sets.

Checkpoints

English model checkpoints of EmoVoice(0.5B), EmoVoice(1.5B) and EmoVoice-PP(0.5B) are uploaded.
Qwen2.5-0.5B-phn, the Qwen2.5-0.5B tokenizer with a phoneme-extended vocabulary, is uploaded.

Citation

If our work is useful for you, please cite as:

@article{yang2025emovoice,
  title={EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting},
  author={Yang, Guanrou and Yang, Chen and Chen, Qian and Ma, Ziyang and Chen, Wenxi and Wang, Wen and Wang, Tianrui and Yang, Yifan and Niu, Zhikang and Liu, Wenrui and others},
  journal={arXiv preprint arXiv:2504.12867},
  year={2025}
}

Downloads last month: 29

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support