MegaTTS3-WaveVAE: Complete Voice Cloning Model

About

This is a complete MegaTTS3 model with WaveVAE support for zero-shot voice cloning. Unlike the original ByteDance release, this includes the full WaveVAE encoder/decoder, enabling direct voice cloning from audio samples.

Key Features:

  • 🎯 Zero-shot voice cloning from any 3-24 second audio sample
  • 🌍 Bilingual: Chinese, English, and code-switching
  • ⚡ Efficient: 0.45B parameter diffusion transformer
  • 🔧 Complete: Includes WaveVAE (missing from original)
  • 🎛️ Controllable: Adjustable voice similarity and clarity
  • 💻 Windows ready: One-click installer available
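
Since cloning expects a 3-24 second reference sample, it can help to check a clip's length before running inference. The following is a minimal sketch using only Python's standard `wave` module; it assumes the reference is an uncompressed PCM WAV file, and the function names are illustrative, not part of MegaTTS3.

```python
import wave

MIN_SECONDS = 3.0   # lower bound from the feature list above
MAX_SECONDS = 24.0  # upper bound from the feature list above


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_usable_reference(path):
    """Return True if the clip length falls inside the 3-24 second window."""
    return MIN_SECONDS <= wav_duration_seconds(path) <= MAX_SECONDS
```

For example, `is_usable_reference("reference.wav")` returns False for a one-second clip, which would be a signal to record a longer sample.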

Quick Start

Installation

📥 One-Click Windows Installer - Automated setup with GPU detection

Or see manual installation for advanced users.

Usage Examples

# Basic voice cloning
python tts/infer_cli.py --input_wav 'reference.wav' --input_text "Your text here" --output_dir ./output

# Better quality settings
python tts/infer_cli.py --input_wav 'reference.wav' --input_text "Your text here" --output_dir ./output --p_w 2.0 --t_w 3.0

# Web interface (easiest)
python tts/megatts3_gradio.py
# Then open http://localhost:7929
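
To clone several sentences with the same reference voice, the CLI call above can be wrapped in a short script. This is a sketch, not part of the repository: it uses only the flags shown in the examples, and it assumes the script is run from the repository root. Writing each sentence to its own subdirectory is a precaution, since how the CLI names its output files is not stated here.

```python
import os
import subprocess
import sys


def build_command(input_wav, input_text, output_dir, p_w=None, t_w=None):
    """Assemble the infer_cli.py argument list from the documented flags."""
    cmd = [
        sys.executable, "tts/infer_cli.py",
        "--input_wav", input_wav,
        "--input_text", input_text,
        "--output_dir", output_dir,
    ]
    # The weights are optional; omit them to use the CLI defaults.
    if p_w is not None:
        cmd += ["--p_w", str(p_w)]
    if t_w is not None:
        cmd += ["--t_w", str(t_w)]
    return cmd


def clone_lines(input_wav, lines, output_root, p_w=None, t_w=None):
    """Run one inference per sentence, each in its own output folder."""
    for index, text in enumerate(lines):
        # Hypothetical layout: output_root/line_000, line_001, ...
        output_dir = os.path.join(output_root, f"line_{index:03d}")
        os.makedirs(output_dir, exist_ok=True)
        subprocess.run(
            build_command(input_wav, text, output_dir, p_w, t_w),
            check=True,  # stop on the first failed run
        )
```

A call such as `clone_lines("reference.wav", ["First sentence.", "Second sentence."], "./output", p_w=2.0, t_w=3.0)` would then issue the "better quality" command once per sentence.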

Model Components

  • Diffusion Transformer: 0.45B parameter TTS model
  • WaveVAE: High-quality audio encoder/decoder
  • Aligner: Speech-text alignment model
  • G2P: Grapheme-to-phoneme converter

Parameters

  • --p_w (Intelligibility): 1.0-5.0, higher = clearer speech
  • --t_w (Similarity): 0.0-10.0, higher = more similar to reference
  • Tip: Set t_w 0-3 points higher than p_w
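
The ranges and the tip above can be expressed as a small sanity check to run before inference. This is an illustrative helper, not something the model ships with; it treats the listed ranges as inclusive and reads the tip as "t_w minus p_w should be between 0 and 3", both of which are assumptions.

```python
P_W_RANGE = (1.0, 5.0)    # intelligibility weight range listed above
T_W_RANGE = (0.0, 10.0)   # similarity weight range listed above
MAX_GAP = 3.0             # tip: t_w should be 0-3 points above p_w


def check_weights(p_w, t_w):
    """Return a list of warnings; an empty list means the pair looks fine."""
    problems = []
    if not P_W_RANGE[0] <= p_w <= P_W_RANGE[1]:
        problems.append(f"p_w={p_w} is outside {P_W_RANGE[0]}-{P_W_RANGE[1]}")
    if not T_W_RANGE[0] <= t_w <= T_W_RANGE[1]:
        problems.append(f"t_w={t_w} is outside {T_W_RANGE[0]}-{T_W_RANGE[1]}")
    gap = t_w - p_w
    if not 0.0 <= gap <= MAX_GAP:
        problems.append(f"t_w - p_w = {gap} is outside 0-{MAX_GAP}")
    return problems
```

With the "better quality" values from the usage examples, `check_weights(2.0, 3.0)` returns an empty list.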

Requirements

  • Windows 10/11 or Linux
  • Python 3.10
  • 8GB+ RAM, NVIDIA GPU recommended
  • 5GB+ storage space
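
Two of these requirements can be checked with the standard library before installing. The sketch below is an assumption-laden convenience, not an official preflight script: it reads "5GB+" as 5 GiB of free space and does not check RAM or GPU, which would need third-party packages.

```python
import shutil
import sys

REQUIRED_PYTHON = (3, 10)         # Python 3.10, as listed above
MIN_FREE_BYTES = 5 * 1024 ** 3    # "5GB+" interpreted as 5 GiB


def python_ok(version=None):
    """Return True if the interpreter (or given version tuple) is 3.10.x."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) == REQUIRED_PYTHON


def storage_ok(path=".", free_bytes=None):
    """Return True if at least 5 GiB is free on the drive holding `path`."""
    if free_bytes is None:
        free_bytes = shutil.disk_usage(path).free
    return free_bytes >= MIN_FREE_BYTES


if __name__ == "__main__":
    print("Python 3.10:", python_ok())
    print("5 GiB free:", storage_ok())
```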

Credits

Citation

If you use this model, please cite the original research:

@article{jiang2025sparse,
  title={Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis},
  author={Jiang, Ziyue and Ren, Yi and Li, Ruiqi and Ji, Shengpeng and Ye, Zhenhui and Zhang, Chen and Jionghao, Bai and Yang, Xiaoda and Zuo, Jialong and Zhang, Yu and others},
  journal={arXiv preprint arXiv:2502.18924},
  year={2025}
}

@article{ji2024wavtokenizer,
  title={Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling},
  author={Ji, Shengpeng and Jiang, Ziyue and Wang, Wen and Chen, Yifu and Fang, Minghui and Zuo, Jialong and Yang, Qian and Cheng, Xize and Wang, Zehan and Li, Ruiqi and others},
  journal={arXiv preprint arXiv:2408.16532},
  year={2024}
}

High-quality voice cloning for research and creative applications. Please use responsibly.
