---
license: apache-2.0
tags:
- text-to-speech
- tts
- voice-cloning
- speech-synthesis
- pytorch
- audio
- chinese
- english
- zero-shot
- diffusion
library_name: transformers
pipeline_tag: text-to-speech
---
# MegaTTS3-WaveVAE: Complete Voice Cloning Model
<div align="center">
<h3><a href="https://github.com/Saganaki22/MegaTTS3-WaveVAE">GitHub Repository</a></h3>
<img src="https://img.shields.io/github/stars/Saganaki22/MegaTTS3-WaveVAE?style=social" alt="GitHub Stars">
<img src="https://img.shields.io/badge/License-Apache%202.0-blue.svg" alt="License">
<img src="https://img.shields.io/badge/Platform-Windows-blue" alt="Platform">
<img src="https://img.shields.io/badge/Language-Chinese%20%7C%20English-red" alt="Language">
</div>
## About
This is a **complete MegaTTS3 model** with **WaveVAE support** for zero-shot voice cloning. Unlike the original ByteDance release, this includes the full WaveVAE encoder/decoder, enabling direct voice cloning from audio samples.
**Key Features:**
- Zero-shot voice cloning from any 3-24 second audio sample
- Bilingual: Chinese, English, and code-switching
- Efficient: 0.45B-parameter diffusion transformer
- Complete: includes the WaveVAE encoder/decoder (missing from the original release)
- Controllable: adjustable voice similarity and clarity
- Windows-ready: one-click installer available
## Quick Start
### Installation
**[One-Click Windows Installer](https://github.com/Saganaki22/MegaTTS3-WaveVAE/releases/tag/Installer)** - Automated setup with GPU detection
Or see [manual installation](https://github.com/Saganaki22/MegaTTS3-WaveVAE#installation) for advanced users.
### Usage Examples
```bash
# Basic voice cloning
python tts/infer_cli.py --input_wav 'reference.wav' --input_text "Your text here" --output_dir ./output

# Better quality settings
python tts/infer_cli.py --input_wav 'reference.wav' --input_text "Your text here" --output_dir ./output --p_w 2.0 --t_w 3.0

# Web interface (easiest)
python tts/megatts3_gradio.py
# Then open http://localhost:7929
```
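The reference clip should be roughly 3-24 seconds long (see Key Features above). A minimal preparation sketch, assuming `ffmpeg` is installed; `long_recording.wav` is a hypothetical source file:

```bash
# Trim the first 20 seconds of a longer recording into a reference clip
ffmpeg -i long_recording.wav -ss 0 -t 20 reference.wav
```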
## Model Components
- **Diffusion Transformer**: 0.45B parameter TTS model
- **WaveVAE**: High-quality audio encoder/decoder
- **Aligner**: Speech-text alignment model
- **G2P**: Grapheme-to-phoneme converter
## Parameters
- `--p_w` (intelligibility weight): 1.0-5.0; higher values give clearer speech
- `--t_w` (similarity weight): 0.0-10.0; higher values sound closer to the reference voice
- **Tip**: Set `--t_w` 0-3 points higher than `--p_w` (see the example below)
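Following that tip, a sketch reusing the CLI flags shown above; the specific values are illustrative rather than tuned defaults:

```bash
# Noisier reference: raise intelligibility (p_w) and keep similarity (t_w) about 2 points higher
python tts/infer_cli.py --input_wav 'reference.wav' --input_text "Your text here" --output_dir ./output --p_w 3.0 --t_w 5.0
```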
## Requirements
- Windows 10/11 or Linux
- Python 3.10
- 8GB+ RAM, NVIDIA GPU recommended
- 5GB+ storage space
## Credits
- **Original MegaTTS3**: [ByteDance Research](https://github.com/bytedance/MegaTTS3)
- **WaveVAE Model**: [ACoderPassBy/MegaTTS-SFT](https://modelscope.cn/models/ACoderPassBy/MegaTTS-SFT) [Apache 2.0]
- **Additional Components**: [mrfakename/MegaTTS3-VoiceCloning](https://huggingface.co/mrfakename/MegaTTS3-VoiceCloning)
- **Windows Implementation & Complete Package**: [Saganaki22/MegaTTS3-WaveVAE](https://github.com/Saganaki22/MegaTTS3-WaveVAE)
- **Special Thanks**: MysteryShack on Discord for model information
## Citation
If you use this model, please cite the original research:
```bibtex
@article{jiang2025sparse,
title={Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis},
author={Jiang, Ziyue and Ren, Yi and Li, Ruiqi and Ji, Shengpeng and Ye, Zhenhui and Zhang, Chen and Jionghao, Bai and Yang, Xiaoda and Zuo, Jialong and Zhang, Yu and others},
journal={arXiv preprint arXiv:2502.18924},
year={2025}
}
@article{ji2024wavtokenizer,
title={Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling},
author={Ji, Shengpeng and Jiang, Ziyue and Wang, Wen and Chen, Yifu and Fang, Minghui and Zuo, Jialong and Yang, Qian and Cheng, Xize and Wang, Zehan and Li, Ruiqi and others},
journal={arXiv preprint arXiv:2408.16532},
year={2024}
}
```
---
*High-quality voice cloning for research and creative applications. Please use responsibly.*