---
license: apache-2.0
tags:
- text-to-speech
- tts
- voice-cloning
- speech-synthesis
- pytorch
- audio
- chinese
- english
- zero-shot
- diffusion
library_name: transformers
pipeline_tag: text-to-speech
---
# MegaTTS3-WaveVAE: Complete Voice Cloning Model
<div align="center">
<h3><a href="https://github.com/Saganaki22/MegaTTS3-WaveVAE">GitHub Repository</a></h3>
<img src="https://img.shields.io/github/stars/Saganaki22/MegaTTS3-WaveVAE?style=social" alt="GitHub Stars">
<img src="https://img.shields.io/badge/License-Apache%202.0-blue.svg" alt="License">
<img src="https://img.shields.io/badge/Platform-Windows-blue" alt="Platform">
<img src="https://img.shields.io/badge/Language-Chinese%20%7C%20English-red" alt="Language">
</div>
## About
This is a **complete MegaTTS3 model** with **WaveVAE support** for zero-shot voice cloning. Unlike the original ByteDance release, this includes the full WaveVAE encoder/decoder, enabling direct voice cloning from audio samples.
**Key Features:**
- Zero-shot voice cloning from any 3-24 second audio sample
- Bilingual: Chinese, English, and code-switching
- Efficient: 0.45B-parameter diffusion transformer
- Complete: includes the WaveVAE encoder/decoder (missing from the original release)
- Controllable: adjustable voice similarity and clarity
- Windows-ready: one-click installer available
## Quick Start
### Installation
**[One-Click Windows Installer](https://github.com/Saganaki22/MegaTTS3-WaveVAE/releases/tag/Installer)** - Automated setup with GPU detection
Or see [manual installation](https://github.com/Saganaki22/MegaTTS3-WaveVAE#installation) for advanced users.
### Usage Examples
```bash
# Basic voice cloning
python tts/infer_cli.py --input_wav 'reference.wav' --input_text "Your text here" --output_dir ./output

# Better quality settings
python tts/infer_cli.py --input_wav 'reference.wav' --input_text "Your text here" --output_dir ./output --p_w 2.0 --t_w 3.0

# Web interface (easiest)
python tts/megatts3_gradio.py
# Then open http://localhost:7929
```
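The reference clip should be roughly 3-24 seconds long (see Key Features above). A minimal preparation sketch, assuming `ffmpeg` is installed; `long_recording.wav` is a hypothetical source file:

```bash
# Trim the first 20 seconds of a longer recording into a reference clip
ffmpeg -i long_recording.wav -ss 0 -t 20 reference.wav
```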
## Model Components
- **Diffusion Transformer**: 0.45B parameter TTS model
- **WaveVAE**: High-quality audio encoder/decoder
- **Aligner**: Speech-text alignment model
- **G2P**: Grapheme-to-phoneme converter
## Parameters
- `--p_w` (intelligibility weight): 1.0-5.0; higher values give clearer speech
- `--t_w` (similarity weight): 0.0-10.0; higher values sound closer to the reference voice
- **Tip**: Set `--t_w` 0-3 points higher than `--p_w` (see the example below)
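Following that tip, a sketch reusing the CLI flags shown above; the specific values are illustrative rather than tuned defaults:

```bash
# Noisier reference: raise intelligibility (p_w) and keep similarity (t_w) about 2 points higher
python tts/infer_cli.py --input_wav 'reference.wav' --input_text "Your text here" --output_dir ./output --p_w 3.0 --t_w 5.0
```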
## Requirements
- Windows 10/11 or Linux
- Python 3.10
- 8GB+ RAM, NVIDIA GPU recommended
- 5GB+ storage space
## Credits
- **Original MegaTTS3**: [ByteDance Research](https://github.com/bytedance/MegaTTS3)
- **WaveVAE Model**: [ACoderPassBy/MegaTTS-SFT](https://modelscope.cn/models/ACoderPassBy/MegaTTS-SFT) [Apache 2.0]
- **Additional Components**: [mrfakename/MegaTTS3-VoiceCloning](https://huggingface.co/mrfakename/MegaTTS3-VoiceCloning)
- **Windows Implementation & Complete Package**: [Saganaki22/MegaTTS3-WaveVAE](https://github.com/Saganaki22/MegaTTS3-WaveVAE)
- **Special Thanks**: MysteryShack on Discord for model information
## Citation
If you use this model, please cite the original research:
```bibtex
@article{jiang2025sparse,
title={Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis},
author={Jiang, Ziyue and Ren, Yi and Li, Ruiqi and Ji, Shengpeng and Ye, Zhenhui and Zhang, Chen and Jionghao, Bai and Yang, Xiaoda and Zuo, Jialong and Zhang, Yu and others},
journal={arXiv preprint arXiv:2502.18924},
year={2025}
}
@article{ji2024wavtokenizer,
title={Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling},
author={Ji, Shengpeng and Jiang, Ziyue and Wang, Wen and Chen, Yifu and Fang, Minghui and Zuo, Jialong and Yang, Qian and Cheng, Xize and Wang, Zehan and Li, Ruiqi and others},
journal={arXiv preprint arXiv:2408.16532},
year={2024}
}
```
---
*High-quality voice cloning for research and creative applications. Please use responsibly.*