drbaph committed
Commit e1e4a7a · verified · 1 Parent(s): fa1a5d3

Update README.md

Files changed (1):
  1. README.md +69 -122

README.md CHANGED
@@ -1,151 +1,95 @@
  ---
- language:
- - en
- - zh
  license: apache-2.0
  pipeline_tag: text-to-speech
  ---

- Unofficial implementation version, including full parameters.

- # Model Description
- This is a ModelScope model card for MegaTTS 3 👋

- - Paper: [MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis](https://huggingface.co/papers/2502.18924)
- - Project Page (Audio Samples): <https://sditdemo.github.io/sditdemo/>
- - GitHub: <https://github.com/bytedance/MegaTTS3>
- - [Demo Video](https://github.com/user-attachments/assets/0174c111-f392-4376-a34b-0b5b8164aacc)
 
 
 
- ## Installation

- ```sh
- # Clone the repository
- git clone https://github.com/bytedance/MegaTTS3
- cd MegaTTS3
- ```
-
- **Model Download**
-
- ```sh
- modelscope download --model ACoderPassBy/MegaTTS-SFT --local_dir ./checkpoints
- ```
-
- **Requirements (for Linux)**
-
- ```sh
- # Create a Python 3.10 conda env (you could also use virtualenv)
- conda create -n megatts3-env python=3.10
- conda activate megatts3-env
- pip install -r requirements.txt
-
- # Set the root directory
- export PYTHONPATH="/path/to/MegaTTS3:$PYTHONPATH"
-
- # [Optional] Set GPU
- export CUDA_VISIBLE_DEVICES=0
-
- # If you encounter bugs with pydantic during inference, check that your pydantic and gradio versions are compatible.
- # [Note] If you encounter bugs related to httpx, check whether your "no_proxy" environment variable contains patterns like "::"
- ```
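The `PYTHONPATH` and `CUDA_VISIBLE_DEVICES` steps above can be sanity-checked before launching inference. A minimal sketch, assuming a Linux shell; `check_env` is a hypothetical helper, not part of the MegaTTS3 repo:

```python
import os

def check_env(repo_root="/path/to/MegaTTS3"):
    # Hypothetical pre-flight check (not part of MegaTTS3):
    # PYTHONPATH must contain the repo root so the `tts` package is importable;
    # CUDA_VISIBLE_DEVICES limits which GPUs PyTorch will enumerate.
    entries = [p for p in os.environ.get("PYTHONPATH", "").split(os.pathsep) if p]
    root_on_path = any(os.path.abspath(p) == os.path.abspath(repo_root)
                       for p in entries)
    visible_gpus = os.environ.get("CUDA_VISIBLE_DEVICES")  # None = all GPUs visible
    return root_on_path, visible_gpus
```

Run it once after `conda activate megatts3-env`; if the first value is `False`, `python tts/infer_cli.py` will fail with an import error.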
-
- **Requirements (for Windows)**
-
- ```sh
- # [The Windows version is currently under testing]
- # Comment out the dependency below in requirements.txt:
- # # WeTextProcessing==1.0.4.1
-
- # Create a Python 3.10 conda env (you could also use virtualenv)
- conda create -n megatts3-env python=3.10
- conda activate megatts3-env
- pip install -r requirements.txt
- conda install -y -c conda-forge pynini==2.1.5
- pip install WeTextProcessing==1.0.3
-
- # [Optional] If you want GPU inference, you may need to install a specific version of PyTorch for your GPU from https://pytorch.org/.
- pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
-
- # [Note] If you encounter bugs related to `ffprobe` or `ffmpeg`, you can install them through `conda install -c conda-forge ffmpeg`
-
- # Set environment variable for root directory
- set PYTHONPATH="C:\path\to\MegaTTS3;%PYTHONPATH%"                        # Windows (cmd)
- $env:PYTHONPATH="C:\path\to\MegaTTS3;$env:PYTHONPATH"                    # PowerShell on Windows
- conda env config vars set PYTHONPATH="C:\path\to\MegaTTS3;%PYTHONPATH%"  # For conda users
-
- # [Optional] Set GPU
- set CUDA_VISIBLE_DEVICES=0         # Windows (cmd)
- $env:CUDA_VISIBLE_DEVICES=0        # PowerShell on Windows
- ```

- **Requirements (for Docker)**
-
- ```sh
- # [The Docker version is currently under testing]
- # ! You should download the pretrained checkpoint before running the following command
- docker build . -t megatts3:latest
-
- # For GPU inference
- docker run -it -p 7929:7929 --gpus all -e CUDA_VISIBLE_DEVICES=0 megatts3:latest
- # For CPU inference
- docker run -it -p 7929:7929 megatts3:latest
-
- # Visit http://0.0.0.0:7929/ for the Gradio interface.
- ```
-
- > [!IMPORTANT]
- > Unofficial version
-
- ## Inference
-
- **Command-Line Usage (Standard)**
 
  ```bash
- # p_w (intelligibility weight), t_w (similarity weight). A noisier prompt typically requires higher p_w and t_w.
- python tts/infer_cli.py --input_wav 'assets/Chinese_prompt.wav' --input_text "另一边的桌上,一位读书人嗤之以鼻道,'佛子三藏,神子燕小鱼是什么样的人物,李家的那个李子夜如何与他们相提并论?'" --output_dir ./gen
-
- # As long as audio volume and pronunciation are appropriate, increasing --t_w within a reasonable range (2.0~5.0)
- # will increase the generated speech's expressiveness and similarity (especially for emotional cases).
- python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text 'As his long promised tariff threat turned into reality this week, top human advisers began fielding a wave of calls from business leaders, particularly in the automotive sector, along with lawmakers who were sounding the alarm.' --output_dir ./gen --p_w 2.0 --t_w 3.0
- ```
 
- **Command-Line Usage (for TTS with Accents)**

- ```bash
- # When p_w (intelligibility weight) ≈ 1.0, the generated audio closely retains the speaker's original accent. As p_w increases, pronunciation shifts toward standard.
- # t_w (similarity weight) is typically set 0–3 points higher than p_w for optimal results.
- # Useful for accented TTS or for solving accent problems in cross-lingual TTS.
- python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text '这是一条有口音的音频。' --output_dir ./gen --p_w 1.0 --t_w 3.0
-
- python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text '这条音频的发音标准一些了吗?' --output_dir ./gen --p_w 2.5 --t_w 2.5
  ```
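The (p_w, t_w) combinations used above can be kept as named presets. A small sketch using only the values from this section; the preset names are my own, not flags of `tts/infer_cli.py`:

```python
# Named (p_w, t_w) presets distilled from the examples above.
# Preset names are illustrative; the CLI only takes raw --p_w / --t_w values.
PRESETS = {
    "standard": (2.0, 3.0),                # balanced intelligibility and similarity
    "keep_accent": (1.0, 3.0),             # p_w ~1.0 retains the speaker's accent
    "standard_pronunciation": (2.5, 2.5),  # higher p_w shifts toward standard speech
}

def flags_for(preset):
    # Returns the CLI flag string for a preset, e.g. "--p_w 1.0 --t_w 3.0".
    p_w, t_w = PRESETS[preset]
    return f"--p_w {p_w} --t_w {t_w}"
```

For example, `flags_for("keep_accent")` yields the flags used in the first accented command above.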
- **Web UI Usage**
-
- ```bash
- # CPU inference is also supported, but it may take about 30 seconds (for 10 inference steps).
- python tts/gradio_api.py
- ```
- ## Security

- If you discover a potential security issue in this project, or think you may have discovered one, please notify Bytedance Security via our [security center](https://security.bytedance.com/src) or [[email protected]]([email protected]).

- Please do **not** create a public issue.

- ## License

- This project is licensed under the [Apache-2.0 License](LICENSE).
 
 
 
 
- ## BibTeX Entry and Citation Info

- This repo contains a forced-alignment version of `Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis`, and the WavVAE is mainly based on `Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling`. Compared to the model described in the paper, the repository includes additional models. These models not only enhance the stability and cloning capabilities of the algorithm, but can also be used independently in a wider range of scenarios.

- ```
  @article{jiang2025sparse,
    title={Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis},
    author={Jiang, Ziyue and Ren, Yi and Li, Ruiqi and Ji, Shengpeng and Ye, Zhenhui and Zhang, Chen and Jionghao, Bai and Yang, Xiaoda and Zuo, Jialong and Zhang, Yu and others},
@@ -159,4 +103,7 @@ This repo contains forced-align version of `Sparse Alignment Enhanced Latent Dif
  journal={arXiv preprint arXiv:2408.16532},
  year={2024}
  }
- ```

  ---
  license: apache-2.0
+ tags:
+ - text-to-speech
+ - tts
+ - voice-cloning
+ - speech-synthesis
+ - pytorch
+ - audio
+ - chinese
+ - english
+ - zero-shot
+ - diffusion
+ library_name: transformers
  pipeline_tag: text-to-speech
  ---

+ # MegaTTS3-WaveVAE: Complete Voice Cloning Model

+ <div align="center">
+ <h3>🚀 <a href="https://github.com/Saganaki22/MegaTTS3-WaveVAE">GitHub Repository</a></h3>
+
+ <img src="https://img.shields.io/github/stars/Saganaki22/MegaTTS3-WaveVAE?style=social" alt="GitHub Stars">
+ <img src="https://img.shields.io/badge/License-Apache%202.0-blue.svg" alt="License">
+ <img src="https://img.shields.io/badge/Platform-Windows-blue" alt="Platform">
+ <img src="https://img.shields.io/badge/Language-Chinese%20%7C%20English-red" alt="Language">
+ </div>

+ ## About

+ This is a **complete MegaTTS3 model** with **WaveVAE support** for zero-shot voice cloning. Unlike the original ByteDance release, it includes the full WaveVAE encoder/decoder, enabling direct voice cloning from audio samples.

+ **Key Features:**
+ - 🎯 Zero-shot voice cloning from any 3-24 second audio sample
+ - 🌍 Bilingual: Chinese, English, and code-switching
+ - ⚡ Efficient: 0.45B-parameter diffusion transformer
+ - 🔧 Complete: includes the WaveVAE missing from the original release
+ - 🎛️ Controllable: adjustable voice similarity and clarity
+ - 💻 Windows ready: one-click installer available

+ ## Quick Start

+ ### Installation

+ **[📥 One-Click Windows Installer](https://github.com/Saganaki22/MegaTTS3-WaveVAE/releases/tag/Installer)** - automated setup with GPU detection

+ Or see the [manual installation](https://github.com/Saganaki22/MegaTTS3-WaveVAE#installation) guide for advanced users.

+ ### Usage Examples

  ```bash
+ # Basic voice cloning
+ python tts/infer_cli.py --input_wav 'reference.wav' --input_text "Your text here" --output_dir ./output

+ # Better quality settings
+ python tts/infer_cli.py --input_wav 'reference.wav' --input_text "Your text here" --output_dir ./output --p_w 2.0 --t_w 3.0

+ # Web interface (easiest)
+ python tts/megatts3_gradio.py
+ # Then open http://localhost:7929
  ```
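The commands above all follow one pattern, so batch scripts can assemble them programmatically. A sketch of such a wrapper; the helper itself is hypothetical, and only `tts/infer_cli.py` and its flags come from this README:

```python
import shlex

def build_infer_cmd(input_wav, input_text, output_dir="./output", p_w=None, t_w=None):
    # Hypothetical convenience wrapper: builds the tts/infer_cli.py invocation
    # shown above. shlex.join quotes arguments (e.g. text with spaces) safely.
    args = ["python", "tts/infer_cli.py",
            "--input_wav", input_wav,
            "--input_text", input_text,
            "--output_dir", output_dir]
    if p_w is not None:
        args += ["--p_w", str(p_w)]
    if t_w is not None:
        args += ["--t_w", str(t_w)]
    return shlex.join(args)
```

The resulting string can be run with `subprocess.run(shlex.split(cmd))` once the environment is set up.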
 
+ ## Model Components

+ - **Diffusion Transformer**: 0.45B-parameter TTS model
+ - **WaveVAE**: high-quality audio encoder/decoder
+ - **Aligner**: speech-text alignment model
+ - **G2P**: grapheme-to-phoneme converter

+ ## Parameters

+ - `--p_w` (intelligibility weight): 1.0-5.0; higher = clearer speech
+ - `--t_w` (similarity weight): 0.0-10.0; higher = closer to the reference voice
+ - **Tip**: set `--t_w` 0-3 points higher than `--p_w`

+ ## Requirements

+ - Windows 10/11 or Linux
+ - Python 3.10
+ - 8GB+ RAM; NVIDIA GPU recommended
+ - 5GB+ storage space
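Since cloning expects a 3-24 second reference sample (see Key Features), it can pay to validate the prompt length before inference. A standard-library sketch; `check_prompt_length` is a hypothetical helper, not part of this package:

```python
import wave

def prompt_duration_seconds(path):
    # Reads a PCM WAV header and returns the clip duration in seconds.
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def check_prompt_length(path, lo=3.0, hi=24.0):
    # Hypothetical helper: True if the reference clip fits the 3-24 s
    # range recommended above for zero-shot cloning.
    return lo <= prompt_duration_seconds(path) <= hi
```

Clips outside the range can be trimmed or padded before being passed as `--input_wav`.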
 
+ ## Credits

+ - **Original MegaTTS3**: [ByteDance Research](https://github.com/bytedance/MegaTTS3)
+ - **WaveVAE Model**: [ACoderPassBy/MegaTTS-SFT](https://modelscope.cn/models/ACoderPassBy/MegaTTS-SFT) [Apache 2.0]
+ - **Additional Components**: [mrfakename/MegaTTS3-VoiceCloning](https://huggingface.co/mrfakename/MegaTTS3-VoiceCloning)
+ - **Windows Implementation & Complete Package**: [Saganaki22/MegaTTS3-WaveVAE](https://github.com/Saganaki22/MegaTTS3-WaveVAE)
+ - **Special Thanks**: MysteryShack on Discord for model information
+ ## Citation

+ If you use this model, please cite the original research:

+ ```bibtex
  @article{jiang2025sparse,
    title={Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis},
    author={Jiang, Ziyue and Ren, Yi and Li, Ruiqi and Ji, Shengpeng and Ye, Zhenhui and Zhang, Chen and Jionghao, Bai and Yang, Xiaoda and Zuo, Jialong and Zhang, Yu and others},

  journal={arXiv preprint arXiv:2408.16532},
  year={2024}
  }
+ ```
+
+ ---
+ *High-quality voice cloning for research and creative applications. Please use responsibly.*