Update README.md
## VibeVoice: A Frontier Open-Source Text-to-Speech Model
> This repository contains a copy of model weights obtained from ModelScope ([microsoft/VibeVoice-Large](https://www.modelscope.cn/models/microsoft/VibeVoice-Large)).
> The license for this model is the `MIT License`, **which permits redistribution**.
>
> My understanding of the MIT License, which is consistent with the broader open-source community's consensus,
> is that it grants the right to distribute copies of the software and its derivatives.
> Therefore, I am lawfully exercising the right to redistribute this model.
>
> If you are a rights holder and believe this understanding of the license is incorrect, please submit a DMCA complaint to Hugging Face at [email protected]

VibeVoice is a novel framework designed for generating expressive, long-form, multi-speaker conversational audio, such as podcasts, from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking.

A core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz. These tokenizers preserve audio fidelity while substantially improving computational efficiency on long sequences. VibeVoice employs a next-token diffusion framework: a Large Language Model (LLM) models the textual context and dialogue flow, and a diffusion head generates the high-fidelity acoustic details.
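To make the idea of next-token diffusion over a 7.5 Hz latent sequence concrete, here is a toy sketch. It is not VibeVoice's implementation and makes several assumptions. Each frame is a single small continuous vector, and the separate Acoustic and Semantic tokenizers are not modeled. The LLM and the diffusion head are replaced by hypothetical stand-ins (`toy_llm_hidden`, `toy_diffusion_head`). The sampler is deterministic DDIM with a cosine noise schedule, chosen here only as a common diffusion sampler. The loop shows how the pieces fit together: for each frame, the backbone turns the context into a hidden state, and the diffusion head denoises Gaussian noise into a continuous latent conditioned on that state.

```python
import math
import random
from typing import List, Sequence

FRAME_RATE_HZ = 7.5  # latent frame rate stated in the README


def num_latent_frames(duration_s: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of continuous latent frames needed to cover `duration_s` seconds of audio."""
    return math.ceil(duration_s * frame_rate_hz)


def alpha_bars(steps: int) -> List[float]:
    """Cosine noise schedule: index 0 is clean (1.0), index `steps` is almost pure noise."""
    s = 0.008

    def f(t: int) -> float:
        return math.cos((t / steps + s) / (1 + s) * math.pi / 2) ** 2

    return [f(t) / f(0) for t in range(steps + 1)]


def toy_llm_hidden(text_ids: Sequence[int], history: List[List[float]], dim: int) -> List[float]:
    """HYPOTHETICAL stand-in for the LLM backbone.

    A real model would attend over the text tokens and all previous latents; this toy
    just mixes a scalar text feature with the previous frame so the loop is runnable.
    """
    text_feature = sum(text_ids) / max(len(text_ids), 1)
    prev = history[-1] if history else [0.0] * dim
    step = len(history)
    return [0.5 * prev[i] + 0.1 * math.sin(text_feature + step + i) for i in range(dim)]


def toy_diffusion_head(x_t: List[float], t: int, cond: List[float]) -> List[float]:
    """HYPOTHETICAL stand-in for the diffusion head: predicts the clean latent x0.

    A trained head would be a network of (x_t, t, cond); this toy version ignores
    x_t and t and returns the conditioning vector as an "oracle" x0 prediction.
    """
    return list(cond)


def ddim_sample(cond: List[float], rng: random.Random, steps: int = 10) -> List[float]:
    """Deterministic DDIM (eta = 0) sampling of one latent frame, conditioned on `cond`."""
    ab = alpha_bars(steps)
    x = [rng.gauss(0.0, 1.0) for _ in cond]  # start from Gaussian noise
    for t in range(steps, 0, -1):
        x0_hat = toy_diffusion_head(x, t, cond)
        a_t, a_prev = ab[t], ab[t - 1]
        # Noise implied by the current sample and the x0 prediction.
        eps_hat = [(xi - math.sqrt(a_t) * x0i) / math.sqrt(1.0 - a_t)
                   for xi, x0i in zip(x, x0_hat)]
        # Step to the less noisy level t-1 along the deterministic DDIM path.
        x = [math.sqrt(a_prev) * x0i + math.sqrt(1.0 - a_prev) * ei
             for x0i, ei in zip(x0_hat, eps_hat)]
    return x


def generate(text_ids: Sequence[int], duration_s: float, dim: int = 4,
             steps: int = 10, seed: int = 0) -> List[List[float]]:
    """Next-token diffusion: one backbone step, then one diffusion sample, per latent frame."""
    rng = random.Random(seed)
    latents: List[List[float]] = []
    for _ in range(num_latent_frames(duration_s)):
        cond = toy_llm_hidden(text_ids, latents, dim)  # context -> hidden state
        latents.append(ddim_sample(cond, rng, steps))  # hidden state -> continuous latent
    return latents


if __name__ == "__main__":
    frames = generate([12, 7, 42], duration_s=2.0)
    print(len(frames), "latent frames for 2 s of audio")
    print(num_latent_frames(90 * 60), "latent frames for a 90-minute episode")
```

Running it reports 15 latent frames for 2 s of audio and 40500 for a 90-minute episode. If the waveform is 24 kHz (an assumption, not stated in this README), each 7.5 Hz frame stands in for 24000 / 7.5 = 3200 samples. That gap between frame count and sample count is the source of the long-sequence savings the tokenizers provide.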