---
title: Voice Clone TTS
emoji: π
colorFrom: green
colorTo: purple
sdk: gradio
sdk_version: 5.41.1
app_file: app.py
pinned: true
short_description: mcp_server
---
This Space is a Text-to-Speech (TTS) application built on the Zonos model.
## English Explanation
### Overview
This is a Gradio-based web application for the **Zonos Text-to-Speech (TTS) Generator**. Zonos is an advanced TTS model from Zyphra that can generate natural-sounding speech with customizable voice characteristics.
### Key Features
1. **Model Selection**
- Two model variants: Transformer and Hybrid
- Different models have different conditioning capabilities
2. **Text Input & Language Support**
- Supports multiple languages through eSpeak phoneme conversion
- Text length limit of 500 characters
- Language selection from supported language codes
3. **Voice Customization**
- **Speaker Cloning**: Upload audio to clone a specific voice
- **Voice Quality Settings**:
  - DNS-MOS (Voice Quality): 1.0-5.0 scale
  - Frequency Max: Control the highest frequency in Hz
  - Voice Clarity: Adjust voice intelligibility
  - Pitch Variation: Control how much the pitch varies
  - Speaking Rate: Adjust speech speed
4. **Emotion Control**
- 8 emotion sliders: Happiness, Sadness, Disgust, Fear, Surprise, Anger, Other, Neutral
- Fine-tune emotional expression in the generated speech
5. **Advanced Generation Parameters**
- **Guidance Scale**: Controls how closely the model follows the conditioning
- **Min P**: Sampling cutoff that controls randomness in generation; lower values permit more varied output
- **Seed**: For reproducible results
- **Prefix Audio**: Continue generation from existing audio
6. **Unconditional Generation**
- Toggle specific conditions to let the model generate them automatically
- Useful for more creative/varied outputs
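The Min P and Seed parameters above can be illustrated with a small sketch. It assumes that Min P refers to min-p sampling, where tokens whose probability falls below `min_p` times the top probability are discarded before drawing. The function names `min_p_filter` and `sample_token` are hypothetical and are not taken from `app.py`.

```python
import math
import random


def min_p_filter(logits, min_p):
    """Return probabilities with tokens below min_p * max_prob zeroed, then renormalised."""
    # Softmax with max-subtraction for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep only tokens at least min_p times as likely as the most likely token.
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]

    # Renormalise the survivors; the top token always survives for min_p <= 1.
    kept_total = sum(kept)
    return [p / kept_total for p in kept]


def sample_token(logits, min_p, seed):
    """Draw one token index; a fixed seed makes the draw reproducible."""
    rng = random.Random(seed)
    probs = min_p_filter(logits, min_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Raising `min_p` removes more of the low-probability tail, so output becomes more predictable; calling `sample_token` twice with the same seed returns the same index.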
### Technical Details
- Uses GPU acceleration via CUDA
- Implements classifier-free guidance for better control
- Supports audio continuation from prefix
- Real-time progress tracking during generation
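As a rough illustration of classifier-free guidance, the sketch below uses the common formulation in which conditional and unconditional predictions are blended by the guidance scale. Whether the app combines them in exactly this way is an assumption, and `apply_guidance` is a hypothetical helper name.

```python
def apply_guidance(cond_logits, uncond_logits, guidance_scale):
    """Blend conditional and unconditional logits in the usual classifier-free-guidance form."""
    if len(cond_logits) != len(uncond_logits):
        raise ValueError("logit vectors must have the same length")
    # scale = 0 gives the unconditional prediction, scale = 1 the conditional one,
    # and scale > 1 pushes further in the direction of the conditioning.
    return [
        u + guidance_scale * (c - u)
        for c, u in zip(cond_logits, uncond_logits)
    ]
```

With a scale above 1, the difference between the two predictions is amplified, which is why a higher Guidance Scale makes the output follow the conditioning more closely.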
### How to Use
1. Select a model variant
2. Enter your text and choose language
3. (Optional) Upload speaker audio for voice cloning
4. Adjust voice characteristics and emotions
5. Click "Generate Audio" to create speech
6. Download or play the generated audio
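The inputs collected in the steps above could be checked before generation along the following lines. This is only a sketch: the `TTSRequest` class, its field names, the default values, and the example language code are hypothetical, while the limits (500 characters, DNS-MOS from 1.0 to 5.0, eight emotions, two model variants) come from the feature list.

```python
from dataclasses import dataclass, field

MAX_TEXT_CHARS = 500  # text length limit stated in the feature list
MODELS = ("Transformer", "Hybrid")
EMOTIONS = ("happiness", "sadness", "disgust", "fear",
            "surprise", "anger", "other", "neutral")


@dataclass
class TTSRequest:
    """Hypothetical container for the inputs gathered in steps 1-4."""
    model: str
    text: str
    language: str                 # a language code, e.g. "en-us" (illustrative)
    dnsmos: float = 4.0           # illustrative default on the 1.0-5.0 scale
    emotions: dict = field(default_factory=dict)
    seed: int = 0

    def validate(self):
        """Return a list of human-readable problems; an empty list means OK."""
        problems = []
        if self.model not in MODELS:
            problems.append(f"unknown model: {self.model}")
        if not self.text or len(self.text) > MAX_TEXT_CHARS:
            problems.append(f"text must be 1-{MAX_TEXT_CHARS} characters")
        if not 1.0 <= self.dnsmos <= 5.0:
            problems.append("dnsmos must be between 1.0 and 5.0")
        unknown = sorted(set(self.emotions) - set(EMOTIONS))
        if unknown:
            problems.append(f"unknown emotions: {', '.join(unknown)}")
        return problems
```

For example, `TTSRequest(model="Transformer", text="Hello", language="en-us").validate()` returns an empty list, while a 501-character text yields one problem entry.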
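### Special Features
- **Emotion Settings**: Fine-grained control over the emotional tone of the generated speech
- **Voice Quality**: Adjust voice quality through the DNS-MOS score
- **Speaker Noise Removal**: Option to denoise the uploaded speaker audio
- **Unconditional Keys**: Set specific features to be generated automatically by the model

This application is a powerful, flexible tool for high-quality TTS generation and can be used to produce voice content for a wide range of purposes.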