---
title: Voice Clone TTS
emoji: 🏆
colorFrom: green
colorTo: purple
sdk: gradio
sdk_version: 5.41.1
app_file: app.py
pinned: true
short_description: mcp_server
---
This Space is a Text-to-Speech (TTS) application built on the Zonos model.

## English Explanation

### Overview
This is a Gradio-based web application for the **Zonos Text-to-Speech (TTS) Generator**. Zonos is an advanced TTS model from Zyphra that can generate natural-sounding speech with customizable voice characteristics.

### Key Features

1. **Model Selection**
   - Two model variants: Transformer and Hybrid
   - Different models have different conditioning capabilities

2. **Text Input & Language Support**
   - Supports multiple languages through eSpeak phoneme conversion
   - Text length limit of 500 characters
   - Language selection from supported language codes

3. **Voice Customization**
   - **Speaker Cloning**: Upload audio to clone a specific voice
   - **Voice Quality Settings**:
     - DNS-MOS (Voice Quality): 1.0-5.0 scale
     - Frequency Max: Control the highest frequency in Hz
     - Voice Clarity: Adjust voice intelligibility
     - Pitch Variation: Control how much the pitch varies
     - Speaking Rate: Adjust speech speed

4. **Emotion Control**
   - 8 emotion sliders: Happiness, Sadness, Disgust, Fear, Surprise, Anger, Other, Neutral
   - Fine-tune emotional expression in the generated speech

5. **Advanced Generation Parameters**
   - **Guidance Scale**: Controls how closely the model follows the conditioning
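   - **Min P**: Sampling threshold; tokens whose probability falls below `min_p` times that of the most likely token are discarded, so lower values give more varied output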
   - **Seed**: For reproducible results
   - **Prefix Audio**: Continue generation from existing audio

6. **Unconditional Generation**
   - Toggle specific conditions to let the model generate them automatically
   - Useful for more creative/varied outputs
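The controls listed above can be thought of as one bundle of conditioning values passed to the model. The following is a minimal sketch of that idea, not the app's own code: the `VoiceSettings` class, its field names, the default language code, truncation as the way the 500-character limit is enforced, and normalising the emotion sliders to sum to 1 are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

MAX_TEXT_CHARS = 500  # text length limit stated in this README
EMOTIONS = ("happiness", "sadness", "disgust", "fear",
            "surprise", "anger", "other", "neutral")
OPTIONAL_KEYS = {"dnsmos", "emotion"}  # keys this sketch allows to be unconditional


def clamp(value: float, low: float, high: float) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


@dataclass
class VoiceSettings:
    """Hypothetical container for the UI controls (not the app's own class)."""
    text: str
    language: str = "en-us"   # assumed eSpeak-style language code
    dnsmos: float = 4.0       # DNS-MOS slider, 1.0-5.0 per this README
    emotions: dict = field(default_factory=dict)          # slider name -> weight
    unconditional_keys: set = field(default_factory=set)  # keys left to the model

    def to_conditioning(self) -> dict:
        # Enforce the character limit by truncation (assumed behaviour).
        text = self.text[:MAX_TEXT_CHARS]
        # Sliders that are not set count as 0; negative values are ignored.
        raw = [max(0.0, float(self.emotions.get(name, 0.0))) for name in EMOTIONS]
        total = sum(raw)
        if total > 0:
            emotion = [weight / total for weight in raw]
        else:
            # No slider set: fall back to fully neutral (last entry).
            emotion = [0.0] * (len(EMOTIONS) - 1) + [1.0]
        cond = {
            "text": text,
            "language": self.language,
            "dnsmos": clamp(self.dnsmos, 1.0, 5.0),
            "emotion": emotion,
        }
        # Drop values marked unconditional so the model is free to choose them.
        for key in self.unconditional_keys & OPTIONAL_KEYS:
            cond.pop(key)
        return cond
```

For example, `VoiceSettings(text="Hello", emotions={"happiness": 3.0, "neutral": 1.0}).to_conditioning()` yields an emotion vector that weights happiness at 0.75 and neutral at 0.25. Keeping the unconditional toggle as a set of key names mirrors the "toggle specific conditions" description, but the real app may represent it differently.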

### Technical Details
- Uses GPU acceleration via CUDA
- Implements classifier-free guidance for better control
- Supports audio continuation from prefix
- Real-time progress tracking during generation
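To make classifier-free guidance and the Min P and Seed parameters more concrete, here is a small standard-library sketch of one sampling step. It illustrates the general techniques rather than the Zonos implementation: the function names and the default values for `guidance_scale` and `min_p` are assumptions, and a real model would operate on tensors rather than Python lists.

```python
import math
import random


def apply_cfg(cond_logits, uncond_logits, guidance_scale):
    """Classifier-free guidance: move from the unconditional logits toward
    (and, for scale > 1, past) the conditional logits."""
    return [u + guidance_scale * (c - u)
            for c, u in zip(cond_logits, uncond_logits)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def min_p_filter(probs, min_p):
    """Zero out tokens whose probability is below min_p * max(probs),
    then renormalise the survivors so they sum to 1."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


def sample_token(cond_logits, uncond_logits,
                 guidance_scale=2.0, min_p=0.15, seed=None):
    """Sample one token index; the defaults here are illustrative only."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    logits = apply_cfg(cond_logits, uncond_logits, guidance_scale)
    probs = min_p_filter(softmax(logits), min_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

With `guidance_scale=1.0` the guided logits equal the conditional ones, and larger values exaggerate the difference between the conditional and unconditional predictions. Calling `sample_token` twice with the same inputs and the same `seed` returns the same index, which is the sense in which a seed gives reproducible results.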

### How to Use
1. Select a model variant
2. Enter your text and choose language
3. (Optional) Upload speaker audio for voice cloning
4. Adjust voice characteristics and emotions
5. Click "Generate Audio" to create speech
6. Download or play the generated audio

### Special Features
- **Emotion Settings**: Fine-grained control over the emotional tone of the generated speech
- **Voice Quality**: Adjust voice quality through the DNS-MOS score
- **Speaker Denoising**: Option to remove noise from the uploaded speaker audio
- **Unconditional Keys**: Mark specific conditioning inputs so the model generates them automatically

This application is a flexible tool for high-quality TTS generation and can be used to produce voice content for a variety of purposes.