nielsr (HF Staff) committed
Commit 4d46c5a · verified · 1 Parent(s): 347a041

Improve model card for SingingSDS: Add metadata, links, usage, and detailed documentation


This Pull Request significantly enhances the model card for **SingingSDS: A Singing-Capable Spoken Dialogue System for Conversational Roleplay Applications**.

Key improvements include:
- **Metadata**:
  - Added `pipeline_tag: text-to-audio` for better discoverability on the Hugging Face Hub.
  - Updated `license` to `mit`, aligning with the explicit code license found in the GitHub repository's `README.md`.
  - Added `library_name: espnet`, reflecting the primary singing voice synthesis library integral to the system's core functionality and consistent with the existing `base_model` tag.
- **Comprehensive Content**:
  - Included a clear overview of the project and its capabilities, derived from the paper abstract and GitHub README.
  - Provided essential links to the [paper](https://huggingface.co/papers/2511.20972), [GitHub repository](https://github.com/SingingSDS/SingingSDS), and [Hugging Face Space demo](https://huggingface.co/spaces/espnet/SingingSDS), along with a link to the project video playlist.
  - Integrated detailed sections for installation, CLI and web usage, configuration options, project structure, and contributing guidelines, all directly sourced from the GitHub README.
  - Maintained original formatting, including explicit newline characters (`\n`) in code snippets, as instructed.
  - Included a clear breakdown of the project's various licenses (Code, Character Assets, Model Licenses) for full transparency.
  - Added a BibTeX citation entry for easy academic reference.

This update transforms the model card into a comprehensive resource for users looking to understand and utilize SingingSDS directly from the Hugging Face Hub.

Files changed (1)
  1. README.md +220 -4
README.md CHANGED
@@ -1,8 +1,224 @@
  ---
- license: cc-by-4.0
  language:
  - zh
  - ja
- base_model:
- - espnet/mixdata_svs_visinger2_spkemb_lang_pretrained
- ---
  ---
+ base_model:
+ - espnet/mixdata_svs_visinger2_spkemb_lang_pretrained
  language:
  - zh
  - ja
+ license: mit
+ pipeline_tag: text-to-audio
+ library_name: espnet
+ ---

# SingingSDS: A Singing-Capable Spoken Dialogue System for Conversational Roleplay Applications

**A role-playing singing dialogue system that converts speech input into character-based singing output.**

<div align="center">

[![Paper](https://img.shields.io/badge/Paper-2511.20972-orange)](https://huggingface.co/papers/2511.20972) [![Code](https://img.shields.io/badge/Code-GitHub-black)](https://github.com/SingingSDS/SingingSDS) [![HuggingFace Demo](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Demo-yellow)](https://huggingface.co/spaces/espnet/SingingSDS) [![YouTube](https://img.shields.io/badge/YouTube-Playlist-red)](https://www.youtube.com/playlist?list=PLZpUJJbwp2WvtPBenG5D3h09qKIrt24ui)

</div>

## 📖 Overview

SingingSDS is a role-playing singing dialogue system that converts natural speech input into character-based singing output. It is a cascaded spoken dialogue system (SDS) that integrates automatic speech recognition (ASR), large language models (LLMs), and singing voice synthesis (SVS), and responds through singing rather than speaking, fostering more affective, memorable, and pleasurable interactions in character-based roleplay and interactive entertainment scenarios. SingingSDS employs a modular ASR-LLM-SVS pipeline and supports a wide range of configurations across character personas, ASR and LLM backends, SVS models, melody sources, and voice profiles.
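The cascaded ASR-LLM-SVS flow can be sketched in a few lines of Python. Every name below is illustrative rather than the actual SingingSDS API; the real orchestration lives in `pipeline.py` in the repository.

```python
# Hypothetical sketch of one cascaded ASR -> LLM -> SVS dialogue turn.
# All function names are illustrative stand-ins, not the SingingSDS API.
from dataclasses import dataclass


@dataclass
class Turn:
    transcript: str   # what the user said (ASR output)
    reply: str        # the character's response text (LLM output)
    audio_path: str   # where the synthesized singing was written (SVS output)


def transcribe(audio_path: str) -> str:
    # Stand-in for an ASR backend such as Whisper or Paraformer.
    return "hello"


def generate_reply(text: str, persona: str) -> str:
    # Stand-in for an LLM responding in the configured character persona.
    return f"{persona} sings back: {text}"


def synthesize_singing(text: str) -> str:
    # Stand-in for an SVS backend such as VISinger2; returns an output path.
    return "outputs/reply.wav"


def run_turn(query_audio: str, persona: str = "Yaoyin") -> Turn:
    transcript = transcribe(query_audio)            # 1. speech -> text
    reply = generate_reply(transcript, persona)     # 2. text -> in-character reply
    return Turn(transcript, reply, synthesize_singing(reply))  # 3. reply -> singing


if __name__ == "__main__":
    turn = run_turn("tests/audio/hello.wav")
    print(turn.reply)  # -> Yaoyin sings back: hello
```

Because each stage only exchanges text or file paths, any backend listed in the Configuration section below can be swapped in without touching the other stages.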

<div align="center">
  <img src="https://huggingface.co/espnet/SingingSDS/resolve/main/assets/demo.png" alt="SingingSDS Interface" style="max-width: 100%; height: auto;"/>
  <p><em>SingingSDS Web Interface: Interactive singing dialogue system with character visualization, audio I/O, evaluation metrics, and flexible configuration options.</em></p>
</div>

## 🚀 Installation

### Requirements

- Python 3.10 or 3.11
- CUDA (optional, for GPU acceleration)

### Install Dependencies

#### Option 1: Using Conda (Recommended)

```bash
conda create -n singingsds python=3.11
conda activate singingsds
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
```

#### Option 2: Using uv (Fast & Modern)

First install uv:

```bash
# On macOS/Linux:
curl -LsSf https://astral.sh/uv/install.sh | sh

# On Windows:
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

# Or via pip:
pip install uv
```

Then install dependencies:

```bash
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -r requirements.txt
```

#### Option 3: Using pip only

```bash
pip install -r requirements.txt
```

#### Option 4: Using pip with virtual environment

```bash
python -m venv singingsds_env

# On Windows:
singingsds_env\Scripts\activate
# On macOS/Linux:
source singingsds_env/bin/activate

pip install -r requirements.txt
```

## 💻 Usage

### Command Line Interface (CLI)

#### Example Usage

```bash
python cli.py \
    --query_audio tests/audio/hello.wav \
    --config_path config/cli/yaoyin_default.yaml \
    --output_audio outputs/yaoyin_hello.wav \
    --eval_results_csv outputs/yaoyin_test.csv
```

#### Inference-Only Mode

Run minimal inference without evaluation:

```bash
python cli.py \
    --query_audio tests/audio/hello.wav \
    --config_path config/cli/yaoyin_default_infer_only.yaml \
    --output_audio outputs/yaoyin_hello.wav
```

#### Parameter Description

- `--query_audio`: Input audio file path (required)
- `--config_path`: Configuration file path (default: `config/cli/yaoyin_default.yaml`)
- `--output_audio`: Output audio file path (required)
- `--eval_results_csv`: Optional path for saving evaluation results as CSV (as in the first example above)

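When `--eval_results_csv` is supplied, evaluation results are written as CSV. The exact columns depend on the configured metrics, so the header below is purely illustrative; this hedged sketch only shows how such a file could be read back with the standard library:

```python
# Read an evaluation CSV like the one produced via --eval_results_csv.
# Column names here are hypothetical; inspect your file's real header first.
import csv
import io

# Stand-in for open("outputs/yaoyin_test.csv"):
sample = io.StringIO("utterance,metric,score\nyaoyin_hello,overall,0.87\n")

rows = list(csv.DictReader(sample))
for row in rows:
    print(row["utterance"], row["metric"], row["score"])
```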
### 🌐 Web Interface (Gradio)

Start the web interface:

```bash
python app.py
```

Then visit the displayed address in your browser to use the graphical interface.

> 💡 **Tip**: You can also try our [HuggingFace demo](https://huggingface.co/spaces/espnet/SingingSDS) for a quick test without local installation!

## ⚙️ Configuration

### Character Configuration

The system supports multiple preset characters:

- **Yaoyin (遥音)**: Default timbre is `timbre2`
- **Limei (丽梅)**: Default timbre is `timbre1`

### Model Configuration

#### ASR Models

| Model | Description |
|-------|-------------|
| `openai/whisper-large-v3-turbo` | Latest Whisper model with turbo optimization |
| `openai/whisper-large-v3` | Large Whisper v3 model |
| `openai/whisper-medium` | Medium-sized Whisper model |
| `openai/whisper-small` | Small Whisper model |
| `funasr/paraformer-zh` | Paraformer for Chinese ASR |

#### LLM Models

| Model | Description |
|-------|-------------|
| `gemini-2.5-flash` | Google Gemini 2.5 Flash |
| `google/gemma-2-2b` | Google Gemma 2 2B model |
| `meta-llama/Llama-3.2-3B-Instruct` | Meta Llama 3.2 3B Instruct |
| `meta-llama/Llama-3.1-8B-Instruct` | Meta Llama 3.1 8B Instruct |
| `Qwen/Qwen3-8B` | Qwen3 8B model |
| `Qwen/Qwen3-30B-A3B` | Qwen3 30B A3B model |
| `MiniMaxAI/MiniMax-Text-01` | MiniMax Text model |

#### SVS Models

| Model | Language Support |
|-------|------------------|
| `espnet/visinger2-zh-jp-multisinger-svs` | Bilingual (Chinese & Japanese) |
| `espnet/aceopencpop_svs_visinger2_40singer_pretrain` | Chinese |

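These choices come together in the YAML files under `config/` (e.g. `config/cli/yaoyin_default.yaml`). The fragment below is a purely illustrative sketch of how a character, ASR, LLM, and SVS selection might be combined; the field names are assumptions, not the actual schema, so consult the shipped config files for the real keys.

```yaml
# Illustrative only -- NOT the actual SingingSDS config schema.
character: yaoyin                                  # or: limei
asr_model: openai/whisper-large-v3-turbo           # any model from the ASR table
llm_model: Qwen/Qwen3-8B                           # any model from the LLM table
svs_model: espnet/visinger2-zh-jp-multisinger-svs  # any model from the SVS table
timbre: timbre2                                    # Yaoyin's default timbre
```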
## 📁 Project Structure

```
SingingSDS/
├── app.py, cli.py     # Entry points (demo app & CLI)
├── pipeline.py        # Main orchestration pipeline
├── interface.py       # Gradio interface
├── characters/        # Virtual character definitions
├── modules/           # Core modules
│   ├── asr/           # ASR models (Whisper, Paraformer)
│   ├── llm/           # LLMs (Gemini, LLaMA, etc.)
│   ├── svs/           # Singing voice synthesis (ESPnet)
│   └── utils/         # G2P, text normalization, resources
├── config/            # YAML configuration files
├── data/              # Dataset metadata and length info
├── data_handlers/     # Parsers for KiSing, Touhou, etc.
├── evaluation/        # Evaluation metrics
├── resources/         # Singer embeddings, phoneme dicts, MIDI
├── assets/            # Character visuals
├── tests/             # Unit tests and sample audios
└── README.md, requirements.txt
```

## 🤝 Contributing

We welcome contributions! Please feel free to submit issues and pull requests.

## 📄 License

### Character Assets

The Yaoyin (遥音) character assets, including [`character_yaoyin.png`](https://huggingface.co/espnet/SingingSDS/resolve/main/assets/character_yaoyin.png) created by illustrator Zihe Zhou, are commissioned exclusively for the SingingSDS project. Screenshots of the system that include these assets, such as [`demo.png`](https://huggingface.co/espnet/SingingSDS/resolve/main/assets/demo.png), are also covered under this license. The assets may be used only for direct derivatives of SingingSDS, such as project-related posts, usage videos, or other content directly depicting the project. Any other use requires express permission from the illustrator, and these assets may not be used for training, fine-tuning, or improving any artificial intelligence or machine learning models. For full license details, see [`assets/character_yaoyin.LICENSE`](https://huggingface.co/espnet/SingingSDS/resolve/main/assets/character_yaoyin.LICENSE).

### Code License

All source code in this repository is licensed under the [MIT License](https://github.com/SingingSDS/SingingSDS/blob/main/LICENSE). This license applies **only to the code**. Character assets remain under their separate license and restrictions, as described in the **Character Assets** section.

### Model License

The models used in SingingSDS are subject to their respective licenses and terms of use. Users must comply with each model’s official license, which can be found at the respective model’s official repository or website.

## ✏️ Citation

If you find our work helpful or inspiring, please feel free to cite it:

```bibtex
@article{singingsds2025,
  title={SingingSDS: A Singing-Capable Spoken Dialogue System for Conversational Roleplay Applications},
  author={Author list will be added later},
  journal={arXiv preprint arXiv:2511.20972},
  year={2025}
}
```