SingingSDS: A Singing-Capable Spoken Dialogue System for Conversational Roleplay Applications

A role-playing singing dialogue system that converts speech input into character-based singing output.


📖 Overview

SingingSDS is a role-playing singing dialogue system that converts natural speech input into character-based singing output. It is a cascaded spoken dialogue system (SDS) that integrates automatic speech recognition (ASR), a large language model (LLM), and singing voice synthesis (SVS), and it responds through singing rather than speaking, fostering more affective, memorable, and pleasurable interactions in character-based roleplay and interactive entertainment scenarios. The modular ASR-LLM-SVS pipeline supports a wide range of configurations across character personas, ASR and LLM backends, SVS models, melody sources, and voice profiles.


SingingSDS Web Interface: Interactive singing dialogue system with character visualization, audio I/O, evaluation metrics, and flexible configuration options.
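At a high level, each dialogue turn runs the ASR-LLM-SVS cascade described above. The sketch below is illustrative only; every class and method name is a hypothetical stand-in, not the actual SingingSDS API (see pipeline.py for the real implementation):

```python
# Illustrative ASR -> LLM -> SVS cascade. All names here are hypothetical
# stand-ins; see pipeline.py for the actual SingingSDS implementation.

class StubASR:
    def transcribe(self, audio: bytes) -> str:
        # A real backend would run Whisper or Paraformer here.
        return "hello"

class StubLLM:
    def chat(self, persona: str, text: str) -> str:
        # A real backend would generate an in-character reply.
        return f"{persona} replies to: {text}"

class StubSVS:
    def synthesize(self, text: str, timbre: str) -> bytes:
        # A real backend (e.g. a VISinger 2 model) would return waveform audio.
        return text.encode("utf-8")

def run_turn(audio: bytes, asr, llm, svs,
             persona: str = "Yaoyin", timbre: str = "timbre2") -> bytes:
    text = asr.transcribe(audio)          # 1. speech -> text (ASR)
    reply = llm.chat(persona, text)       # 2. text -> in-character reply (LLM)
    return svs.synthesize(reply, timbre)  # 3. reply -> singing audio (SVS)
```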

🚀 Installation

Requirements

  • Python 3.10 or 3.11
  • CUDA (optional, for GPU acceleration)

Install Dependencies

Option 1: Using Conda (Recommended)

conda create -n singingsds python=3.11

conda activate singingsds
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt

Option 2: Using uv (Fast & Modern)

First install uv:

# On macOS/Linux:
curl -LsSf https://astral.sh/uv/install.sh | sh

# On Windows:
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

# Or via pip:
pip install uv

Then install dependencies:

uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -r requirements.txt

Option 3: Using pip only

pip install -r requirements.txt

Option 4: Using pip with virtual environment

python -m venv singingsds_env

# On Windows:
singingsds_env\Scripts\activate
# On macOS/Linux:
source singingsds_env/bin/activate

pip install -r requirements.txt

💻 Usage

Command Line Interface (CLI)

Example Usage

python cli.py \
  --query_audio tests/audio/hello.wav \
  --config_path config/cli/yaoyin_default.yaml \
  --output_audio outputs/yaoyin_hello.wav \
  --eval_results_csv outputs/yaoyin_test.csv

Inference-Only Mode

Run minimal inference without evaluation.

python cli.py \
  --query_audio tests/audio/hello.wav \
  --config_path config/cli/yaoyin_default_infer_only.yaml \
  --output_audio outputs/yaoyin_hello.wav

Parameter Description

  • --query_audio: Input audio file path (required)
  • --config_path: Configuration file path (default: config/cli/yaoyin_default.yaml)
  • --output_audio: Output audio file path (required)
  • --eval_results_csv: Output path for the evaluation results CSV (omit in inference-only mode)
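For batch experiments, the CLI can be invoked once per input file. The helper below only assembles the argument lists rather than running them; the outputs/ naming scheme is an assumption for illustration:

```python
# Hypothetical helper that builds one cli.py argument list per .wav file in a
# directory; the output naming convention is an illustrative assumption.
from pathlib import Path

def build_commands(audio_dir: str,
                   config: str = "config/cli/yaoyin_default.yaml") -> list[list[str]]:
    commands = []
    for wav in sorted(Path(audio_dir).glob("*.wav")):
        out = Path("outputs") / f"yaoyin_{wav.stem}.wav"
        commands.append([
            "python", "cli.py",
            "--query_audio", str(wav),
            "--config_path", config,
            "--output_audio", str(out),
        ])
    return commands
```

Each returned argument list can then be passed to `subprocess.run`.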

🌐 Web Interface (Gradio)

Start the web interface:

python app.py

Then visit the displayed address in your browser to use the graphical interface.

💡 Tip: You can also try our HuggingFace demo for a quick test without local installation!

⚙️ Configuration

Character Configuration

The system supports multiple preset characters:

  • Yaoyin (ι₯音): Default timbre is timbre2
  • Limei (δΈ½ζ’…): Default timbre is timbre1

Model Configuration

ASR Models

| Model | Description |
| --- | --- |
| openai/whisper-large-v3-turbo | Latest Whisper model with turbo optimization |
| openai/whisper-large-v3 | Large Whisper v3 model |
| openai/whisper-medium | Medium-sized Whisper model |
| openai/whisper-small | Small Whisper model |
| funasr/paraformer-zh | Paraformer model for Chinese ASR |

LLM Models

| Model | Description |
| --- | --- |
| gemini-2.5-flash | Google Gemini 2.5 Flash |
| google/gemma-2-2b | Google Gemma 2 2B model |
| meta-llama/Llama-3.2-3B-Instruct | Meta Llama 3.2 3B Instruct |
| meta-llama/Llama-3.1-8B-Instruct | Meta Llama 3.1 8B Instruct |
| Qwen/Qwen3-8B | Qwen3 8B model |
| Qwen/Qwen3-30B-A3B | Qwen3 30B A3B model |
| MiniMaxAI/MiniMax-Text-01 | MiniMax Text model |

SVS Models

| Model | Language Support |
| --- | --- |
| espnet/visinger2-zh-jp-multisinger-svs | Bilingual (Chinese & Japanese) |
| espnet/aceopencpop_svs_visinger2_40singer_pretrain | Chinese |
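Because the two checkpoints differ in language coverage, the SVS backend has to match the language of the lyrics. One simple policy is to route Chinese-only output to the ACE-Opencpop checkpoint and fall back to the bilingual VISinger 2 model otherwise; the routing logic below is a hypothetical sketch, not part of SingingSDS:

```python
# Hypothetical language-based routing between the two SVS checkpoints listed
# above; the fallback policy is an illustrative assumption.
SVS_MODELS = {
    "zh": "espnet/aceopencpop_svs_visinger2_40singer_pretrain",  # Chinese only
    "ja": "espnet/visinger2-zh-jp-multisinger-svs",              # bilingual zh/ja
}
BILINGUAL_DEFAULT = "espnet/visinger2-zh-jp-multisinger-svs"

def pick_svs_model(lang: str) -> str:
    # Unknown languages fall back to the bilingual checkpoint.
    return SVS_MODELS.get(lang, BILINGUAL_DEFAULT)
```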

📁 Project Structure

SingingSDS/
├── app.py, cli.py               # Entry points (demo app & CLI)
├── pipeline.py                  # Main orchestration pipeline
├── interface.py                 # Gradio interface
├── characters/                  # Virtual character definitions
├── modules/                     # Core modules
│   ├── asr/                     # ASR models (Whisper, Paraformer)
│   ├── llm/                     # LLMs (Gemini, LLaMA, etc.)
│   ├── svs/                     # Singing voice synthesis (ESPnet)
│   └── utils/                   # G2P, text normalization, resources
├── config/                      # YAML configuration files
├── data/                        # Dataset metadata and length info
├── data_handlers/               # Parsers for KiSing, Touhou, etc.
├── evaluation/                  # Evaluation metrics
├── resources/                   # Singer embeddings, phoneme dicts, MIDI
├── assets/                      # Character visuals
├── tests/                       # Unit tests and sample audios
└── README.md, requirements.txt

🤝 Contributing

We welcome contributions! Please feel free to submit issues and pull requests.

📄 License

Character Assets

The Yaoyin (ι₯音) character assets, including character_yaoyin.png created by illustrator Zihe Zhou, are commissioned exclusively for the SingingSDS project. Screenshots of the system that include these assets, such as demo.png, are also covered under this license. The assets may be used only for direct derivatives of SingingSDS, such as project-related posts, usage videos, or other content directly depicting the project. Any other use requires express permission from the illustrator, and these assets may not be used for training, fine-tuning, or improving any artificial intelligence or machine learning models. For full license details, see assets/character_yaoyin.LICENSE.

Code License

All source code in this repository is licensed under the MIT License. This license applies only to the code. Character assets remain under their separate license and restrictions, as described in the Character Assets section.

Model License

The models used in SingingSDS are subject to their respective licenses and terms of use. Users must comply with each model's official license, which can be found at the respective model's official repository or website.

✏️ Citation

If you find our work helpful or inspiring, please feel free to cite it:

@article{singingsds2025,
      title={SingingSDS: A Singing-Capable Spoken Dialogue System for Conversational Roleplay Applications},
      author={Author list will be added later},
      journal={arXiv preprint arXiv:2511.20972},
      year={2025}
}