arXiv:2507.09862

SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation

Published on Jul 14 · Submitted by dorni on Jul 15 · #3 Paper of the day

Abstract

AI-generated summary: A large-scale dataset named SpeakerVid-5M is introduced for audio-visual dyadic interactive virtual human generation, featuring diverse interactions and high-quality data for various virtual human tasks.

The rapid development of large-scale models has catalyzed significant breakthroughs in the digital human domain. These advanced methodologies offer high-fidelity solutions for avatar driving and rendering, leading academia to focus on the next major challenge: audio-visual dyadic interactive virtual humans. To facilitate research in this emerging area, we present the SpeakerVid-5M dataset, the first large-scale, high-quality dataset designed for audio-visual dyadic interactive virtual human generation. Totaling over 8,743 hours, SpeakerVid-5M contains more than 5.2 million video clips of human portraits. It covers diverse scales and interaction types, including monadic talking, listening, and dyadic conversations. Crucially, the dataset is structured along two key dimensions: interaction type and data quality. First, it is categorized into four types (dialogue branch, single branch, listening branch, and multi-turn branch) based on the interaction scenario. Second, it is stratified into a large-scale pre-training subset and a curated, high-quality subset for supervised fine-tuning (SFT). This dual structure accommodates a wide array of 2D virtual human tasks. In addition, we provide an autoregressive (AR) video chat baseline trained on this data, accompanied by a dedicated set of metrics and test data, VidChatBench, to serve as a benchmark for future work. Both the dataset and the corresponding data processing code will be publicly released. Project page: https://dorniwang.github.io/SpeakerVid-5M/
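Because the corpus is organized along these two axes (interaction branch and quality tier), assembling a training subset amounts to filtering per-clip metadata. Below is a minimal sketch of such a filter; the metadata file name and the "branch"/"split" field names are hypothetical placeholders for illustration, not the released schema.

```python
import json
from pathlib import Path


# Hypothetical metadata layout: one JSON object per line describing a clip.
# The field names ("branch", "split") are assumptions for illustration only;
# consult the released annotation files for the actual schema.
def select_clips(metadata_path, branch="dialogue", split="sft"):
    """Yield clip records matching an interaction branch and a quality split."""
    with Path(metadata_path).open() as f:
        for line in f:
            record = json.loads(line)
            if record.get("branch") == branch and record.get("split") == split:
                yield record


if __name__ == "__main__":
    # Example: gather high-quality dialogue-branch clips for SFT.
    sft_dialogue = list(select_clips("speakervid5m_metadata.jsonl"))
    print(f"{len(sft_dialogue)} dialogue clips selected from the SFT subset")
```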

Community

Paper author · Paper submitter

We propose SpeakerVid-5M, the first large-scale dataset designed specifically for the audio-visual dyadic interactive virtual human task. It includes 1M high-quality dialogue audio-visual pairs, with support for multi-turn conversations. VidChatBench is also provided for evaluation.

SpeakerVid-5M contains 5M single-speaker audio-visual clips, making it the largest talking-human dataset to date. It covers a wide range of annotated visual formats, including talking-head, half-body, full-body, and side-view videos.

We open-source the entire dataset, including the raw data, annotations, and data processing pipeline, providing full transparency and reproducibility for the community. Project page: https://dorniwang.github.io/SpeakerVid-5M/

This is excellent and interesting work!

Youliang is pretty cool!!
