arxiv:2505.20156

HunyuanVideo-Avatar: High-Fidelity Audio-Driven Human Animation for Multiple Characters

Published on May 26

Authors:

Abstract

HunyuanVideo-Avatar, a multimodal diffusion transformer model, addresses challenges in audio-driven animation by ensuring character consistency, emotion alignment, and support for multi-character scenarios.

AI-generated summary

Recent years have witnessed significant progress in audio-driven human animation. However, critical challenges remain in (i) generating highly dynamic videos while preserving character consistency, (ii) achieving precise emotion alignment between characters and audio, and (iii) enabling multi-character audio-driven animation. To address these challenges, we propose HunyuanVideo-Avatar, a multimodal diffusion transformer (MM-DiT)-based model capable of simultaneously generating dynamic, emotion-controllable, and multi-character dialogue videos. Concretely, HunyuanVideo-Avatar introduces three key innovations: (i) a character image injection module replaces the conventional addition-based character conditioning scheme, eliminating the inherent condition mismatch between training and inference and thereby ensuring dynamic motion and strong character consistency; (ii) an Audio Emotion Module (AEM) extracts emotional cues from an emotion reference image and transfers them to the target generated video, enabling fine-grained and accurate emotion style control; (iii) a Face-Aware Audio Adapter (FAA) isolates the audio-driven character with a latent-level face mask, enabling independent audio injection via cross-attention in multi-character scenarios. These innovations enable HunyuanVideo-Avatar to surpass state-of-the-art methods on benchmark datasets and a newly proposed wild dataset, generating realistic avatars in dynamic, immersive scenarios.
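The abstract gives no implementation details for the Face-Aware Audio Adapter, so the following is only a minimal sketch of the general idea it describes: audio features are injected through cross-attention, and a per-token face mask restricts each audio stream to the latent tokens of one character. Everything here is an illustrative assumption rather than the authors' code: the function names, the single attention head, the identity projections in place of learned weights, and the residual-style addition.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention.

    Assumption: identity projections stand in for learned Q/K/V weights,
    so the audio tokens serve as both keys and values.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        attended.append(
            [sum(w * kv[j] for w, kv in zip(weights, keys_values)) for j in range(dim)]
        )
    return attended


def masked_audio_injection(latents, characters):
    """Add audio-attended features only where a character's face mask is 1.

    latents:    list of latent token vectors (hypothetical video latents)
    characters: list of (audio_tokens, face_mask) pairs, one per speaker;
                face_mask holds one 0/1 entry per latent token
    """
    updated = [list(token) for token in latents]
    for audio_tokens, face_mask in characters:
        attended = cross_attention(latents, audio_tokens)
        for i, gate in enumerate(face_mask):
            for j in range(len(updated[i])):
                updated[i][j] += gate * attended[i][j]
    return updated


# Toy example: three latent tokens, two speakers with one audio token each.
latents = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
speaker_a = ([[1.0, 1.0]], [1, 0, 0])   # audio drives token 0 only
speaker_b = ([[-1.0, 2.0]], [0, 1, 0])  # audio drives token 1 only
out = masked_audio_injection(latents, [speaker_a, speaker_b])
print(out)
```

In this toy setup, token 2 lies outside both face masks and is left unchanged, while tokens 0 and 1 each receive features from only their own speaker's audio. That per-character isolation is the property the abstract attributes to the latent-level face mask.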


Community

Aehm

Jun 11


edited Jun 11

Start from this exact angle, where the massive golden robot stands firmly between the glass towers, dust still rising from its landing. Suddenly, the camera begins a rapid and dynamic 3D cinematic pan around the robot, showing its majesty and power from every angle—front, right, back, left, and back again.

During this pan, the robot's blue lenses glow and emit a faint electronic sound, revealing the details of its sturdy body in the sunlight. The soundtrack builds to the beat, and the shot ends with the camera steadily in front of it, while it looks directly into the lens, as if something big is about to happen.



Models citing this paper 2

Datasets citing this paper 0



Spaces citing this paper 3

Collections including this paper 1