HunyuanVideo-Avatar: High-Fidelity Audio-Driven Human Animation for Multiple Characters
Abstract
HunyuanVideo-Avatar, a multimodal diffusion transformer model, addresses challenges in audio-driven animation by ensuring character consistency, emotion alignment, and support for multi-character scenarios.
Recent years have witnessed significant progress in audio-driven human animation. However, critical challenges remain in (i) generating highly dynamic videos while preserving character consistency, (ii) achieving precise emotion alignment between characters and audio, and (iii) enabling multi-character audio-driven animation. To address these challenges, we propose HunyuanVideo-Avatar, a multimodal diffusion transformer (MM-DiT)-based model capable of simultaneously generating dynamic, emotion-controllable, and multi-character dialogue videos. Concretely, HunyuanVideo-Avatar introduces three key innovations: (i) a character image injection module that replaces the conventional addition-based character conditioning scheme, eliminating the inherent condition mismatch between training and inference and thereby ensuring dynamic motion together with strong character consistency; (ii) an Audio Emotion Module (AEM) that extracts emotional cues from an emotion reference image and transfers them to the generated target video, enabling fine-grained and accurate emotion style control; and (iii) a Face-Aware Audio Adapter (FAA) that isolates the audio-driven character with a latent-level face mask, enabling independent audio injection via cross-attention in multi-character scenarios. These innovations enable HunyuanVideo-Avatar to surpass state-of-the-art methods on benchmark datasets and on a newly proposed wild dataset, generating realistic avatars in dynamic, immersive scenarios.
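The idea behind item (iii), restricting each audio stream's cross-attention to the latent positions covered by one character's face mask, can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`cross_attend`, `inject_audio`, `inject_multi`), the residual-add form, the binary masks, and the absence of learned query/key/value projections are all assumptions made for illustration only.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(query, keys, values):
    # Scaled dot-product attention for a single query vector.
    # Assumption: no learned projections; a real model would apply them.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

def inject_audio(latents, audio, face_mask):
    # Add the audio-attended signal only where face_mask is nonzero
    # (hypothetical residual form; the mask is per latent token).
    out = []
    for token, m in zip(latents, face_mask):
        if m == 0:
            out.append(list(token))  # token outside this face: unchanged
            continue
        attended = cross_attend(token, audio, audio)
        out.append([t + m * a for t, a in zip(token, attended)])
    return out

def inject_multi(latents, characters):
    # characters: list of (audio_tokens, face_mask) pairs, one per speaker.
    for audio, mask in characters:
        latents = inject_audio(latents, audio, mask)
    return latents

# Toy usage: three latent tokens, two characters with disjoint face masks.
latents = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
speaker_a = ([[1.0, 1.0]], [1, 0, 0])    # audio for character A, mask on token 0
speaker_b = ([[-1.0, -1.0]], [0, 1, 0])  # audio for character B, mask on token 1
result = inject_multi(latents, [speaker_a, speaker_b])
```

Because the masks are disjoint, each character's latent tokens receive only that character's audio signal, and tokens outside every mask (the background token here) pass through unchanged.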