
Pusa V1.0 Model

Code Repository | Project Page | Dataset | Model | Paper (Pusa V1.0) | Paper (FVDM) | Follow on X | Xiaohongshu

Overview

The rapid advancement of video diffusion models has been hindered by fundamental limitations in temporal modeling, particularly the rigid synchronization of frame evolution imposed by conventional scalar timestep variables. While task-specific adaptations and autoregressive models have sought to address these challenges, they remain constrained by computational inefficiency, catastrophic forgetting, or narrow applicability.

In this work, we present Pusa¹, a groundbreaking paradigm that leverages vectorized timestep adaptation (VTA) to enable fine-grained temporal control within a unified video diffusion framework. VTA is a non-destructive adaptation, meaning it fully preserves the capabilities of the base model. By finetuning the SOTA Wan2.1-T2V-14B model with VTA, we achieve unprecedented efficiency, surpassing the performance of Wan-I2V-14B with ≤ 1/200 of the training cost ($500 vs. ≥ $100,000) and ≤ 1/2500 of the dataset size (4K vs. ≥ 10M samples).
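
To make the core idea concrete, here is a minimal sketch (illustrative only; the tensor shapes and the commented-out denoiser call are placeholders, not Pusa's actual interface) contrasting a conventional scalar timestep with a vectorized one:

import torch

# Hypothetical latent video: batch x frames x channels x height x width.
B, F, C, H, W = 1, 16, 16, 60, 104
latents = torch.randn(B, F, C, H, W)

# Conventional scalar timestep: every frame shares one noise level,
# so all frames are forced to denoise in lockstep.
t_scalar = torch.full((B,), 900)

# Vectorized timestep (VTA): an independent noise level per frame,
# so individual frames can be kept clean, partially noised, or fully noised.
t_vector = torch.full((B, F), 900)

# noise_pred = denoiser(latents, t_scalar)   # rigid, synchronized evolution
# noise_pred = denoiser(latents, t_vector)   # fine-grained temporal control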

Pusa not only sets a new standard for image-to-video (I2V) generation but also unlocks many zero-shot multi-task capabilities such as start-end frames and video extension, all without task-specific training. Meanwhile, Pusa can still perform text-to-video generation. This work establishes a scalable, efficient, and versatile paradigm for next-generation video synthesis, democratizing high-fidelity video generation for research and industry alike.

¹Pusa (菩萨, /pu: 'sA:/) normally refers to the "Thousand-Hand Guanyin" in Chinese; the iconography of many hands symbolizes her boundless compassion and capability. We use this name to indicate that our model uses many timestep variables to achieve numerous video generation capabilities, and we will fully open source it so the community can benefit from this technology.

Performance Highlights

Pusa V1.0, with only 10 inference steps, achieves state-of-the-art performance among open-source models. It surpasses its direct baseline, Wan-I2V, which was trained with vastly greater resources. Our model obtains a VBench-I2V total score of 87.32%, outperforming Wan-I2V's 86.86%.

✨ Key Features

  • Comprehensive Multi-task Support:

    • Image-to-Video
    • Start-End Frames
    • Video Completion
    • Video Extension
    • Text-to-Video
    • Video Transition
    • And more...
  • Unprecedented Efficiency:

    • Fine-tuned from Wan-T2V using LoRA, requiring far fewer training resources.
    • Total training cost: $0.5K (at least 200x more efficient than the Wan-I2V baseline).
    • Dataset size: 4K samples (at least 2500x smaller than the Wan-I2V baseline).
    • Hardware: 8x 80GB GPUs with DeepSpeed ZeRO-2 (a minimal config sketch follows this list).
    • Note: Our method is also compatible with full fine-tuning as demonstrated in Pusa V0.5.
  • Complete Open-Source Release

    • Full codebase
    • Checkpoints
    • Training Dataset
    • Paper
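
For reference, a minimal ZeRO stage-2 configuration could look like the sketch below (written as a Python dict that would typically be saved as a JSON file for DeepSpeed; apart from the stage itself, every value is an assumption rather than our actual training setting):

# Minimal DeepSpeed ZeRO-2 config sketch; values other than the stage are assumptions.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # assumed per-GPU batch size
    "gradient_accumulation_steps": 4,      # assumed
    "gradient_clipping": 1.0,              # assumed
    "bf16": {"enabled": True},             # assumed mixed precision
    "zero_optimization": {
        "stage": 2,                        # ZeRO-2: shard optimizer states and gradients
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}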

🔍 Unique Architecture

  • Novel Diffusion Paradigm: Frame-level noise control with vectorized timesteps, originally introduced in the FVDM paper, enables unprecedented flexibility and scalability (see the sketch after this list).

  • Efficient & Non-destructive Adaptation: We perform lightweight fine-tuning on the SOTA open-source Wan-T2V model using the vectorized timestep adaptation (VTA) technique. This approach injects temporal dynamics while fully preserving the foundation model's generative priors and T2V capabilities.

  • Universal Applicability: Pusa V1.0 demonstrates the successful application of the FVDM framework to the state-of-the-art Wan-T2V model. The methodology can be readily applied to other leading video diffusion models. Collaborations are enthusiastically welcomed!
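
As a rough illustration of why this is so flexible (the helper below, its frame count, and its timestep values are hypothetical, not the repository's API), different tasks simply correspond to different per-frame timestep patterns, with 0 marking frames that are supplied as conditions:

import torch

NUM_FRAMES = 21   # illustrative number of latent frames
T_MAX = 1000      # illustrative maximum noise level

def task_timesteps(task: str) -> torch.Tensor:
    """Return one noise level per frame; 0 means 'keep this frame as given'."""
    t = torch.full((NUM_FRAMES,), T_MAX)
    if task == "image-to-video":
        t[0] = 0                     # condition on a clean first frame
    elif task == "start-end-frames":
        t[0], t[-1] = 0, 0           # condition on both endpoints
    elif task == "video-extension":
        t[: NUM_FRAMES // 2] = 0     # condition on an existing clip, generate the rest
    # "text-to-video": every frame stays fully noised, nothing is pinned down
    return t

for task in ("text-to-video", "image-to-video", "start-end-frames", "video-extension"):
    print(task, task_timesteps(task).tolist())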

Installation and Usage

Download Weights and Recover the Checkpoint

Option 1: Use the Hugging Face CLI:

# Make sure you are in the PusaV1 directory
# Install huggingface-cli if you don't have it
pip install -U "huggingface_hub[cli]"
huggingface-cli download RaphaelLiu/PusaV1 --local-dir ./model_zoo/PusaV1

# (Optional) Download Wan2.1-T2V-14B to ./model_zoo/PusaV1 if you don't have it; if you already have it, you can simply soft-link it into ./model_zoo/PusaV1
huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./model_zoo/PusaV1

Option 2: Download pusa_v1.pt or pusa_v1.safetensors directly from Hugging Face to your local machine.
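
Alternatively, a small Python sketch using the standard huggingface_hub API (the target directory below is an assumption chosen to mirror Option 1):

from huggingface_hub import hf_hub_download

# Fetch only the Pusa V1.0 weights file; pick the .pt or .safetensors variant as needed.
path = hf_hub_download(
    repo_id="RaphaelLiu/PusaV1",
    filename="pusa_v1.safetensors",   # or "pusa_v1.pt"
    local_dir="./model_zoo/PusaV1",   # assumed to match the layout from Option 1
)
print(path)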

Related Work

  • FVDM: Introduces the groundbreaking frame-level noise control via vectorized timesteps that inspired Pusa.
  • Wan-T2V: The base model for Pusa V1.0, a state-of-the-art open-source video generation model.
  • DiffSynth-Studio: We leverage its optimized LoRA implementation for efficient diffusion model training.

Citation

If you find our work useful in your research, please consider citing:

@article{liu2025pusa,
  title={PUSA V1.0: Surpassing Wan-I2V with \$500 Training Cost by Vectorized Timestep Adaptation},
  author={Liu, Yaofang and Ren, Yumeng and Artola, Aitor and Hu, Yuxuan and Cun, Xiaodong and Zhao, Xiaotong and Zhao, Alan and Chan, Raymond H and Zhang, Suiyun and Liu, Rui and others},
  journal={arXiv preprint arXiv:2507.16116},
  year={2025}
}
@article{liu2024redefining,
  title={Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach},
  author={Liu, Yaofang and Ren, Yumeng and Cun, Xiaodong and Artola, Aitor and Liu, Yang and Zeng, Tieyong and Chan, Raymond H and Morel, Jean-Michel},
  journal={arXiv preprint arXiv:2410.03160},
  year={2024}
}