arxiv:2507.16116

PUSA V1.0: Surpassing Wan-I2V with $500 Training Cost by Vectorized Timestep Adaptation

Published on Jul 22

Submitted by RaphaelLiu on Jul 24

Authors:

Yaofang Liu ,

Abstract

Pusa, a vectorized timestep adaptation approach, enhances video diffusion models for efficient and versatile video generation, improving performance and reducing costs.

AI-generated summary

The rapid advancement of video diffusion models has been hindered by fundamental limitations in temporal modeling, particularly the rigid synchronization of frame evolution imposed by conventional scalar timestep variables. While task-specific adaptations and autoregressive models have sought to address these challenges, they remain constrained by computational inefficiency, catastrophic forgetting, or narrow applicability. In this work, we present Pusa, a groundbreaking paradigm that leverages vectorized timestep adaptation (VTA) to enable fine-grained temporal control within a unified video diffusion framework. Besides, VTA is a non-destructive adaptation, which means it fully preserves the capabilities of the base model. By finetuning the SOTA Wan2.1-T2V-14B model with VTA, we achieve unprecedented efficiency -- surpassing the performance of Wan-I2V-14B with ≤ 1/200 of the training cost ($500 vs. ≥ $100,000) and ≤ 1/2500 of the dataset size (4K vs. ≥ 10M samples). Pusa not only sets a new standard for image-to-video (I2V) generation, achieving a VBench-I2V total score of 87.32% (vs. 86.86% of Wan-I2V-14B), but also unlocks many zero-shot multi-task capabilities such as start-end frames and video extension -- all without task-specific training. Meanwhile, Pusa can still perform text-to-video generation. Mechanistic analyses reveal that our approach preserves the foundation model's generative priors while surgically injecting temporal dynamics, avoiding the combinatorial explosion inherent to vectorized timesteps. This work establishes a scalable, efficient, and versatile paradigm for next-generation video synthesis, democratizing high-fidelity video generation for research and industry alike. Code is open-sourced at https://github.com/Yaofang-Liu/Pusa-VidGen
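To make the contrast between a scalar timestep and a vectorized timestep concrete, here is a minimal, illustrative sketch in plain Python. It is not taken from the Pusa codebase. It assumes a linear interpolation between a clean frame and Gaussian noise, with timesteps in [0, 1], and the function names (`noise_frames`, `scalar_timesteps`, `image_conditioned_timesteps`) are hypothetical. The point is only that each frame can carry its own noise level, so a clean conditioning frame can sit alongside fully noised frames.

```python
import random


def noise_frames(frames, timesteps, rng):
    """Blend each frame with Gaussian noise using that frame's own timestep.

    Assumed convention (not from the paper): t = 0 keeps the frame clean,
    t = 1 replaces it with pure noise, via x_t = (1 - t) * x + t * eps.
    """
    if len(frames) != len(timesteps):
        raise ValueError("need exactly one timestep per frame")
    noisy = []
    for frame, t in zip(frames, timesteps):
        noisy.append([(1.0 - t) * x + t * rng.gauss(0.0, 1.0) for x in frame])
    return noisy


def scalar_timesteps(num_frames, t):
    """Conventional setup: every frame shares the same timestep."""
    return [t] * num_frames


def image_conditioned_timesteps(num_frames, t):
    """Hypothetical I2V-style vector: first frame stays clean, the rest are noised."""
    return [0.0] + [t] * (num_frames - 1)


if __name__ == "__main__":
    rng = random.Random(0)
    # Three toy "frames" with two values each stand in for video latents.
    frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(noise_frames(frames, scalar_timesteps(3, 0.5), rng))
    print(noise_frames(frames, image_conditioned_timesteps(3, 1.0), rng))
```

Under these assumptions, other timestep vectors would express the other tasks the abstract mentions, for example keeping both the first and last frames at 0 for start-end frame conditioning.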

Community

RaphaelLiu

Paper author Paper submitter 4 days ago

Code: https://github.com/Yaofang-Liu/Pusa-VidGen
Project Page: https://yaofang-liu.github.io/Pusa_Web/
Model: https://huggingface.co/RaphaelLiu/PusaV1
Dataset: https://huggingface.co/datasets/RaphaelLiu/PusaV1_training
