# Pusa VidGen

[Code Repository](https://github.com/Yaofang-Liu/Pusa-VidGen) | [Model Hub](https://huggingface.co/RaphaelLiu/Pusa-V0.5) | [Training Toolkit](https://github.com/Yaofang-Liu/Mochi-Full-Finetuner) | [Dataset](https://huggingface.co/datasets/RaphaelLiu/PusaV0.5_Training) | [Pusa Paper](https://arxiv.org/abs/2507.16116) | [FVDM Paper](https://arxiv.org/abs/2410.03160) | [Follow on X](https://x.com/stephenajason) | [Xiaohongshu](https://www.xiaohongshu.com/user/profile/5c6f928f0000000010015ca1?xsec_token=YBEf_x-s5bOBQIMJuNQvJ6H23Anwey1nnDgC9wiLyDHPU=&xsec_source=app_share&xhsshare=CopyLink&appuid=5c6f928f0000000010015ca1&apptime=1752622393&share_id=60f9a8041f974cb7ac5e3f0f161bf748)

## Overview
Pusa introduces a paradigm shift in video diffusion modeling through frame-level noise control with vectorized timesteps, an approach first presented in our [FVDM](https://arxiv.org/abs/2410.03160) paper and a departure from the conventional single-timestep paradigm. Building on this architecture and our lightweight adaptations of the base model, Pusa seamlessly supports diverse video generation tasks (Text-to-Video, Image-to-Video, Video-to-Video) while maintaining strong motion fidelity and prompt adherence. Pusa-V0.5 is an early preview based on [Mochi1-Preview](https://huggingface.co/genmo/mochi-1-preview). We are open-sourcing this work to foster community collaboration, enhance methodologies, and expand capabilities.
## ✨ Key Features

- **Comprehensive Multi-task Support**:
  - Text-to-Video generation
  - Image-to-Video transformation
  - Frame interpolation
  - Video transitions
  - Seamless looping
  - Extended video generation
  - And more...
- **Unprecedented Efficiency**:
  - Trained with only 0.1k H800 GPU hours
  - Total training cost: $0.1k
  - Hardware: 16 H800 GPUs
  - Configuration: batch size 32, 500 training iterations, learning rate 1e-5 (summarized in the sketch after this list)
  - *Note: Efficiency can be further improved with single-node training and advanced parallelism techniques. Collaborations welcome!*
- **Complete Open-Source Release**:
  - Full codebase
  - Detailed architecture specifications
  - Comprehensive training methodology
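To put the reported setup in context, here is a small back-of-the-envelope sketch in Python. The variable names are illustrative only and are not arguments of the Mochi-Full-Finetuner toolkit; the numbers come from the figures listed above.

```python
# Hypothetical summary of the reported training setup (names are illustrative,
# not actual Mochi-Full-Finetuner arguments).
num_gpus = 16            # H800 GPUs
global_batch_size = 32   # samples per optimizer step
iterations = 500         # total training steps
learning_rate = 1e-5

per_gpu_batch_size = global_batch_size // num_gpus  # 2 samples per GPU per step
total_gpu_hours = 100                               # ~0.1k H800 GPU hours reported
wall_clock_hours = total_gpu_hours / num_gpus       # ~6.25 hours if all 16 GPUs run in parallel
print(per_gpu_batch_size, wall_clock_hours)
```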
## 🔍 Unique Architecture

- **Novel Diffusion Paradigm**: Implements frame-level noise control with vectorized timesteps, originally introduced in the [FVDM paper](https://arxiv.org/abs/2410.03160), enabling unprecedented flexibility and scalability (see the sketch after this list).
- **Non-destructive Modification**: Our adaptations preserve the base model's original Text-to-Video generation capabilities; after the adaptation, only light fine-tuning is needed.
- **Universal Applicability**: The methodology can be readily applied to other leading video diffusion models, including Hunyuan Video, Wan2.1, and others. *Collaborations enthusiastically welcomed!*
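To make the vectorized-timestep idea concrete, here is a minimal PyTorch sketch of per-frame noising. The function name, tensor shapes, and the DDPM-style linear schedule are illustrative assumptions, not the actual Pusa/Mochi implementation; the point is only that each frame carries its own timestep instead of the whole clip sharing one scalar.

```python
import torch

def add_noise_per_frame(latents, noise, timesteps, alphas_cumprod):
    """Noise each frame with its own timestep (hypothetical illustration).

    latents:        (B, F, C, H, W) clean video latents
    noise:          (B, F, C, H, W) Gaussian noise
    timesteps:      (B, F) integer timestep per frame -- a vector, not one scalar
    alphas_cumprod: (T,) cumulative product of the noise schedule
    """
    a = alphas_cumprod[timesteps]          # (B, F): one noise level per frame
    a = a.view(*a.shape, 1, 1, 1)          # broadcast over C, H, W
    return a.sqrt() * latents + (1.0 - a).sqrt() * noise

# Example: keep the first frame nearly clean as a condition (image-to-video style),
# while the remaining frames are fully noised.
B, F, C, H, W, T = 1, 8, 4, 30, 53, 1000
latents = torch.randn(B, F, C, H, W)
noise = torch.randn_like(latents)
timesteps = torch.full((B, F), T - 1, dtype=torch.long)
timesteps[:, 0] = 0                        # condition frame gets timestep 0
alphas_cumprod = torch.linspace(0.9999, 1e-4, T)
noisy = add_noise_per_frame(latents, noise, timesteps, alphas_cumprod)
```

Because conditioning is expressed purely through the per-frame timesteps, this single mechanism can cover tasks such as image-to-video, interpolation, transitions, and extension without changing the model itself.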
## Installation and Usage

### Download Weights

**Option 1**: Use the Hugging Face CLI:
```bash
pip install huggingface_hub
huggingface-cli download RaphaelLiu/Pusa-V0.5 --local-dir <path_to_downloaded_directory>
```

**Option 2**: Download directly from [Hugging Face](https://huggingface.co/RaphaelLiu/Pusa-V0.5) to your local machine.
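Alternatively, the weights can be fetched programmatically with `huggingface_hub.snapshot_download`; the local directory below is a placeholder to replace with your own path.

```python
from huggingface_hub import snapshot_download

# Download the Pusa-V0.5 checkpoint into a local directory (placeholder path).
local_dir = snapshot_download(
    repo_id="RaphaelLiu/Pusa-V0.5",
    local_dir="./Pusa-V0.5",
)
print(f"Weights downloaded to: {local_dir}")
```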
## Limitations

Pusa currently has several known limitations:
- The base Mochi model generates videos at a relatively low resolution (480p).
- We anticipate significant quality improvements when applying our methodology to more advanced models such as Wan2.1.
- We welcome community contributions to enhance model performance and extend its capabilities.
## Related Work

- [FVDM](https://arxiv.org/abs/2410.03160): Introduces the frame-level noise control and vectorized-timestep approach that inspired Pusa.
- [Mochi](https://huggingface.co/genmo/mochi-1-preview): Our foundation model, recognized as a leading open-source video generation system on the Artificial Analysis leaderboard.

## Citation

If you find our work useful in your research, please consider citing:
```bibtex
@misc{Liu2025pusa,
  title={Pusa: Thousands Timesteps Video Diffusion Model},
  author={Yaofang Liu and Rui Liu},
  year={2025},
  url={https://github.com/Yaofang-Liu/Pusa-VidGen},
}
```

```bibtex
@article{liu2024redefining,
  title={Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach},
  author={Liu, Yaofang and Ren, Yumeng and Cun, Xiaodong and Artola, Aitor and Liu, Yang and Zeng, Tieyong and Chan, Raymond H and Morel, Jean-michel},
  journal={arXiv preprint arXiv:2410.03160},
  year={2024}
}
```