# Pusa VidGen
[Code Repository](https://github.com/Yaofang-Liu/Pusa-VidGen) | [Model Hub](https://huggingface.co/RaphaelLiu/Pusa-V0.5) | [Training Toolkit](https://github.com/Yaofang-Liu/Mochi-Full-Finetuner) | [Dataset](https://huggingface.co/datasets/RaphaelLiu/PusaV0.5_Training) | [Pusa Paper](https://arxiv.org/abs/2507.16116) | [FVDM Paper](https://arxiv.org/abs/2410.03160) | [Follow on X](https://x.com/stephenajason) | [Xiaohongshu](https://www.xiaohongshu.com/user/profile/5c6f928f0000000010015ca1?xsec_token=YBEf_x-s5bOBQIMJuNQvJ6H23Anwey1nnDgC9wiLyDHPU=&xsec_source=app_share&xhsshare=CopyLink&appuid=5c6f928f0000000010015ca1&apptime=1752622393&share_id=60f9a8041f974cb7ac5e3f0f161bf748)
## Overview
Pusa introduces a paradigm shift in video diffusion modeling through frame-level noise control with vectorized timesteps, an approach first presented in our [FVDM](https://arxiv.org/abs/2410.03160) paper. Building on this architecture, Pusa supports diverse video generation tasks (Text/Image/Video-to-Video) while maintaining strong motion fidelity and prompt adherence through lightweight adaptations of the base model. Pusa-V0.5 is an early preview built on [Mochi1-Preview](https://huggingface.co/genmo/mochi-1-preview). We are open-sourcing this work to foster community collaboration, improve the methodology, and expand its capabilities.
## ✨ Key Features
- **Comprehensive Multi-task Support**:
  - Text-to-Video generation
  - Image-to-Video transformation
  - Frame interpolation
  - Video transitions
  - Seamless looping
  - Extended video generation
  - And more...
- **Unprecedented Efficiency**:
  - Trained with only 0.1k (≈100) H800 GPU hours
  - Total training cost: $0.1k
  - Hardware: 16 H800 GPUs
  - Configuration: batch size 32, 500 training iterations, learning rate 1e-5 (summarized in the sketch after this list)
  - *Note: Efficiency can be further improved with single-node training and advanced parallelism techniques. Collaborations welcome!*
- **Complete Open-Source Release**:
  - Full codebase
  - Detailed architecture specifications
  - Comprehensive training methodology
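
For reference, here is a minimal sketch of how the reported training settings could be written down in code. The field names are hypothetical and are not taken from the [Training Toolkit](https://github.com/Yaofang-Liu/Mochi-Full-Finetuner); only the values come from the list above.

```python
# Hypothetical summary of the reported Pusa-V0.5 training run (illustrative only).
training_config = {
    "num_gpus": 16,          # H800 GPUs
    "batch_size": 32,        # global batch size
    "train_iterations": 500,
    "learning_rate": 1e-5,
}

# ~0.1k GPU hours spread over 16 GPUs works out to roughly 6 hours of wall-clock training.
wall_clock_hours = 100 / training_config["num_gpus"]
print(f"Approximate wall-clock training time: {wall_clock_hours:.1f} hours")
```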
## 🔍 Unique Architecture
- **Novel Diffusion Paradigm**: Implements frame-level noise control with vectorized timesteps, originally introduced in the [FVDM paper](https://arxiv.org/abs/2410.03160), enabling unprecedented flexibility and scalability (see the sketch after this list).
- **Non-destructive Modification**: Our adaptations preserve the base model's original Text-to-Video generation capabilities; after the adaptation, only light fine-tuning is required.
- **Universal Applicability**: The methodology can be readily applied to other leading video diffusion models including Hunyuan Video, Wan2.1, and others. *Collaborations enthusiastically welcomed!*
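
To make the idea concrete, the sketch below contrasts a conventional scalar timestep with a per-frame vectorized timestep. This is an illustration of the FVDM concept, not Pusa's actual implementation; the tensor shapes, the toy noise schedule, and the `denoiser` interface are all hypothetical.

```python
import torch

# Hypothetical latent video batch: (batch, frames, channels, height, width)
latents = torch.randn(1, 16, 4, 60, 106)
noise = torch.randn_like(latents)
num_train_steps = 1000

# Conventional diffusion: one shared timestep for the whole clip.
t_scalar = torch.randint(0, num_train_steps, (1,))

# Frame-level (vectorized) timesteps: each frame gets its own noise level.
t_vector = torch.randint(0, num_train_steps, (1, 16))

# Example: condition on the first frame (image-to-video) by keeping it clean
# while the remaining frames stay noised.
t_vector[:, 0] = 0

# Per-frame noising with a toy linear schedule (illustrative only).
alpha = 1.0 - t_vector.float() / num_train_steps   # (1, 16)
alpha = alpha.view(1, 16, 1, 1, 1)                 # broadcast over C, H, W
noisy_latents = alpha.sqrt() * latents + (1.0 - alpha).sqrt() * noise

# A frame-level model then receives the full timestep vector:
# pred = denoiser(noisy_latents, t_vector, text_embeddings)  # hypothetical API
```

In principle, tasks such as frame interpolation, video transitions, and extended generation can be set up the same way: choose which frames are kept near-clean and which are noised.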
## Installation and Usage
### Download Weights
**Option 1**: Use the Hugging Face CLI:
```bash
pip install huggingface_hub
huggingface-cli download RaphaelLiu/Pusa-V0.5 --local-dir <path_to_downloaded_directory>
```
**Option 2**: Download directly from [Hugging Face](https://huggingface.co/RaphaelLiu/Pusa-V0.5) to your local machine.
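
Alternatively, you can fetch the weights from Python with `huggingface_hub.snapshot_download`, which mirrors the CLI command above; the local directory below is a placeholder.

```python
from huggingface_hub import snapshot_download

# Download the full Pusa-V0.5 repository snapshot to a local directory.
local_path = snapshot_download(
    repo_id="RaphaelLiu/Pusa-V0.5",
    local_dir="<path_to_downloaded_directory>",  # replace with your target path
)
print(f"Weights downloaded to: {local_path}")
```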
## Limitations
Pusa currently has several known limitations:
- The base Mochi model generates videos at relatively low resolution (480p)
- We anticipate significant quality improvements when applying our methodology to more advanced models like Wan2.1
- We welcome community contributions to enhance model performance and extend its capabilities
## Related Work
- [FVDM](https://arxiv.org/abs/2410.03160): Introduces the groundbreaking frame-level noise control with vectorized timesteps that inspired Pusa.
- [Mochi](https://huggingface.co/genmo/mochi-1-preview): Our foundation model, recognized as a leading open-source video generation system on the Artificial Analysis Leaderboard.
## Citation
If you find our work useful in your research, please consider citing:
```
@misc{Liu2025pusa,
  title={Pusa: Thousands Timesteps Video Diffusion Model},
  author={Yaofang Liu and Rui Liu},
  year={2025},
  url={https://github.com/Yaofang-Liu/Pusa-VidGen},
}
```
```
@article{liu2024redefining,
  title={Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach},
  author={Liu, Yaofang and Ren, Yumeng and Cun, Xiaodong and Artola, Aitor and Liu, Yang and Zeng, Tieyong and Chan, Raymond H and Morel, Jean-michel},
  journal={arXiv preprint arXiv:2410.03160},
  year={2024}
}
```