# Pusa VidGen

[Code Repository](https://github.com/Yaofang-Liu/Pusa-VidGen) | [Model Hub](https://huggingface.co/RaphaelLiu/Pusa-V0.5) | [Training Toolkit](https://github.com/Yaofang-Liu/Mochi-Full-Finetuner) | [Dataset](https://huggingface.co/datasets/RaphaelLiu/PusaV0.5_Training) | [Pusa Paper](https://arxiv.org/abs/2507.16116) | [FVDM Paper](https://arxiv.org/abs/2410.03160) | [Follow on X](https://x.com/stephenajason) | [Xiaohongshu](https://www.xiaohongshu.com/user/profile/5c6f928f0000000010015ca1?xsec_token=YBEf_x-s5bOBQIMJuNQvJ6H23Anwey1nnDgC9wiLyDHPU=&xsec_source=app_share&xhsshare=CopyLink&appuid=5c6f928f0000000010015ca1&apptime=1752622393&share_id=60f9a8041f974cb7ac5e3f0f161bf748)

## Overview
Pusa introduces a paradigm shift in video diffusion modeling through frame-level noise control with vectorized timesteps, an approach first presented in our [FVDM](https://arxiv.org/abs/2410.03160) paper and a departure from the conventional single-timestep paradigm. Building on this architecture and our lightweight adaptations of the base model, Pusa seamlessly supports diverse video generation tasks (Text-to-Video, Image-to-Video, Video-to-Video) while maintaining strong motion fidelity and prompt adherence. Pusa-V0.5 is an early preview based on [Mochi1-Preview](https://huggingface.co/genmo/mochi-1-preview). We are open-sourcing this work to foster community collaboration, enhance methodologies, and expand capabilities.
## ✨ Key Features

- **Comprehensive Multi-task Support**:
  - Text-to-Video generation
  - Image-to-Video transformation
  - Frame interpolation
  - Video transitions
  - Seamless looping
  - Extended video generation
  - And more...
- **Unprecedented Efficiency**:
  - Trained with only 0.1k H800 GPU hours
  - Total training cost: $0.1k
  - Hardware: 16 H800 GPUs
  - Configuration: batch size 32, 500 training iterations, learning rate 1e-5 (summarized in the sketch after this list)
  - *Note: Efficiency can be further improved with single-node training and advanced parallelism techniques. Collaborations welcome!*
- **Complete Open-Source Release**:
  - Full codebase
  - Detailed architecture specifications
  - Comprehensive training methodology
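To put the reported setup in context, here is a small back-of-the-envelope sketch in Python. The variable names are illustrative only and are not arguments of the Mochi-Full-Finetuner toolkit; the numbers come from the figures listed above.

```python
# Hypothetical summary of the reported training setup (names are illustrative,
# not actual Mochi-Full-Finetuner arguments).
num_gpus = 16            # H800 GPUs
global_batch_size = 32   # samples per optimizer step
iterations = 500         # total training steps
learning_rate = 1e-5

per_gpu_batch_size = global_batch_size // num_gpus  # 2 samples per GPU per step
total_gpu_hours = 100                               # ~0.1k H800 GPU hours reported
wall_clock_hours = total_gpu_hours / num_gpus       # ~6.25 hours if all 16 GPUs run in parallel
print(per_gpu_batch_size, wall_clock_hours)
```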
## 🔍 Unique Architecture

- **Novel Diffusion Paradigm**: Implements frame-level noise control with vectorized timesteps, originally introduced in the [FVDM paper](https://arxiv.org/abs/2410.03160), enabling unprecedented flexibility and scalability (see the sketch after this list).
- **Non-destructive Modification**: Our adaptations preserve the base model's original Text-to-Video generation capabilities; after the adaptation, only light fine-tuning is needed.
- **Universal Applicability**: The methodology can be readily applied to other leading video diffusion models, including Hunyuan Video, Wan2.1, and others. *Collaborations enthusiastically welcomed!*
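To make the vectorized-timestep idea concrete, here is a minimal PyTorch sketch of per-frame noising. The function name, tensor shapes, and the DDPM-style linear schedule are illustrative assumptions, not the actual Pusa/Mochi implementation; the point is only that each frame carries its own timestep instead of the whole clip sharing one scalar.

```python
import torch

def add_noise_per_frame(latents, noise, timesteps, alphas_cumprod):
    """Noise each frame with its own timestep (hypothetical illustration).

    latents:        (B, F, C, H, W) clean video latents
    noise:          (B, F, C, H, W) Gaussian noise
    timesteps:      (B, F) integer timestep per frame -- a vector, not one scalar
    alphas_cumprod: (T,) cumulative product of the noise schedule
    """
    a = alphas_cumprod[timesteps]          # (B, F): one noise level per frame
    a = a.view(*a.shape, 1, 1, 1)          # broadcast over C, H, W
    return a.sqrt() * latents + (1.0 - a).sqrt() * noise

# Example: keep the first frame nearly clean as a condition (image-to-video style),
# while the remaining frames are fully noised.
B, F, C, H, W, T = 1, 8, 4, 30, 53, 1000
latents = torch.randn(B, F, C, H, W)
noise = torch.randn_like(latents)
timesteps = torch.full((B, F), T - 1, dtype=torch.long)
timesteps[:, 0] = 0                        # condition frame gets timestep 0
alphas_cumprod = torch.linspace(0.9999, 1e-4, T)
noisy = add_noise_per_frame(latents, noise, timesteps, alphas_cumprod)
```

Because conditioning is expressed purely through the per-frame timesteps, this single mechanism can cover tasks such as image-to-video, interpolation, transitions, and extension without changing the model itself.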
## Installation and Usage

### Download Weights

**Option 1**: Use the Hugging Face CLI:
```bash
pip install huggingface_hub
huggingface-cli download RaphaelLiu/Pusa-V0.5 --local-dir <path_to_downloaded_directory>
```

**Option 2**: Download directly from [Hugging Face](https://huggingface.co/RaphaelLiu/Pusa-V0.5) to your local machine.
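Alternatively, the weights can be fetched programmatically with `huggingface_hub.snapshot_download`; the local directory below is a placeholder to replace with your own path.

```python
from huggingface_hub import snapshot_download

# Download the Pusa-V0.5 checkpoint into a local directory (placeholder path).
local_dir = snapshot_download(
    repo_id="RaphaelLiu/Pusa-V0.5",
    local_dir="./Pusa-V0.5",
)
print(f"Weights downloaded to: {local_dir}")
```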
## Limitations

Pusa currently has several known limitations:
- The base Mochi model generates videos at a relatively low resolution (480p).
- We anticipate significant quality improvements when applying our methodology to more advanced models such as Wan2.1.
- We welcome community contributions to enhance model performance and extend its capabilities.
## Related Work

- [FVDM](https://arxiv.org/abs/2410.03160): Introduces the frame-level noise control and vectorized-timestep approach that inspired Pusa.
- [Mochi](https://huggingface.co/genmo/mochi-1-preview): Our foundation model, recognized as a leading open-source video generation system on the Artificial Analysis leaderboard.

## Citation

If you find our work useful in your research, please consider citing:
```bibtex
@misc{Liu2025pusa,
  title={Pusa: Thousands Timesteps Video Diffusion Model},
  author={Yaofang Liu and Rui Liu},
  year={2025},
  url={https://github.com/Yaofang-Liu/Pusa-VidGen},
}
```

```bibtex
@article{liu2024redefining,
  title={Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach},
  author={Liu, Yaofang and Ren, Yumeng and Cun, Xiaodong and Artola, Aitor and Liu, Yang and Zeng, Tieyong and Chan, Raymond H and Morel, Jean-michel},
  journal={arXiv preprint arXiv:2410.03160},
  year={2024}
}
```