# Phantom: Subject-Consistent Video Generation via Cross-Modal Alignment

<div align="center">

[![arXiv](https://img.shields.io/badge/arXiv%20paper-2502.11079-b31b1b.svg)](https://arxiv.org/abs/2502.11079)&nbsp;
[![project page](https://img.shields.io/badge/Project_page-More_visualizations-green)](https://phantom-video.github.io/Phantom/)&nbsp;

</div>

> [**Phantom: Subject-Consistent Video Generation via Cross-Modal Alignment**](https://arxiv.org/abs/2502.11079)<br>
> [Lijie Liu](https://liulj13.github.io/)<sup> * </sup>, [Tianxiang Ma](https://tianxiangma.github.io/)<sup> * </sup>, [Bingchuan Li](https://scholar.google.com/citations?user=ac5Se6QAAAAJ)<sup> * &dagger;</sup>, [Zhuowei Chen](https://scholar.google.com/citations?user=ow1jGJkAAAAJ)<sup> * </sup>, [Jiawei Liu](https://scholar.google.com/citations?user=X21Fz-EAAAAJ), Gen Li, Siyu Zhou, [Qian He](https://scholar.google.com/citations?user=9rWWCgUAAAAJ), Xinglong Wu
> <br><sup> * </sup>Equal contribution,<sup> &dagger; </sup>Project lead
> <br>Intelligent Creation Team, ByteDance<br>

<p align="center">
  <img src="https://github.com/Phantom-video/Phantom/blob/main/assets/teaser.png" width=95%>
</p>

## 🔥 Latest News!
* Apr 20, 2025: 👋 Phantom-Wan is here! We have adapted the Phantom framework to the [Wan2.1](https://github.com/Wan-Video/Wan2.1) video generation model. The inference code and checkpoint have been released.

## 📑 Todo List
- [x] Inference code and checkpoint of Phantom-Wan 1.3B
- [ ] Checkpoint of Phantom-Wan 14B
- [ ] Training code of Phantom-Wan

## 📖 Overview
Phantom is a unified video generation framework for single- and multi-subject references, built on existing text-to-video and image-to-video architectures. It achieves cross-modal alignment by training on text-image-video triplet data with a redesigned joint text-image injection model. It also emphasizes subject consistency in human generation and enhances ID-preserving video generation.

## ⚡️ Quickstart

### Installation
Clone the repo:
```sh
git clone https://github.com/Phantom-video/Phantom.git
cd Phantom
```

Install dependencies:
```sh
# Ensure torch >= 2.4.0
pip install -r requirements.txt
```
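
The requirements above assume `torch >= 2.4.0`. A quick shell-side gate on that, sketched with a small `sort -V` helper (the helper name and the pip-based version lookup are illustrative, not part of the repo):

```shell
# Illustrative helper: succeeds (exit 0) when version $1 >= version $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Read the installed torch version from pip (empty if torch is not installed).
TORCH_VERSION="$(pip show torch 2>/dev/null | awk '/^Version:/ {print $2}')"
if version_ge "${TORCH_VERSION:-0}" "2.4.0"; then
  echo "torch ${TORCH_VERSION} is new enough"
else
  echo "torch '${TORCH_VERSION}' is missing or older than 2.4.0" >&2
fi
```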

### Model Download
First, download the original Wan2.1 1.3B model using huggingface-cli:
```sh
pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./Wan2.1-T2V-1.3B
```
Then download the Phantom-Wan-1.3B model:
```sh
huggingface-cli download xxx --local-dir ./Phantom-Wan-1.3B
```
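
Before running inference, it can help to confirm the downloads landed where the later commands expect them. A sketch, assuming the `--local-dir` values above were used as-is (the `check_exists` helper is illustrative):

```shell
# Illustrative helper: warn about each missing path; return the number missing.
check_exists() {
  missing=0
  for path in "$@"; do
    [ -e "$path" ] || { echo "missing: $path" >&2; missing=$((missing + 1)); }
  done
  return "$missing"
}

check_exists ./Wan2.1-T2V-1.3B \
             ./Phantom-Wan-1.3B/Phantom-Wan-1.3B.pth \
  || echo "download incomplete; re-run the huggingface-cli commands above" >&2
```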

### Run Subject-to-Video Generation

- Single-GPU inference

```sh
# Prompt (English gloss): Warm sunlight spills over the grass. A little girl with twin ponytails, a green bow in her hair,
# and a light-green dress crouches beside blooming daisies. Beside her, a brown-and-white dog pants with its tongue out,
# its fluffy tail wagging happily. Smiling, the girl raises a yellow-and-red toy camera with blue buttons,
# freezing the joyful moment with the dog.
python generate.py --task s2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B --phantom_ckpt ./Phantom-Wan-1.3B/Phantom-Wan-1.3B.pth --ref_image "examples/ref1.png,examples/ref2.png" --prompt "暖阳漫过草地，扎着双马尾、头戴绿色蝴蝶结、身穿浅绿色连衣裙的小女孩蹲在盛开的雏菊旁。她身旁一只棕白相间的狗狗吐着舌头，毛茸茸尾巴欢快摇晃。小女孩笑着举起黄红配色、带有蓝色按钮的玩具相机，将和狗狗的欢乐瞬间定格。" --base_seed 42
```

- Multi-GPU inference using FSDP + xDiT USP

```sh
pip install "xfuser>=0.4.1"
# Prompt (English gloss): At sunset, a woman with wheat-colored skin and long black hair puts on a red gauze dress
# decorated with large sculpted flowers and flowing ribbons at the shoulders, and strolls along a golden beach
# as the sea breeze brushes her hair; the scene is beautiful and moving.
torchrun --nproc_per_node=8 generate.py --task s2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B --phantom_ckpt ./Phantom-Wan-1.3B/Phantom-Wan-1.3B.pth --ref_image "examples/ref3.png,examples/ref4.png" --dit_fsdp --t5_fsdp --ulysses_size 4 --ring_size 2 --prompt "夕阳下，一位有着小麦色肌肤、留着乌黑长发的女人穿上有着大朵立体花朵装饰、肩袖处带有飘逸纱带的红色纱裙，漫步在金色的海滩上，海风轻拂她的长发，画面唯美动人。" --base_seed 42
```
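
In the command above, the 8 processes are split as `--ulysses_size 4` × `--ring_size 2`. With xDiT USP the product of the two sizes should equal the total number of processes; a minimal guard, with the values mirroring the example:

```shell
# Sketch: verify ulysses_size * ring_size matches nproc_per_node before launching.
NPROC_PER_NODE=8
ULYSSES_SIZE=4
RING_SIZE=2
if [ $((ULYSSES_SIZE * RING_SIZE)) -ne "$NPROC_PER_NODE" ]; then
  echo "ulysses_size * ring_size must equal nproc_per_node" >&2
fi
```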

> 💡Note:
> * Pass one image to `--ref_image` for single-reference Subject-to-Video generation, or several comma-separated images for multi-reference generation. At most four reference images are supported.
> * For the best results, describe the visual content of each reference image as accurately as possible in `--prompt`. For example, "examples/ref1.png" can be described as "a toy camera in yellow and red with blue buttons".
> * If a generated video is unsatisfactory, the most straightforward fix is to try a different `--base_seed` or to revise the description in `--prompt`.
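
Following the note above about retrying seeds, a dry-run sketch that prints one single-GPU command per candidate seed (the seed list is arbitrary; remove the leading `echo` to actually generate):

```shell
# Dry run: print a generate.py command for each candidate seed.
SEEDS="42 123 2025"
for seed in $SEEDS; do
  echo python generate.py --task s2v-1.3B --size "832*480" \
    --ckpt_dir ./Wan2.1-T2V-1.3B \
    --phantom_ckpt ./Phantom-Wan-1.3B/Phantom-Wan-1.3B.pth \
    --ref_image "examples/ref1.png,examples/ref2.png" \
    --prompt "a toy camera in yellow and red with blue buttons" \
    --base_seed "$seed"
done
```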

For inference examples, see `infer.sh`. You will get the following generated results:

<table>
  <tr>
    <td><img src="https://github.com/Phantom-video/Phantom/blob/main/examples/ref_results/result1.gif" alt="GIF 1" width="200"></td>
    <td><img src="https://github.com/Phantom-video/Phantom/blob/main/examples/ref_results/result2.gif" alt="GIF 2" width="200"></td>
  </tr>
  <tr>
    <td><img src="https://github.com/Phantom-video/Phantom/blob/main/examples/ref_results/result3.gif" alt="GIF 3" width="200"></td>
    <td><img src="https://github.com/Phantom-video/Phantom/blob/main/examples/ref_results/result4.gif" alt="GIF 4" width="200"></td>
  </tr>
</table>
89
+
90
+ ## 🆚 Comparative Results
91
+ - **Identity Preserving Video Generation**.
92
+ ![image](https://github.com/Phantom-video/Phantom/blob/main/assets/images/id_eval.png)
93
+ - **Single Reference Subject-to-Video Generation**.
94
+ ![image](https://github.com/Phantom-video/Phantom/blob/main/assets/images/ip_eval_s.png)
95
+ - **Multi-Reference Subject-to-Video Generation**.
96
+ ![image](https://github.com/Phantom-video/Phantom/blob/main/assets/images/ip_eval_m_00.png)

## Acknowledgements
We would like to express our gratitude to the SEED team for their support. Special thanks to Lu Jiang, Haoyuan Guo, Zhibei Ma, and Sen Wang for their assistance with the model and data, and to Siying Chen, Qingyang Li, and Wei Han for their help with the evaluation.

## BibTeX
```bibtex
@article{liu2025phantom,
  title={Phantom: Subject-Consistent Video Generation via Cross-Modal Alignment},
  author={Liu, Lijie and Ma, Tianxiang and Li, Bingchuan and Chen, Zhuowei and Liu, Jiawei and He, Qian and Wu, Xinglong},
  journal={arXiv preprint arXiv:2502.11079},
  year={2025}
}
```