Improve model card
This PR improves the model card by adding a link to the paper, specifying the correct library name, and adding more details about the model from the paper abstract. It also restructures the examples section to be more user-friendly.
README.md
CHANGED
---
base_model:
- VideoCrafter/VideoCrafter2
datasets:
- nkp37/OpenVid-1M
- TempoFunk/webvid-10M
license: gpl-3.0
pipeline_tag: text-to-video
library_name: diffusers
---

# Advanced text-to-video Diffusion Models

This repository contains the model from the paper [AMD-Hummingbird: Towards an Efficient Text-to-Video Model](https://huggingface.co/papers/2503.18559). Hummingbird is a lightweight text-to-video (T2V) framework that prunes existing models (such as VideoCrafter2) and enhances visual quality through visual feedback learning. It aims to improve the efficiency of T2V generation, making it more suitable for deployment on resource-limited devices while preserving high-quality video generation.

## Table of Contents
- [Advanced text-to-video Diffusion Models](#advanced-text-to-video-diffusion-models)
  - [Key Features](#key-features)
  - [8-Steps Results](#8-steps-results)
  - [Checkpoint](#checkpoint)
  - [Installation](#installation)
    - [conda](#conda)
    - [docker](#docker)
  - [Data Processing](#data-processing)
    - [VQA](#vqa)
    - [Remove Dolly Zoom Videos](#remove-dolly-zoom-videos)
  - [Training](#training)
    - [Model Distillation](#model-distillation)
    - [Acceleration Training](#acceleration-training)
  - [Inference](#inference)
  - [License](#license)

## Key Features

⚡️ This repository provides training recipes for the AMD efficient text-to-video models, which are designed for high performance and efficiency. The training process includes two key steps:

1. Model distillation, which prunes and distills VideoCrafter2 into compact 0.7B and 0.9B models (see [Model Distillation](#model-distillation)).
2. Acceleration training with visual feedback learning, which restores visual quality at low inference step counts (see [Acceleration Training](#acceleration-training)).

This implementation is released to promote further research and innovation in the field of efficient text-to-video generation, optimized for AMD Instinct accelerators.

![VBench performance](GIFs/vbench.png)

## 8-Steps Results

| Prompt | Generated Video | Prompt | Generated Video |
|--------|-----------------|--------|-----------------|
| A cute happy Corgi playing in park, sunset, pixel. |  | A cute happy Corgi playing in park, sunset, animated style. |  |
| A quiet beach at dawn and the waves gently lapping. |  | A cute teddy bear, dressed in a red silk outfit, stands in a vibrant street, Chinese New Year. |  |
| A cat DJ at a party. |  | A 3D model of a 1800s victorian house. |  |
| A cute raccoon playing guitar in the beach. |  | A cute raccoon playing guitar in the forest. |  |
| A sandcastle being eroded by the incoming tide. |  | An astronaut flying in space, in cyberpunk style. |  |
| A drone flying over a snowy forest. |  | A ghost ship navigating through a sea under a moon. |  |

## Checkpoint

Our pretrained checkpoints can be downloaded from [Hugging Face](https://huggingface.co/amd/AMD-Hummingbird-T2V/tree/main).
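
If you prefer to fetch the checkpoints programmatically, the snippet below is a minimal sketch using the `huggingface_hub` library; the `local_dir` path is only an example and can be changed to wherever your inference configs expect the weights.

```python
# Download the Hummingbird T2V checkpoints with huggingface_hub.
# `local_dir` is an example path, not one required by the training/inference scripts.
from huggingface_hub import snapshot_download

ckpt_dir = snapshot_download(
    repo_id="amd/AMD-Hummingbird-T2V",
    local_dir="checkpoints/AMD-Hummingbird-T2V",  # adjust to your setup
)
print(f"Checkpoints downloaded to: {ckpt_dir}")
```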

## Installation

We train both the 0.9B and 0.7B T2V models on AMD Instinct MI250 accelerators and evaluate them on MI250, MI300, Radeon RX 7900 XT, and Radeon 880M (Ryzen AI 9 365) under Ubuntu (kernel 6.8.0-51-generic).

### conda
```
conda create -n AMD_Hummingbird python=3.10
conda activate AMD_Hummingbird
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/rocm6.1
pip install -r requirements.txt
```
For ROCm flash-attn, you can install it from this [repository](https://github.com/ROCm/flash-attention):
```
git clone https://github.com/ROCm/flash-attention.git
cd flash-attention
python setup.py install
```
It will take about 1.5 hours to install.
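
Before moving on, it can help to confirm that the ROCm build of PyTorch actually sees your GPU. This is a small, optional sanity check, not part of the official setup:

```python
# Optional sanity check for the ROCm PyTorch install.
import torch

print("PyTorch version:", torch.__version__)
# ROCm wheels still expose GPUs through the torch.cuda API.
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```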

### docker
First, use `docker pull` to download the image:
```
docker pull rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
```
Second, use `docker run` to start a container from the image, for example:
```
docker run \
  -v "$(pwd):/workspace" \
  --device=/dev/kfd \
  --device=/dev/dri \
  -it \
  --network=host \
  --name hummingbird \
  rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
```
Once you are in the container, use `pip` to install the remaining dependencies:
```
pip install -r requirements.txt
```

## Data Processing

### VQA
```
cd data_pre_process/DOVER
sh run.sh
```
This produces a quality score table for all videos; sort by score and remove the low-scoring videos.
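
For example, the filtering step can be scripted. The sketch below assumes the DOVER run produces a CSV with `video` and `score` columns (the file and column names are assumptions, adjust them to your actual output) and keeps only the higher-quality videos:

```python
# Hypothetical post-processing of the DOVER quality scores.
# Assumes a CSV with "video" and "score" columns; adapt names to your output.
import csv

KEEP_RATIO = 0.8  # keep the top 80% of videos by quality score

with open("dover_scores.csv", newline="") as f:
    rows = list(csv.DictReader(f))

rows.sort(key=lambda r: float(r["score"]), reverse=True)
keep = rows[: int(len(rows) * KEEP_RATIO)]

with open("videos_keep.txt", "w") as f:
    for r in keep:
        f.write(r["video"] + "\n")

print(f"Kept {len(keep)} of {len(rows)} videos")
```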

### Remove Dolly Zoom Videos
```
cd data_pre_process/VBench
sh run.sh
```
According to the motion smoothness scores in the resulting CSV file, you can remove the low-scoring videos.
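
The same idea applies here. A minimal sketch, assuming the VBench run writes a CSV with `video` and `motion_smoothness` columns (hypothetical names), that flags videos below a chosen threshold:

```python
# Hypothetical filter on VBench motion smoothness scores.
# Column names and the threshold are assumptions; adapt them to the actual CSV.
import csv

THRESHOLD = 0.95  # example cut-off for motion smoothness

with open("motion_smoothness.csv", newline="") as f:
    rows = list(csv.DictReader(f))

removed = [r["video"] for r in rows if float(r["motion_smoothness"]) < THRESHOLD]

with open("videos_to_remove.txt", "w") as f:
    f.write("\n".join(removed))

print(f"Flagged {len(removed)} videos for removal")
```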

## Training

### Model Distillation
```
sh configs/training_512_t2v_v1.0/run_distill.sh
```

### Acceleration Training
```
cd acceleration/t2v-turbo

# for the 0.7B model
sh train_07B.sh

# for the 0.9B model
sh train_09B.sh
```

## Inference
```
# for the 0.7B model
python inference_command_config_07B.py

# for the 0.9B model
python inference_command_config_09B.py
```
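
If you want to run both model sizes back to back and keep their logs, a purely illustrative wrapper could look like this (it only calls the two scripts above with their default configurations):

```python
# Illustrative wrapper: run both inference configs and capture their logs.
import subprocess

for script in ["inference_command_config_07B.py", "inference_command_config_09B.py"]:
    log_file = script.replace(".py", ".log")
    with open(log_file, "w") as log:
        subprocess.run(["python", script], stdout=log, stderr=subprocess.STDOUT, check=True)
    print(f"{script} finished, log written to {log_file}")
```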

## License
Copyright (c) 2024 Advanced Micro Devices, Inc. All Rights Reserved.