<h1 align='center'>MTVCraft</h1>
<h2 align='center'>An Open Veo3-style Audio-Video Generation Demo</h2>

<table align='center' border="0" style="width: 100%; text-align: center; margin-top: 80px;">
  <tr>
    <td align="center">
      <video controls width="100%">
        <source src="https://github.com/user-attachments/assets/b63d1f73-04a6-42fc-abd2-c5ebe0e76d46" type="video/mp4">
        Sorry, your browser does not support the video tag.
      </video>
      <em>For the best experience, please enable audio.</em>
    </td>
  </tr>
</table>

<p align="center">
  <a href="https://github.com/baaivision/MTVCraft">
    <img src="https://img.shields.io/badge/Project%20Page-MTVCraft-yellow">
  </a>
  <a href="https://arxiv.org/pdf/2506.08003">
    <img src="https://img.shields.io/badge/arXiv%20paper-2506.08003-red">
  </a>
  <a href="https://5a69dbd78850481972.gradio.live/">
    <img src="https://img.shields.io/badge/Online%20Demo-🤗-blue">
  </a>
</p>

<p align="center">
  <a href="#pipeline">Pipeline</a> |
  <a href="#installation">Installation</a> |
  <a href="#download-pretrained-models">Models</a> |
  <a href="#run-inference">Inference</a> |
  <a href="#citation">Citation</a>
</p>

## 🎬 Pipeline

MTVCraft is a framework for generating videos with synchronized audio from a single text prompt, exploring a potential pipeline for creating general audio-visual content.

Specifically, the framework consists of a multi-stage pipeline. First, MTVCraft employs [Qwen3](https://bailian.console.aliyun.com/?tab=model#/model-market/detail/qwen3?modelGroup=qwen3) to interpret the user's initial prompt, deconstructing it into separate descriptions for three audio categories: human speech, sound effects, and background music. These descriptions are then fed into [ElevenLabs](https://elevenlabs.io/) to synthesize the corresponding audio tracks. Finally, the generated audio tracks serve as conditions that guide the [MTV framework](https://arxiv.org/pdf/2506.08003) to generate a video temporally synchronized with the sound.

Notably, both Qwen3 and ElevenLabs can be replaced by alternatives with similar capabilities.

<div align="center">

![pipeline](https://github.com/user-attachments/assets/59d3a8ae-cca8-4b25-8b97-1a41a1b82316)

</div>
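
To make the three stages concrete, here is a minimal, self-contained sketch of the control flow described above. Every function name is a hypothetical stand-in (stubbed so the sketch runs), not an API from this repo:

```python
# Minimal sketch of the three-stage MTVCraft pipeline; all functions here are
# hypothetical stubs standing in for the real Qwen3 / ElevenLabs / MTV calls.

def decompose_prompt(prompt: str) -> dict:
    # Stage 1 (an LLM such as Qwen3): split the prompt into three audio descriptions.
    return {"speech": f"speech for: {prompt}",
            "sfx": f"sound effects for: {prompt}",
            "music": f"background music for: {prompt}"}

def synthesize_audio(description: str) -> bytes:
    # Stage 2 (a TTS/audio service such as ElevenLabs): render one description
    # into an audio track (placeholder bytes here).
    return description.encode()

def generate_video(prompt: str, tracks: list) -> str:
    # Stage 3 (the MTV framework): generate a video conditioned on the audio
    # tracks, keeping the visuals temporally synchronized with the sound.
    return f"<video for {prompt!r}, conditioned on {len(tracks)} audio tracks>"

prompt = "a dog barks while jazz plays in a cafe"
tracks = [synthesize_audio(d) for d in decompose_prompt(prompt).values()]
print(generate_video(prompt, tracks))
```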

## ⚙️ Installation

For CUDA 12.1, you can install the dependencies with the following commands. Otherwise, you need to manually install `torch`, `torchvision`, `torchaudio`, and `xformers`.
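
For other CUDA versions, a common approach is to install the matching wheels from the PyTorch package index first; the commands below are an illustrative sketch for CUDA 11.8 (check the official PyTorch and xformers install pages for the exact versions):

```bash
# Illustrative example for CUDA 11.8; adjust the index URL to your CUDA version.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install -U xformers --index-url https://download.pytorch.org/whl/cu118
```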

Download the code:

```bash
git clone https://github.com/suimuc/MTVCraft
cd MTVCraft
```

Create a conda environment:

```bash
conda create -n mtv python=3.10
conda activate mtv
```

Install the packages with `pip`:

```bash
pip install -r requirements.txt
```

In addition, `ffmpeg` is required:

```bash
apt-get install ffmpeg
```

## 📥 Download Pretrained Models

You can easily get all the pretrained models required for inference from our [Hugging Face repo](https://huggingface.co/BAAI/MTV).

Use `huggingface-cli` to download the models:

```shell
cd $ProjectRootDir
pip install "huggingface_hub[cli]"
huggingface-cli download BAAI/MTV --local-dir ./pretrained_models
```
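
Alternatively, the same download can be scripted from Python with the `huggingface_hub` library's standard `snapshot_download` helper:

```python
# Fetch all MTV checkpoints into ./pretrained_models (same layout as below).
from huggingface_hub import snapshot_download

snapshot_download(repo_id="BAAI/MTV", local_dir="./pretrained_models")
```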

Or you can download them separately from their source repos:

- [mtv](https://huggingface.co/BAAI/MTV/tree/main/mtv): our checkpoints
- [t5-v1_1-xxl](https://huggingface.co/google/t5-v1_1-xxl): the text encoder; you can download it from [text_encoder](https://huggingface.co/THUDM/CogVideoX-2b/tree/main/text_encoder) and [tokenizer](https://huggingface.co/THUDM/CogVideoX-2b/tree/main/tokenizer)
- [vae](https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT/tree/main/vae): the pretrained 3D VAE from CogVideoX-5B
- [wav2vec](https://huggingface.co/facebook/wav2vec2-base-960h): the wav2vec 2.0 model from [Facebook](https://huggingface.co/facebook/wav2vec2-base-960h), used to extract audio features

Finally, these pretrained models should be organized as follows:

```text
./pretrained_models/
|-- mtv/
|   |-- single/
|   |   |-- 1/
|   |   |   `-- mp_rank_00_model_states.pt
|   |   `-- latest
|   |
|   |-- multi/
|   |   |-- 1/
|   |   |   `-- mp_rank_00_model_states.pt
|   |   `-- latest
|   |
|   `-- accm/
|       |-- 1/
|       |   `-- mp_rank_00_model_states.pt
|       `-- latest
|
|-- t5-v1_1-xxl/
|   |-- config.json
|   |-- model-00001-of-00002.safetensors
|   |-- model-00002-of-00002.safetensors
|   |-- model.safetensors.index.json
|   |-- special_tokens_map.json
|   |-- spiece.model
|   `-- tokenizer_config.json
|
|-- vae/
|   `-- 3d-vae.pt
|
`-- wav2vec2-base-960h/
    |-- config.json
    |-- feature_extractor_config.json
    |-- model.safetensors
    |-- preprocessor_config.json
    |-- special_tokens_map.json
    |-- tokenizer_config.json
    `-- vocab.json
```

## 🎮 Run Inference

#### API Setup (Required)

Before running the inference script, make sure to configure your API keys in the file `mtv/utils.py`. Edit the following section:

```python
# mtv/utils.py

qwen_model_name = "qwen-plus"       # or another model name you prefer
qwen_api_key = "YOUR_QWEN_API_KEY"  # replace with your actual Qwen API key

client = OpenAI(
    api_key=qwen_api_key,
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

elevenlabs = ElevenLabs(
    api_key="YOUR_ELEVENLABS_API_KEY",  # replace with your actual ElevenLabs API key
)
```
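
Since Qwen is accessed through DashScope's OpenAI-compatible endpoint, you can sanity-check the key with a one-off chat call before a full run. This is a minimal sketch, not part of the repo (verifying the ElevenLabs key works analogously through its SDK):

```python
# Hypothetical smoke test for the Qwen API key configured above; not part of the repo.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_QWEN_API_KEY",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
reply = client.chat.completions.create(
    model="qwen-plus",
    messages=[{"role": "user", "content": "Reply with the single word: ok"}],
)
print(reply.choices[0].message.content)
```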

#### Batch

Once the API keys are set, you can run inference using the provided script:

```bash
bash scripts/inference_long.sh ./examples/samples.txt output_dir
```

This will read the input prompts from `./examples/samples.txt` and save the results to the specified output directory (here, `./output_dir`).
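
The repo ships example prompts in `./examples/samples.txt`; as a purely hypothetical illustration of a prompts file (check the shipped file for the actual format), one prompt per line might look like:

```text
A street performer plays an acoustic guitar while a small crowd claps along.
Waves crash against a rocky shore as seagulls cry overhead.
```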

#### Gradio UI

To launch the Gradio UI, simply run:

```bash
bash scripts/app.sh output_dir
```

## 📝 Citation

If you find our work useful for your research, please consider citing the paper:

```bibtex
@article{MTV,
  title={Audio-Sync Video Generation with Multi-Stream Temporal Control},
  author={Weng, Shuchen and Zheng, Haojie and Chang, Zheng and Li, Si and Shi, Boxin and Wang, Xinlong},
  journal={arXiv preprint arXiv:2506.08003},
  year={2025}
}
```