<h1 align='center'>MTVCraft</h1>
<h2 align='center'>An Open Veo3-style Audio-Video Generation Demo</h2>


<table align='center' border="0" style="width: 100%; text-align: center; margin-top: 80px;">
  <tr>
    <td align="center">
      <video controls width="100%">
        <source src="https://github.com/user-attachments/assets/b63d1f73-04a6-42fc-abd2-c5ebe0e76d46" type="video/mp4">
        Sorry, your browser does not support the video tag.
      </video>
      <em>For the best experience, please enable audio.</em>
    </td>
  </tr>
</table>

<p align="center">
  <a href="https://github.com/baaivision/MTVCraft">
    <img src="https://img.shields.io/badge/Project%20Page-MTVCraft-yellow">
  </a>
  <a href="https://arxiv.org/pdf/2506.08003">
    <img src="https://img.shields.io/badge/arXiv%20paper-2506.08003-red">
  </a>
  <a href="https://5a69dbd78850481972.gradio.live/">
    <img src="https://img.shields.io/badge/Online%20Demo-🤗-blue">
  </a>
</p>

<p align="center">
  <a href="#pipeline">Pipeline</a> |
  <a href="#installation">Installation</a> |
  <a href="#download-pretrained-models">Models</a> |
  <a href="#run-inference">Inference</a> |
  <a href="#citation">Citation</a>
</p>


## 🎬 Pipeline

MTVCraft is a framework for generating videos with synchronized audio from a single text prompt, exploring a potential pipeline for creating general audio-visual content.

Specifically, the framework consists of a multi-stage pipeline. First, MTVCraft employs [Qwen3](https://bailian.console.aliyun.com/?tab=model#/model-market/detail/qwen3?modelGroup=qwen3) to interpret the user's initial prompt, deconstructing it into separate descriptions for three audio categories: human speech, sound effects, and background music. These descriptions are then fed into [ElevenLabs](https://elevenlabs.io/) to synthesize the corresponding audio tracks. Finally, the generated audio tracks serve as conditions that guide the [MTV framework](https://arxiv.org/pdf/2506.08003) in generating a video temporally synchronized with the sound.

Notably, both Qwen3 and ElevenLabs can be replaced by available alternatives with similar capabilities.

<div align="center">

![pipeline](https://github.com/baaivision/MTVCraft/blob/main/assets/pipeline.png)

</div>

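Conceptually, the three stages reduce to a simple composition. The sketch below is purely illustrative (the stage functions are hypothetical placeholders standing in for Qwen3, ElevenLabs, and the MTV framework, not this repo's actual API):

```python
# Illustrative sketch of the MTVCraft pipeline; all three stage functions
# are placeholders, not the real implementations.
from typing import Dict

def split_prompt(prompt: str) -> Dict[str, str]:
    """Stage 1 (Qwen3): deconstruct the prompt into three audio descriptions."""
    return {"speech": "...", "effects": "...", "music": "..."}  # placeholder

def synthesize(description: str) -> bytes:
    """Stage 2 (ElevenLabs): render one description into an audio track."""
    return b""  # placeholder

def generate_video(prompt: str, tracks: Dict[str, bytes]) -> str:
    """Stage 3 (MTV): generate video conditioned on the audio tracks."""
    return "output.mp4"  # placeholder

def mtvcraft(prompt: str) -> str:
    specs = split_prompt(prompt)
    tracks = {kind: synthesize(desc) for kind, desc in specs.items()}
    return generate_video(prompt, tracks)
```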

## ⚙️ Installation

For CUDA 12.1, you can install the dependencies with the commands below. Otherwise, you need to manually install `torch`, `torchvision`, `torchaudio`, and `xformers`.

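For other setups, a manual install might look like the following (a sketch, not an official command; swap the wheel index for your CUDA version and keep versions consistent with `requirements.txt`):

```bash
# Example for CUDA 11.8 wheels; replace cu118 with your CUDA version.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install xformers
```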

Download the code:

```bash
git clone https://github.com/suimuc/MTVCraft
cd MTVCraft
```

Create a conda environment:

```bash
conda create -n mtv python=3.10
conda activate mtv
```

Install the packages with `pip`:

```bash
pip install -r requirements.txt
```

In addition, `ffmpeg` is required:

```bash
apt-get install ffmpeg
```

## 📥 Download Pretrained Models

You can get all pretrained models required for inference from our [HuggingFace repo](https://huggingface.co/BAAI/MTV).

Use `huggingface-cli` to download the models:

```shell
cd $ProjectRootDir
pip install "huggingface_hub[cli]"
huggingface-cli download BAAI/MTV --local-dir ./pretrained_models
```

Or you can download them separately from their source repos (see the sketch after this list):

- [mtv](https://huggingface.co/BAAI/MTV/tree/main/mtv): our checkpoints
- [t5-v1_1-xxl](https://huggingface.co/google/t5-v1_1-xxl): text encoder; you can download the [text_encoder](https://huggingface.co/THUDM/CogVideoX-2b/tree/main/text_encoder) and [tokenizer](https://huggingface.co/THUDM/CogVideoX-2b/tree/main/tokenizer) folders from CogVideoX-2b
- [vae](https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT/tree/main/vae): CogVideoX-5B pretrained 3D VAE
- [wav2vec](https://huggingface.co/facebook/wav2vec2-base-960h): audio feature extraction model from [Facebook](https://huggingface.co/facebook/wav2vec2-base-960h)

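One possible way to fetch the separate components with `huggingface-cli` (a sketch; the `--include` patterns are assumptions, and you may need to rearrange the downloaded files to match the layout below):

```shell
huggingface-cli download BAAI/MTV --include "mtv/*" --local-dir ./pretrained_models
huggingface-cli download THUDM/CogVideoX-2b --include "text_encoder/*" "tokenizer/*" --local-dir ./cogvideox-2b
huggingface-cli download THUDM/CogVideoX1.5-5B-SAT --include "vae/*" --local-dir ./pretrained_models
huggingface-cli download facebook/wav2vec2-base-960h --local-dir ./pretrained_models/wav2vec2-base-960h
```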

Finally, these pretrained models should be organized as follows:

```text
./pretrained_models/
|-- mtv/
|   |-- single/
|   |   |-- 1/
|   |   |   `-- mp_rank_00_model_states.pt
|   |   `-- latest
|   |-- multi/
|   |   |-- 1/
|   |   |   `-- mp_rank_00_model_states.pt
|   |   `-- latest
|   `-- accm/
|       |-- 1/
|       |   `-- mp_rank_00_model_states.pt
|       `-- latest
|-- t5-v1_1-xxl/
|   |-- config.json
|   |-- model-00001-of-00002.safetensors
|   |-- model-00002-of-00002.safetensors
|   |-- model.safetensors.index.json
|   |-- special_tokens_map.json
|   |-- spiece.model
|   `-- tokenizer_config.json
|-- vae/
|   `-- 3d-vae.pt
`-- wav2vec2-base-960h/
    |-- config.json
    |-- feature_extractor_config.json
    |-- model.safetensors
    |-- preprocessor_config.json
    |-- special_tokens_map.json
    |-- tokenizer_config.json
    `-- vocab.json
```
146
+ ## 🎮 Run Inference
147
+
148
+ #### API Setup (Required)
149
+ Before running the inference script, make sure to configure your API keys in the file `mtv/utils.py`. Edit the following section:
150
+ ```python
151
+ # mtv/utils.py
152
+
153
+ qwen_model_name = "qwen-plus" # or another model name you prefer
154
+ qwen_api_key = "YOUR_QWEN_API_KEY" # replace with your actual Qwen API key
155
+
156
+ client = OpenAI(
157
+ api_key=qwen_api_key,
158
+ base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
159
+ )
160
+
161
+ elevenlabs = ElevenLabs(
162
+ api_key="YOUR_ELEVENLABS_API_KEY", # replace with your actual ElevenLabs API key
163
+ )
164
+ ```
165
+
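
If you would rather not hardcode secrets, one illustrative variant (an assumption on our part, not part of the repo) is to read the keys from environment variables in the same section:

```python
# mtv/utils.py (same section as above, keys pulled from the environment)
import os

qwen_model_name = "qwen-plus"

client = OpenAI(
    api_key=os.environ["QWEN_API_KEY"],
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

elevenlabs = ElevenLabs(
    api_key=os.environ["ELEVENLABS_API_KEY"],
)
```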

#### Batch

Once the API keys are set, you can run inference using the provided script:

```bash
bash scripts/inference_long.sh ./examples/samples.txt output_dir
```

This reads the input prompts from `./examples/samples.txt` and saves the results to `output_dir`.

#### Gradio UI

To launch the Gradio UI, simply run:

```bash
bash scripts/app.sh output_dir
```

182
+ ## 📝 Citation
183
+
184
+ If you find our work useful for your research, please consider citing the paper:
185
+
186
+ ```
187
+ @article{MTV,
188
+ title={Audio-Sync Video Generation with Multi-Stream Temporal Control},
189
+ author={Weng, Shuchen and Zheng, Haojie and Chang, Zheng and Li, Si and Shi, Boxin and Wang, Xinlong},
190
+ journal={arXiv preprint arXiv:2506.08003},
191
+ year={2025}
192
+ }
193
+ ```