MTVCraft
An Open Veo3-style Audio-Video Generation Demo
For the best experience, please enable audio.
🎬 Pipeline
MTVCraft is a framework for generating videos with synchronized audio from a single text prompt, exploring a potential pipeline for creating general audio-visual content.
Specifically, the framework consists of a multi-stage pipeline. First, MTVCraft employs Qwen3 to interpret the user's initial prompt, deconstructing it into separate descriptions for three audio categories: human speech, sound effects, and background music. Subsequently, these descriptions are fed into ElevenLabs to synthesize the corresponding audio tracks. Finally, the generated audio tracks serve as conditions that guide the MTV framework to generate a video temporally synchronized with the sound.
Notably, both Qwen3 and ElevenLabs can be replaced by available alternatives with similar capabilities.
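As a concrete illustration of the first stage, here is a minimal sketch of how a prompt could be deconstructed into the three audio categories through Qwen's OpenAI-compatible endpoint. The `decompose_prompt` helper, the system prompt, and the output format are hypothetical, not MTVCraft's actual implementation:

```python
# Hypothetical sketch of the prompt-decomposition stage; the helper name,
# system prompt, and output format are illustrative, not MTVCraft's code.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_QWEN_API_KEY",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

def decompose_prompt(prompt: str) -> str:
    """Ask Qwen to split a scene prompt into speech / SFX / music descriptions."""
    response = client.chat.completions.create(
        model="qwen-plus",
        messages=[
            {
                "role": "system",
                "content": (
                    "Split the user's scene description into three parts: "
                    "human speech, sound effects, and background music. "
                    "Return one line per category."
                ),
            },
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content

print(decompose_prompt("A busker sings on a rainy street as cars pass by."))
```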

⚙️ Installation
For CUDA 12.1, you can install the dependencies with the following commands. Otherwise, you need to manually install `torch`, `torchvision`, `torchaudio`, and `xformers`.
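For example, a manual install for CUDA 11.8 might look like this (check the PyTorch and xFormers documentation for version combinations that match your setup):

```bash
# Example only: adjust the CUDA tag (cu118) to your local toolkit version
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install xformers
```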
Download the code:
```bash
git clone https://github.com/baaivision/MTVCraft
cd MTVCraft
```
Create the conda environment:
```bash
conda create -n mtv python=3.10
conda activate mtv
```
Install packages with pip:
```bash
pip install -r requirements.txt
```
Besides, ffmpeg is also needed:
```bash
apt-get install ffmpeg
```
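You can verify the ffmpeg installation with:

```bash
ffmpeg -version
```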
🎥 Download Pretrained Models
You can get all the pretrained models required for inference from our HuggingFace repo.
Use `huggingface-cli` to download the models:
```bash
cd $ProjectRootDir
pip install "huggingface_hub[cli]"
huggingface-cli download BAAI/MTVCraft --local-dir ./pretrained_models
```
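Alternatively, here is a minimal Python sketch using the `huggingface_hub` API, equivalent to the CLI call above:

```python
# Download the BAAI/MTVCraft checkpoints into ./pretrained_models
from huggingface_hub import snapshot_download

snapshot_download(repo_id="BAAI/MTVCraft", local_dir="./pretrained_models")
```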
Or you can download them separately from their source repos:
- mtv: our checkpoints
- t5-v1_1-xxl: text encoder; you can download it from text_encoder and tokenizer
- vae: CogVideoX-5b pretrained 3D VAE
- wav2vec: wav2vec 2.0 speech feature extractor from Facebook
Finally, these pretrained models should be organized as follows:
```
./pretrained_models/
|-- mtv/
|   |-- single/
|   |   |-- 1/
|   |   |   `-- mp_rank_00_model_states.pt
|   |   `-- latest
|   |-- multi/
|   |   |-- 1/
|   |   |   `-- mp_rank_00_model_states.pt
|   |   `-- latest
|   `-- accm/
|       |-- 1/
|       |   `-- mp_rank_00_model_states.pt
|       `-- latest
|-- t5-v1_1-xxl/
|   |-- config.json
|   |-- model-00001-of-00002.safetensors
|   |-- model-00002-of-00002.safetensors
|   |-- model.safetensors.index.json
|   |-- special_tokens_map.json
|   |-- spiece.model
|   `-- tokenizer_config.json
|-- vae/
|   `-- 3d-vae.pt
`-- wav2vec2-base-960h/
    |-- config.json
    |-- feature_extractor_config.json
    |-- model.safetensors
    |-- preprocessor_config.json
    |-- special_tokens_map.json
    |-- tokenizer_config.json
    `-- vocab.json
```
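As an optional sanity check, here is a short hypothetical snippet (not part of the repo) that verifies the expected files are in place:

```python
# Hypothetical layout check; the paths mirror the tree above.
from pathlib import Path

root = Path("./pretrained_models")
expected = [
    "mtv/single/latest",
    "mtv/multi/latest",
    "mtv/accm/latest",
    "t5-v1_1-xxl/config.json",
    "vae/3d-vae.pt",
    "wav2vec2-base-960h/model.safetensors",
]
for rel in expected:
    status = "ok" if (root / rel).exists() else "MISSING"
    print(f"{status:7} {rel}")
```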
🎮 Run Inference
API Setup (Required)
Before running the inference script, make sure to configure your API keys in `mtv/utils.py`. Edit the following section:
```python
# mtv/utils.py
from openai import OpenAI
from elevenlabs.client import ElevenLabs

qwen_model_name = "qwen-plus"       # or another model name you prefer
qwen_api_key = "YOUR_QWEN_API_KEY"  # replace with your actual Qwen API key

# Qwen is reached through DashScope's OpenAI-compatible endpoint
client = OpenAI(
    api_key=qwen_api_key,
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

elevenlabs = ElevenLabs(
    api_key="YOUR_ELEVENLABS_API_KEY",  # replace with your actual ElevenLabs API key
)
```
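If you prefer not to hard-code secrets, you could read the keys from environment variables instead; this is a small variation on the snippet above, with illustrative variable names:

```python
import os

qwen_api_key = os.environ["QWEN_API_KEY"]              # export QWEN_API_KEY=...
elevenlabs_api_key = os.environ["ELEVENLABS_API_KEY"]  # export ELEVENLABS_API_KEY=...
```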
Batch
Once the API keys are set, you can run inference using the provided script:
```bash
bash scripts/inference_long.sh ./examples/samples.txt output_dir
```
This will read the input prompts from `./examples/samples.txt` and save the results to `output_dir`.
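For reference, a hypothetical `./examples/samples.txt`, assuming one prompt per line (check the file shipped in `examples/` for the exact format):

```text
A street performer plays guitar while a small crowd claps along.
Waves crash on a rocky beach at dusk as seagulls cry overhead.
```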
Gradio UI
To launch the Gradio UI, simply run:
```bash
bash scripts/app.sh output_dir
```
📝 Citation
If you find our work useful for your research, please consider citing the paper:
```bibtex
@article{MTV,
  title={Audio-Sync Video Generation with Multi-Stream Temporal Control},
  author={Weng, Shuchen and Zheng, Haojie and Chang, Zheng and Li, Si and Shi, Boxin and Wang, Xinlong},
  journal={arXiv preprint arXiv:2506.08003},
  year={2025}
}
```