MTVCraft
An Open Veo3-style Audio-Video Generation Demo
For the best experience, please enable audio.
🎬 Pipeline
MTVCraft is a framework for generating videos with synchronized audio from a single text prompt, exploring a potential pipeline for creating general audio-visual content.
Specifically, the framework consists of a multi-stage pipeline. First, MTVCraft employs Qwen3 to interpret the user's initial prompt, deconstructing it into separate descriptions for three audio categories: human speech, sound effects, and background music. Subsequently, these descriptions are fed into ElevenLabs to synthesize the corresponding audio tracks. Finally, the generated audio tracks serve as conditions that guide the MTV framework to generate a video temporally synchronized with the sound.
Notably, both Qwen3 and ElevenLabs can be replaced by available alternatives with similar capabilities.
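As a concrete illustration of the first stage, here is a minimal sketch of how a prompt could be deconstructed into the three audio categories through Qwen's OpenAI-compatible endpoint. The `decompose_prompt` helper, the system prompt, and the output format are hypothetical, not MTVCraft's actual implementation:

```python
# Hypothetical sketch of the prompt-decomposition stage; the helper name,
# system prompt, and output format are illustrative, not MTVCraft's code.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_QWEN_API_KEY",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

def decompose_prompt(prompt: str) -> str:
    """Ask Qwen to split a scene prompt into speech / SFX / music descriptions."""
    response = client.chat.completions.create(
        model="qwen-plus",
        messages=[
            {
                "role": "system",
                "content": (
                    "Split the user's scene description into three parts: "
                    "human speech, sound effects, and background music. "
                    "Return one line per category."
                ),
            },
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content

print(decompose_prompt("A busker sings on a rainy street as cars pass by."))
```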

⚙️ Installation
For CUDA 12.1, you can install the dependencies with the following commands. Otherwise, you need to manually install `torch`, `torchvision`, `torchaudio`, and `xformers`.
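For example, a manual install for CUDA 11.8 might look like this (check the PyTorch and xFormers documentation for version combinations that match your setup):

```bash
# Example only: adjust the CUDA tag (cu118) to your local toolkit version
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install xformers
```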
Download the code:
```bash
git clone https://github.com/baaivision/MTVCraft
cd MTVCraft
```
Create the conda environment:
```bash
conda create -n mtv python=3.10
conda activate mtv
```
Install packages with pip:
```bash
pip install -r requirements.txt
```
Besides, ffmpeg is also needed:
```bash
apt-get install ffmpeg
```
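You can verify the ffmpeg installation with:

```bash
ffmpeg -version
```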
🎥 Download Pretrained Models
You can get all the pretrained models required for inference from our HuggingFace repo.
Use `huggingface-cli` to download the models:
```bash
cd $ProjectRootDir
pip install "huggingface_hub[cli]"
huggingface-cli download BAAI/MTVCraft --local-dir ./pretrained_models
```
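Alternatively, here is a minimal Python sketch using the `huggingface_hub` API, equivalent to the CLI call above:

```python
# Download the BAAI/MTVCraft checkpoints into ./pretrained_models
from huggingface_hub import snapshot_download

snapshot_download(repo_id="BAAI/MTVCraft", local_dir="./pretrained_models")
```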
Or you can download them separately from their source repos:
- mtv: our checkpoints
- t5-v1_1-xxl: text encoder; you can download it from text_encoder and tokenizer
- vae: CogVideoX-5b pretrained 3D VAE
- wav2vec: wav2vec 2.0 speech feature extractor from Facebook
Finally, these pretrained models should be organized as follows:
```
./pretrained_models/
|-- mtv/
|   |-- single/
|   |   |-- 1/
|   |   |   `-- mp_rank_00_model_states.pt
|   |   `-- latest
|   |-- multi/
|   |   |-- 1/
|   |   |   `-- mp_rank_00_model_states.pt
|   |   `-- latest
|   `-- accm/
|       |-- 1/
|       |   `-- mp_rank_00_model_states.pt
|       `-- latest
|-- t5-v1_1-xxl/
|   |-- config.json
|   |-- model-00001-of-00002.safetensors
|   |-- model-00002-of-00002.safetensors
|   |-- model.safetensors.index.json
|   |-- special_tokens_map.json
|   |-- spiece.model
|   `-- tokenizer_config.json
|-- vae/
|   `-- 3d-vae.pt
`-- wav2vec2-base-960h/
    |-- config.json
    |-- feature_extractor_config.json
    |-- model.safetensors
    |-- preprocessor_config.json
    |-- special_tokens_map.json
    |-- tokenizer_config.json
    `-- vocab.json
```
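As an optional sanity check, here is a short hypothetical snippet (not part of the repo) that verifies the expected files are in place:

```python
# Hypothetical layout check; the paths mirror the tree above.
from pathlib import Path

root = Path("./pretrained_models")
expected = [
    "mtv/single/latest",
    "mtv/multi/latest",
    "mtv/accm/latest",
    "t5-v1_1-xxl/config.json",
    "vae/3d-vae.pt",
    "wav2vec2-base-960h/model.safetensors",
]
for rel in expected:
    status = "ok" if (root / rel).exists() else "MISSING"
    print(f"{status:7} {rel}")
```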
🎮 Run Inference
API Setup (Required)
Before running the inference script, make sure to configure your API keys in `mtv/utils.py`. Edit the following section:
```python
# mtv/utils.py
from openai import OpenAI
from elevenlabs.client import ElevenLabs

qwen_model_name = "qwen-plus"       # or another model name you prefer
qwen_api_key = "YOUR_QWEN_API_KEY"  # replace with your actual Qwen API key

# Qwen is reached through DashScope's OpenAI-compatible endpoint
client = OpenAI(
    api_key=qwen_api_key,
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

elevenlabs = ElevenLabs(
    api_key="YOUR_ELEVENLABS_API_KEY",  # replace with your actual ElevenLabs API key
)
```
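If you prefer not to hard-code secrets, you could read the keys from environment variables instead; this is a small variation on the snippet above, with illustrative variable names:

```python
import os

qwen_api_key = os.environ["QWEN_API_KEY"]              # export QWEN_API_KEY=...
elevenlabs_api_key = os.environ["ELEVENLABS_API_KEY"]  # export ELEVENLABS_API_KEY=...
```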
Batch
Once the API keys are set, you can run inference using the provided script:
```bash
bash scripts/inference_long.sh ./examples/samples.txt output_dir
```
This will read the input prompts from `./examples/samples.txt` and save the results to `output_dir`.
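For reference, a hypothetical `./examples/samples.txt`, assuming one prompt per line (check the file shipped in `examples/` for the exact format):

```text
A street performer plays guitar while a small crowd claps along.
Waves crash on a rocky beach at dusk as seagulls cry overhead.
```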
Gradio UI
To launch the Gradio UI, simply run:
```bash
bash scripts/app.sh output_dir
```
📝 Citation
If you find our work useful for your research, please consider citing the paper:
```bibtex
@article{MTV,
  title={Audio-Sync Video Generation with Multi-Stream Temporal Control},
  author={Weng, Shuchen and Zheng, Haojie and Chang, Zheng and Li, Si and Shi, Boxin and Wang, Xinlong},
  journal={arXiv preprint arXiv:2506.08003},
  year={2025}
}
```