Vidi2: Large Multimodal Models for Video Understanding and Creation
Paper: arXiv:2511.19529
MLX port of ByteDance Vidi1.5-9B for Apple Silicon.
Vidi is a multimodal video temporal grounding model: give it a video and a text query, and it tells you when things happen.
```
                  ┌──────────────┐
Video Frames ────>│   SigLip2    │───> Vision Tokens ───┐
  (384x384)       │  27L, 1152d  │                      │ Cross-Attention
                  └──────────────┘                      │ (every layer)
                                                        v
                  ┌──────────────┐       ┌──────────────┐
Text Tokens ─────>│  Gemma2-9B   │──T2T─>│  42 Decoder  │───> Output
                  │    3584d     │<─T2A──│    Layers    │     Timestamps
                  └──────────────┘       └──────────────┘
                                                        ^
                  ┌──────────────┐                      │ Cross-Attention
Audio (16kHz) ───>│  Whisper-v3  │───> Audio Tokens ────┘
(mel spectrogram) │  32L, 1280d  │
                  └──────────────┘
```
Key: Cross-attention reuses self-attention q/k/v/o weights (no extra parameters).
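The "no extra parameters" point is worth spelling out: the same q/k/v/o projection matrices serve both self-attention (text attends to text) and cross-attention (text queries attend to vision or audio keys/values). A toy NumPy sketch of the idea, not the actual `mlx_vidi` implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8            # toy model width (Vidi's decoder is 3584d)
n_txt, n_vis = 4, 6

# One set of attention projections, as in a self-attention layer.
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))

def attention(queries, keys_values):
    """Scaled dot-product attention. When queries == keys_values this is
    self-attention (T2T); otherwise it is cross-attention (T2V / T2A)
    using the SAME Wq/Wk/Wv/Wo weights, so no new parameters appear."""
    q = queries @ Wq
    k = keys_values @ Wk
    v = keys_values @ Wv
    scores = q @ k.T / np.sqrt(d)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return (probs @ v) @ Wo

text = rng.standard_normal((n_txt, d))
vision = rng.standard_normal((n_vis, d))

t2t = attention(text, text)     # self-attention over text tokens
t2v = attention(text, vision)   # cross-attention into vision tokens
```

Both calls produce one output row per text token; only the source of the keys and values changes.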
Install the dependencies:

```bash
pip install mlx safetensors transformers opencv-python
```
Download the 8-bit quantized weights from Hugging Face:

```bash
# Option 1: huggingface-cli
huggingface-cli download wangjazz/Vidi1.5-9B-mlx-8bit --local-dir ./Vidi1.5-9B-mlx-8bit

# Option 2: git lfs
git lfs install
git clone https://huggingface.co/wangjazz/Vidi1.5-9B-mlx-8bit
```
```bash
# Video temporal grounding
python -m mlx_vidi.run \
    --model-path ./Vidi1.5-9B-mlx-8bit \
    --video-path ./your_video.mp4 \
    --query "a person talking"

# Image mode
python -m mlx_vidi.run \
    --model-path ./Vidi1.5-9B-mlx-8bit \
    --image-path ./your_image.jpg \
    --query "describe this image"
```
```python
import json
from pathlib import Path

import mlx.core as mx

from mlx_vidi.config import ModelConfig
from mlx_vidi.generate import VidiEngine
from mlx_vidi.quantize import quantize_engine

# Load the config and build the quantized engine
model_dir = Path("./Vidi1.5-9B-mlx-8bit")
with open(model_dir / "config.json") as f:
    raw = json.load(f)
config = ModelConfig.from_dict(raw)
engine = VidiEngine(config)

quant = raw["quantization"]
quantize_engine(engine, bits=quant["bits"], group_size=quant["group_size"])

# Load the sharded safetensors weights
weights = {}
for wf in sorted(model_dir.glob("model-*.safetensors")):
    weights.update(mx.load(str(wf)))
engine.load_weights(list(weights.items()), strict=False)
mx.eval(engine.parameters())

# Prepare inputs (see mlx_vidi/preprocessing.py)
from mlx_vidi.preprocessing import extract_video_frames, process_images, ...

# Generate
token_ids = engine.generate(
    input_ids=input_ids,
    pixel_values=pixel_values,
    mel_features=mel_features,
    audio_sizes=audio_sizes,
    max_tokens=200,
    temperature=0.0,
)
```
Vidi is a temporal grounding model. It outputs normalized timestamps in the range 0.0-1.0:

```
Query:  "During which time segments in the video can we see a person singing?"
Output: 0.21-0.22, 0.46-0.47
# For a 60s video → actual time: 12.6s-13.2s, 27.6s-28.2s
```
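Converting the normalized output to wall-clock time is just a multiplication by the video duration. A small helper you could write yourself (`to_seconds` is a hypothetical function, not part of `mlx_vidi`):

```python
def to_seconds(output: str, duration_s: float) -> list[tuple[float, float]]:
    """Convert Vidi's normalized 'start-end, start-end' output string
    into (start, end) pairs in seconds."""
    segments = []
    for span in output.split(","):
        start, end = (float(x) for x in span.strip().split("-"))
        segments.append((round(start * duration_s, 1), round(end * duration_s, 1)))
    return segments

print(to_seconds("0.21-0.22, 0.46-0.47", 60.0))
# [(12.6, 13.2), (27.6, 28.2)]
```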
If you want to convert from the original PyTorch weights:

```bash
# 1. Download original weights
huggingface-cli download bytedance-research/Vidi1.5-9B --local-dir ./Vidi1.5-9B

# 2. Convert to MLX fp16
python -m mlx_vidi.convert_weights \
    --input-dir ./Vidi1.5-9B \
    --output-dir ./Vidi1.5-9B-mlx \
    --dtype float16

# 3. Quantize to 8-bit
python -m mlx_vidi.quantize \
    --input-dir ./Vidi1.5-9B-mlx \
    --output-dir ./Vidi1.5-9B-mlx-8bit \
    --bits 8 --group-size 64
```
```
mlx_vidi/
├── config.py           # ModelConfig dataclass
├── model.py            # Dual Attention Gemma2 (T2T + T2V + T2A cross-attention)
├── vision_encoder.py   # SigLip2-so400m-patch14-384
├── audio_encoder.py    # Whisper-large-v3 encoder
├── projectors.py       # VidiRMSNorm, MMProjector, LearnablePosEmbd, Conv2DPool, etc.
├── generate.py         # VidiEngine + KVCache + autoregressive generation
├── preprocessing.py    # Video/audio/text preprocessing (OpenCV + transformers)
├── convert_weights.py  # PyTorch safetensors → MLX safetensors
├── quantize.py         # 4-bit / 8-bit quantization
└── run.py              # CLI entry point
```
Tested on a 60-second music video:
| Version | Size | "guitar played" | "audience/crowd" | "face close-up" |
|---|---|---|---|---|
| fp16 | 19.6 GB | 0:11-0:12 | 7 segments | 0:49-0:51 |
| 8-bit | 11.3 GB | 0:11-0:12 | 7 segments | 0:49-0:51 |
| 4-bit | 6.9 GB | 0:00-0:01 | 2 segments | 0:49-0:51 |
8-bit maintains near-identical quality to fp16 with 42% less memory.
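This is the usual pattern for affine group-wise quantization: each group of `group_size` weights gets its own scale and offset, so 8 bits per weight loses very little precision while 4 bits starts to degrade outlier-heavy layers. A NumPy sketch of the scheme (illustrative of the idea behind MLX's group-wise quantization, not its exact bit-packing or kernel):

```python
import numpy as np

def quantize_groupwise(w, bits=8, group_size=64):
    """Affine group-wise quantization: each group of `group_size` values
    along the flattened weight gets its own scale and offset."""
    g = w.reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2**bits - 1)
    q = np.round((g - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo, shape):
    """Reconstruct approximate fp weights from codes + per-group params."""
    return (q * scale + lo).reshape(shape)

w = np.random.default_rng(0).standard_normal((128, 128)).astype(np.float32)
q, scale, lo = quantize_groupwise(w, bits=8, group_size=64)
w_hat = dequantize(q, scale, lo, w.shape)

# Rounding error is bounded by half a quantization step per group
max_err = np.abs(w - w_hat).max()
```

With 8 bits the step size is (max − min) / 255 per group, which is why the 8-bit results above are indistinguishable from fp16 on this test.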
If you use this project, please cite the original Vidi paper:
```bibtex
@article{Vidi2026vidi2.5,
  title   = {Vidi2.5: Large Multimodal Models for Video Understanding and Creation},
  author  = {Vidi Team and Chia-Wen Kuo and Chuang Huang and Dawei Du and Fan Chen and
             Fanding Lei and Feng Gao and Guang Chen and Haoji Zhang and Haojun Zhao and
             Jin Liu and Jingjing Zhuge and Lili Fang and Lingxi Zhang and Longyin Wen and
             Lu Guo and Lu Xu and Lusha Li and Qihang Fan and Rachel Deng and Shaobo Fang and
             Shu Zhang and Sijie Zhu and Stuart Siew and Weiyan Tao and Wen Zhong and
             Xiaohui Shen and Xin Gu and Ye Yuan and Yicheng He and Yiming Cui and
             Zhenfang Chen and Zhihua Wu and Zuhua Lin},
  journal = {arXiv preprint arXiv:2511.19529},
  year    = {2026}
}
```
Code: Apache 2.0
Model weights: Subject to the original Vidi model license.