# prolongvid_7B

## Model Summary

The ProLongVid-v1 models are 7B-parameter models trained on ProLongVid_data. They are built on our extended Qwen2.5 language model, which supports a context window of 256K tokens.

We recommend using this model with up to 256 input frames.
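To stay within the 256-frame limit, longer videos can be downsampled by uniform frame sampling. The sketch below is illustrative only; the function name and sampling strategy are our assumptions, not the model's official preprocessing:

```python
import numpy as np

def sample_frame_indices(total_frames: int, max_frames: int = 256) -> list[int]:
    """Uniformly sample up to max_frames frame indices from a video.

    Illustrative helper, not part of the ProLongVid codebase.
    """
    if total_frames <= max_frames:
        # Short video: keep every frame
        return list(range(total_frames))
    # Evenly spaced indices spanning the whole video
    return np.linspace(0, total_frames - 1, max_frames).round().astype(int).tolist()

idx = sample_frame_indices(10_000)
print(len(idx), idx[0], idx[-1])  # → 256 0 9999
```

The sampled indices would then be used to decode only the selected frames (e.g. with decord or OpenCV) before passing them to the model.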

## Citation

```bibtex
@inproceedings{wang2025prolongvid,
  title={ProLongVid: A Simple but Strong Baseline for Long-context Video Instruction Tuning},
  author={Wang, Rui and Li, Bohao and Dai, Xiyang and Yang, Jianwei and Chen, Yi-Ling and Xing, Zhen and Yang, Yifan and Chen, Dongdong and Qiu, Xipeng and Wu, Zuxuan and others},
  booktitle={EMNLP},
  year={2025}
}
```
Model size: 8B params (Safetensors, F16)