InternVideo2-B14

Vision-Language Model and StreamMamba checkpoints

License: Apache-2.0

Overview

InternVideo2-B14 is a family of pre-trained vision-language models designed for cross-modal video-text understanding, vision-language alignment, and efficient deployment. This repository provides modular checkpoints for various downstream tasks, including video classification and frame-skipping systems.

Base Model: OpenGVLab/InternVideo2_distillation_models

Pipeline Tag: video-classification (supports vision-language and video-only tasks)


Model Details

Included Checkpoints

| Filename | Size | Description |
|---|---|---|
| `cross_mamba_film_warmup.pt` | 504 MB | Cross-modal model combining vision and text using FiLM (Feature-wise Linear Modulation) and Mamba layers for temporal modeling. |
| `mamba_mobileclip_ckpt.pt` | 500 MB | StreamMamba temporal aggregator trained on MobileCLIP embeddings (no FiLM). Checkpoint 6900. |
| `internvideo2_clip.pt` | 5.55 MB | CLIP-style vision-language alignment component for InternVideo2-B14. |
| `internvideo2_vision.pt` | 205 MB | Vision encoder backbone (InternVideo2-B14) for video feature extraction. |
| `mobileclip_blt.pt` | 599 MB | Lightweight MobileCLIP variant (BLT) for resource-constrained applications. |
| `lstm_ckpt.pt` | 530 MB | InternVideo2-B14 and MobileCLIP weights, plus a trained LSTM used for ablations against Mamba. |
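The checkpoints above are assumed to be standard PyTorch state dicts saved with `torch.save`. A minimal sketch of loading and inspecting one (the dummy file and key name here are placeholders; substitute a real filename such as `internvideo2_vision.pt`):

```python
import torch

# Simulate a checkpoint so the snippet is self-contained;
# in practice you would skip this and point at a downloaded .pt file.
dummy = {"vision_encoder.proj.weight": torch.zeros(4, 4)}
torch.save(dummy, "demo_ckpt.pt")

# map_location="cpu" lets you inspect weights without a GPU.
state = torch.load("demo_ckpt.pt", map_location="cpu")
for name, tensor in state.items():
    print(name, tuple(tensor.shape))
```

Listing parameter names and shapes this way is a quick check that a downloaded checkpoint matches the component you expect.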

StreamMamba Self-Predictive Frame Skipping (SPFS)

The spfs_r64 folder contains a self-contained system for adaptive frame skipping in videos. Each checkpoint file includes:

  • MobileCLIP vision/text encoders
  • InternVideo2-B14 vision encoder weights
  • Mamba temporal aggregator (merged from mamba_mobileclip_ckpt.pt)
  • SPFS-specific weights for frame selection
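Since each SPFS checkpoint bundles several sub-modules in one file, a common pattern is to split the merged state dict by key prefix before loading each component. A sketch of that pattern (the prefixes `mobileclip.`, `internvideo2.`, `mamba.`, and `spfs.` are illustrative assumptions, not the checkpoint's actual key names; plain floats stand in for tensors):

```python
# Hypothetical merged state dict; real checkpoints map names to tensors.
merged = {
    "mobileclip.visual.w": 0.0,
    "internvideo2.blocks.0.w": 0.0,
    "mamba.layer.w": 0.0,
    "spfs.head.w": 0.0,
}

def split_by_prefix(state, prefix):
    # Keep only keys under `prefix`, stripping it so the resulting
    # sub-dict can be passed to that sub-module's load_state_dict.
    return {k[len(prefix):]: v for k, v in state.items() if k.startswith(prefix)}

mamba_sd = split_by_prefix(merged, "mamba.")
print(sorted(mamba_sd))  # ['layer.w']
```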