InternVideo2-B14

Vision-Language Model and StreamMamba checkpoints

License: Apache-2.0

Overview

InternVideo2-B14 is a family of pre-trained vision-language models designed for cross-modal video-text understanding, vision-language alignment, and efficient deployment. This repository provides modular checkpoints for various downstream tasks, including video classification and frame-skipping systems.

Base Model: OpenGVLab/InternVideo2_distillation_models

Pipeline Tag: video-classification (supports vision-language and video-only tasks)


Model Details

Included Checkpoints

| Filename | Size | Description |
|---|---|---|
| `cross_mamba_film_warmup.pt` | 504 MB | Cross-modal model combining vision and text using FiLM (Feature-wise Linear Modulation) and Mamba layers for temporal modeling. |
| `mamba_mobileclip_ckpt.pt` | 500 MB | StreamMamba temporal aggregator trained on MobileCLIP embeddings (no FiLM). Checkpoint 6900. |
| `internvideo2_clip.pt` | 5.55 MB | CLIP-style vision-language alignment component for InternVideo2-B14. |
| `internvideo2_vision.pt` | 205 MB | Vision encoder backbone (InternVideo2-B14) for video feature extraction. |
| `mobileclip_blt.pt` | 599 MB | Lightweight MobileCLIP variant (BLT) for resource-constrained applications. |
| `lstm_ckpt.pt` | 530 MB | InternVideo2-B14 and MobileCLIP weights, plus a trained LSTM used for ablations against Mamba. |
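The checkpoints above are assumed to be standard PyTorch state dicts saved with `torch.save`. A minimal sketch of loading and inspecting one (the dummy file and key name here are placeholders; substitute a real filename such as `internvideo2_vision.pt`):

```python
import torch

# Simulate a checkpoint so the snippet is self-contained;
# in practice you would skip this and point at a downloaded .pt file.
dummy = {"vision_encoder.proj.weight": torch.zeros(4, 4)}
torch.save(dummy, "demo_ckpt.pt")

# map_location="cpu" lets you inspect weights without a GPU.
state = torch.load("demo_ckpt.pt", map_location="cpu")
for name, tensor in state.items():
    print(name, tuple(tensor.shape))
```

Listing parameter names and shapes this way is a quick check that a downloaded checkpoint matches the component you expect.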

StreamMamba Self-Predictive Frame Skipping (SPFS)

The spfs_r64 folder contains a self-contained system for adaptive frame skipping in videos. Each checkpoint file includes:

  • MobileCLIP vision/text encoders
  • InternVideo2-B14 vision encoder weights
  • Mamba temporal aggregator (merged from mamba_mobileclip_ckpt.pt)
  • SPFS-specific weights for frame selection
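Since each SPFS checkpoint bundles several sub-modules in one file, a common pattern is to split the merged state dict by key prefix before loading each component. A sketch of that pattern (the prefixes `mobileclip.`, `internvideo2.`, `mamba.`, and `spfs.` are illustrative assumptions, not the checkpoint's actual key names; plain floats stand in for tensors):

```python
# Hypothetical merged state dict; real checkpoints map names to tensors.
merged = {
    "mobileclip.visual.w": 0.0,
    "internvideo2.blocks.0.w": 0.0,
    "mamba.layer.w": 0.0,
    "spfs.head.w": 0.0,
}

def split_by_prefix(state, prefix):
    # Keep only keys under `prefix`, stripping it so the resulting
    # sub-dict can be passed to that sub-module's load_state_dict.
    return {k[len(prefix):]: v for k, v in state.items() if k.startswith(prefix)}

mamba_sd = split_by_prefix(merged, "mamba.")
print(sorted(mamba_sd))  # ['layer.w']
```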