# InternVideo2-B14

Vision-Language Model and StreamMamba checkpoints

License: Apache-2.0

## Overview
InternVideo2-B14 is a family of pre-trained vision-language models designed for cross-modal video-text understanding, vision-language alignment, and efficient deployment. This repository provides modular checkpoints for downstream tasks including video classification and adaptive frame skipping.
- Base Model: OpenGVLab/InternVideo2_distillation_models
- Pipeline Tag: video-classification (supports vision-language and video-only tasks)
## Model Details

### Included Checkpoints
| Filename | Size | Description |
|---|---|---|
| `cross_mamba_film_warmup.pt` | 504 MB | Cross-modal model combining vision and text using FiLM (Feature-wise Linear Modulation) and Mamba layers for temporal modeling (see the FiLM sketch below the table). |
| `mamba_mobileclip_ckpt.pt` | 500 MB | StreamMamba temporal aggregator trained on MobileCLIP embeddings (no FiLM). Checkpoint 6900. |
| `internvideo2_clip.pt` | 5.55 MB | CLIP-style vision-language alignment component for InternVideo2-B14. |
| `internvideo2_vision.pt` | 205 MB | Vision encoder backbone (InternVideo2-B14) for video feature extraction. |
| `mobileclip_blt.pt` | 599 MB | Lightweight MobileCLIP variant (BLT) for resource-constrained applications. |
| `lstm_ckpt.pt` | 530 MB | InternVideo2-B14 and MobileCLIP weights, plus a trained LSTM used for ablations against Mamba. |
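As a rough illustration of the FiLM conditioning used in `cross_mamba_film_warmup.pt`: a text embedding produces a per-channel scale and shift that modulate the visual features. The sketch below is a minimal, generic FiLM layer; the dimensions, module names, and placement are assumptions for illustration, not the checkpoint's exact architecture.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: a text embedding yields a per-channel
    scale (gamma) and shift (beta) applied to visual features."""

    def __init__(self, text_dim: int, feat_dim: int):
        super().__init__()
        self.to_gamma = nn.Linear(text_dim, feat_dim)
        self.to_beta = nn.Linear(text_dim, feat_dim)

    def forward(self, visual: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # visual: (batch, time, feat_dim) per-frame features
        # text:   (batch, text_dim) pooled text embedding
        gamma = self.to_gamma(text).unsqueeze(1)  # (batch, 1, feat_dim)
        beta = self.to_beta(text).unsqueeze(1)
        return gamma * visual + beta  # broadcasts over the time axis

# Hypothetical dimensions, purely for demonstration:
film = FiLM(text_dim=512, feat_dim=768)
out = film(torch.randn(2, 8, 768), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 8, 768])
```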
### StreamMamba Self-Predictive Frame Skipping (SPFS)
The `spfs_r64` folder contains a self-contained system for adaptive frame skipping in videos. Each checkpoint file includes (see the inspection sketch after this list):

- MobileCLIP vision/text encoders
- InternVideo2-B14 vision encoder weights
- Mamba temporal aggregator (merged from `mamba_mobileclip_ckpt.pt`)
- SPFS-specific weights for frame selection
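To verify which of these components a given file bundles, you can inspect its top-level keys. The sketch below assumes the checkpoints are standard `torch.save` dictionaries; the filename `spfs_r64/checkpoint.pt` is a placeholder for whichever file you download from the folder.

```python
import torch

# Placeholder path; substitute the actual file downloaded from spfs_r64/.
# On PyTorch >= 2.6, add weights_only=False if the file stores non-tensor metadata.
ckpt = torch.load("spfs_r64/checkpoint.pt", map_location="cpu")

# Assuming a standard torch.save dictionary, list the top-level entries to see
# which components (MobileCLIP, InternVideo2 vision, Mamba, SPFS) are bundled.
if isinstance(ckpt, dict):
    for key, value in ckpt.items():
        desc = tuple(value.shape) if torch.is_tensor(value) else type(value).__name__
        print(key, desc)
```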