Add link to paper in description

#20
opened by nielsr (HF Staff)
Files changed (1):
1. README.md (+9 -9)
README.md CHANGED
@@ -1,6 +1,6 @@
 ---
-library_name: transformers
-license: apache-2.0
+base_model:
+- HuggingFaceTB/SmolVLM-Instruct
 datasets:
 - HuggingFaceM4/the_cauldron
 - HuggingFaceM4/Docmatix
@@ -14,21 +14,21 @@ datasets:
 - TIGER-Lab/VISTA-400K
 - Enxin/MovieChat-1K_train
 - ShareGPT4Video/ShareGPT4Video
+language:
+- en
+library_name: transformers
+license: apache-2.0
 pipeline_tag: image-text-to-text
 tags:
 - video-text-to-text
-language:
-- en
-base_model:
-- HuggingFaceTB/SmolVLM-Instruct
 ---
 
-
+```markdown
 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/SmolVLM2_banner.png" width="800" height="auto" alt="Image description">
 
 # SmolVLM2 2.2B
 
-SmolVLM2-2.2B is a lightweight multimodal model designed to analyze video content. The model processes videos, images, and text inputs to generate text outputs - whether answering questions about media files, comparing visual content, or transcribing text from images. Despite its compact size, requiring only 5.2GB of GPU RAM for video inference, it delivers robust performance on complex multimodal tasks. This efficiency makes it particularly well-suited for on-device applications where computational resources may be limited.
+SmolVLM2-2.2B is a lightweight multimodal model designed to analyze video content. The model processes videos, images, and text inputs to generate text outputs - whether answering questions about media files, comparing visual content, or transcribing text from images. The model is described in the paper [](https://huggingface.co/papers/2504.05299). Despite its compact size, requiring only 5.2GB of GPU RAM for video inference, it delivers robust performance on complex multimodal tasks. This efficiency makes it particularly well-suited for on-device applications where computational resources may be limited.
 ## Model Summary
 
 - **Developed by:** Hugging Face 🤗
@@ -282,4 +282,4 @@ In the following plots we give a general overview of the samples across modaliti
 | vista-400k/combined | 2.2% |
 | vript/long | 1.0% |
 | ShareGPT4Video/all | 0.8% |
-
+```
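
For context on the description edited above, here is a minimal usage sketch with transformers. It assumes the standard image-text-to-text API (`AutoModelForImageTextToText` plus a multimodal chat template); the repo id `HuggingFaceTB/SmolVLM2-2.2B-Instruct`, the example image URL, and the message layout are assumptions for illustration, not content of this PR.

```python
# Minimal sketch: image question answering with the model described in the README.
# The repo id below is an assumption; adjust it to the actual model repository.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"  # assumed repo id
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to(device)

# One image plus a text question; video inputs follow the same chat-template pattern.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/SmolVLM2_banner.png",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Tokenize the multimodal chat and move tensors to the model's device/dtype.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```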