Add link to paper in description

#20
opened by nielsr (HF Staff)
Files changed (1):
1. README.md (+9 -9)
README.md CHANGED
@@ -1,6 +1,6 @@
 ---
-library_name: transformers
-license: apache-2.0
+base_model:
+- HuggingFaceTB/SmolVLM-Instruct
 datasets:
 - HuggingFaceM4/the_cauldron
 - HuggingFaceM4/Docmatix
@@ -14,21 +14,21 @@ datasets:
 - TIGER-Lab/VISTA-400K
 - Enxin/MovieChat-1K_train
 - ShareGPT4Video/ShareGPT4Video
+language:
+- en
+library_name: transformers
+license: apache-2.0
 pipeline_tag: image-text-to-text
 tags:
 - video-text-to-text
-language:
-- en
-base_model:
-- HuggingFaceTB/SmolVLM-Instruct
 ---
 
-
+```markdown
 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/SmolVLM2_banner.png" width="800" height="auto" alt="Image description">
 
 # SmolVLM2 2.2B
 
-SmolVLM2-2.2B is a lightweight multimodal model designed to analyze video content. The model processes videos, images, and text inputs to generate text outputs - whether answering questions about media files, comparing visual content, or transcribing text from images. Despite its compact size, requiring only 5.2GB of GPU RAM for video inference, it delivers robust performance on complex multimodal tasks. This efficiency makes it particularly well-suited for on-device applications where computational resources may be limited.
+SmolVLM2-2.2B is a lightweight multimodal model designed to analyze video content. The model processes videos, images, and text inputs to generate text outputs - whether answering questions about media files, comparing visual content, or transcribing text from images. The model is described in the paper [](https://huggingface.co/papers/2504.05299). Despite its compact size, requiring only 5.2GB of GPU RAM for video inference, it delivers robust performance on complex multimodal tasks. This efficiency makes it particularly well-suited for on-device applications where computational resources may be limited.
 ## Model Summary
 
 - **Developed by:** Hugging Face 🤗
@@ -282,4 +282,4 @@ In the following plots we give a general overview of the samples across modaliti
 | vista-400k/combined | 2.2% |
 | vript/long | 1.0% |
 | ShareGPT4Video/all | 0.8% |
-
+```
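
For context on the description edited above, here is a minimal usage sketch with transformers. It assumes the standard image-text-to-text API (`AutoModelForImageTextToText` plus a multimodal chat template); the repo id `HuggingFaceTB/SmolVLM2-2.2B-Instruct`, the example image URL, and the message layout are assumptions for illustration, not content of this PR.

```python
# Minimal sketch: image question answering with the model described in the README.
# The repo id below is an assumption; adjust it to the actual model repository.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"  # assumed repo id
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to(device)

# One image plus a text question; video inputs follow the same chat-template pattern.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/SmolVLM2_banner.png",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Tokenize the multimodal chat and move tensors to the model's device/dtype.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```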