---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
library_name: nanovlm
license: mit
pipeline_tag: image-text-to-text
tags:
- vision-language
- multimodal
- research
---
**nanoVLM** is a minimal, lightweight Vision-Language Model (VLM) designed for efficient training and experimentation. The entire model architecture and training logic are written in pure PyTorch and fit within ~750 lines of code. nanoVLM combines a ViT-based image encoder (SigLIP-B/16-224-85M) with a lightweight causal language model (SmolLM2-135M), resulting in a compact 222M-parameter model.
For more information, check out the base model at https://huggingface.co/lusxvr/nanoVLM-222M.
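Conceptually, the two backbones are glued together by a small modality projection: patch embeddings from the ViT are projected into the language model's embedding space and prepended to the text-token embeddings before causal decoding. The sketch below illustrates that wiring only; the class, attribute names, and dimensions (768 for SigLIP-B, 576 for SmolLM2-135M) are illustrative assumptions, not the repository's actual code.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the VLM wiring (not the repository's actual classes):
# ViT patch embeddings -> linear modality projection -> prepended to text
# embeddings -> causal language model.
class ToyVLM(nn.Module):
    def __init__(self, vision_encoder, language_model, vision_dim=768, lm_dim=576):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. a SigLIP-B/16 backbone
        self.language_model = language_model   # e.g. a SmolLM2-135M decoder
        self.projection = nn.Linear(vision_dim, lm_dim)  # modality projector

    def forward(self, pixel_values, input_ids):
        # Assumes the encoder returns (batch, num_patches, vision_dim) and the
        # decoder exposes embed_tokens() and accepts inputs_embeds.
        image_embeds = self.projection(self.vision_encoder(pixel_values))
        text_embeds = self.language_model.embed_tokens(input_ids)
        # Image tokens are prepended to the text sequence for decoding.
        inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs_embeds)
```

See the nanoVLM source for the real implementation.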
**Usage:**
Clone the nanoVLM repository from https://github.com/huggingface/nanoVLM, follow the installation instructions, and then run the following code:
```python
from models.vision_language_model import VisionLanguageModel

# Download the pretrained nanoVLM weights from the Hugging Face Hub
model = VisionLanguageModel.from_pretrained("luodian/nanoVLM")
```
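Loading the weights alone does not produce text; you also need the repository's tokenizer and image-processor helpers. The snippet below sketches single-image generation in the style of the repository's generate.py script. The helper names (`get_tokenizer`, `get_image_processor`), the config fields, and the `generate()` signature are assumptions based on the repo at the time of writing and may differ in your checkout; `example.jpg` is a placeholder path.

```python
import torch
from PIL import Image

from models.vision_language_model import VisionLanguageModel
from data.processors import get_tokenizer, get_image_processor  # assumption: repo helpers

device = "cuda" if torch.cuda.is_available() else "cpu"
model = VisionLanguageModel.from_pretrained("luodian/nanoVLM").to(device).eval()

# Assumption: the loaded config exposes the tokenizer name and ViT image size.
tokenizer = get_tokenizer(model.cfg.lm_tokenizer)
image_processor = get_image_processor(model.cfg.vit_img_size)

prompt = "What is in this image?"
tokens = tokenizer(prompt, return_tensors="pt")["input_ids"].to(device)
image = image_processor(Image.open("example.jpg").convert("RGB")).unsqueeze(0).to(device)

# Assumption: generate() takes token ids, a preprocessed image tensor, and a
# max_new_tokens budget, mirroring the repo's generate.py.
out = model.generate(tokens, image, max_new_tokens=50)
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
```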