This version of SmolVLM2-500M-Video-Instruct has been converted to run on the Axera NPU using w8a16 quantization.
Compatible with Pulsar2 version: 4.0
If you are interested in the model conversion itself, you can export the axmodel yourself from the original repo:
https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct
See the Pulsar2 documentation: How to Convert LLM from Huggingface to axmodel.
Download all files from this repository to the device.
Using AX650 Board
ai@ai-bj ~/yongqiang/SmolVLM2-500M-Video-Instruct $ tree -L 1
.
├── assets
├── embeds
├── infer_axmodel.py
├── README.md
├── smolvlm2_axmodel
├── smolvlm2_tokenizer
└── vit_mdoel
5 directories, 2 files
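After downloading, a quick sanity check that the on-device layout matches the tree above can save a failed run later. This is a minimal sketch; the expected directory and file names are taken verbatim from the tree output (including the `vit_mdoel` spelling as it appears in this repository):

```python
from pathlib import Path

# Expected layout, copied from the `tree -L 1` output above
# (including the `vit_mdoel` spelling as it appears in this repository).
EXPECTED_DIRS = [
    "assets",
    "embeds",
    "smolvlm2_axmodel",
    "smolvlm2_tokenizer",
    "vit_mdoel",
]
EXPECTED_FILES = ["infer_axmodel.py", "README.md"]

def check_layout(root):
    """Return a list of expected entries missing under `root`."""
    root = Path(root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing

if __name__ == "__main__":
    missing = check_layout(".")
    if missing:
        print("missing entries:", ", ".join(missing))
    else:
        print("layout OK")
```

Run it from the repository root on the board; an empty result means all expected entries are present.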
Multimodal Understanding
input image
input text:
Can you describe this image?
log information:
ai@ai-bj ~/yongqiang/SmolVLM2-500M-Video-Instruct $ python3 infer_axmodel.py
input prompt: Can you describe this image?
answer >> The image depicts a close-up view of a pink flower with a bee on it. The bee, which appears to be a bumblebee, is perched on the flower's center, which is surrounded by a cluster of other flowers. The bee is in the process of collecting nectar from the flower, which is a common behavior for bees. The flower itself has a yellow center with a cluster of yellow stamens surrounding it. The petals of the flower are a vibrant shade of pink, and the bee is positioned very close to the camera, making it the focal point of the image. The background of the image is slightly blurred, but it appears to be a garden or a field with other flowers and plants, contributing to the overall natural setting of the image.
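The answer above is generated autoregressively: the script prefills the prompt, then decodes one token per forward pass until an end-of-sequence token. The general loop can be sketched with a toy stand-in for the quantized model; `toy_model`, the token ids, and `EOS_ID` are all hypothetical placeholders, not taken from `infer_axmodel.py`:

```python
# Minimal sketch of prefill + greedy decode, the usual loop behind
# scripts like infer_axmodel.py. `toy_model` is a hypothetical stand-in
# for the quantized axmodel: it maps a token sequence to next-token logits.

EOS_ID = 0  # hypothetical end-of-sequence token id

def toy_model(tokens):
    # Stand-in "model": predicts a countdown, then EOS.
    # A real run would invoke the NPU runtime here instead.
    nxt = tokens[-1] - 1 if tokens[-1] > 1 else EOS_ID
    logits = [0.0] * 8
    logits[nxt] = 1.0
    return logits

def greedy_decode(prompt_tokens, model, max_new_tokens=16):
    """Prefill with the prompt, then append the argmax token each step."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model(tokens)                 # one forward pass per step
        next_id = max(range(len(logits)), key=logits.__getitem__)
        if next_id == EOS_ID:                  # stop at end-of-sequence
            break
        tokens.append(next_id)
    return tokens[len(prompt_tokens):]

print(greedy_decode([5], toy_model))  # counts down: [4, 3, 2, 1]
```

In the real script the image is first encoded by the ViT model (`vit_mdoel`) into embeddings that are prefilled alongside the text prompt; only the decode loop differs in scale, not in shape.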
Base model: HuggingFaceTB/SmolLM2-360M