FastVLM-0.5B

This version of FastVLM-0.5B has been converted to run on the Axera NPU using w8a16 quantization.

This model has been optimized with the following LoRA:

Compatible with Pulsar2 version: 5.1-patch1.

Note that the AX650 model has a context length of 1K tokens and a maximum prefill length of 640 tokens, while the AX620E/AX630C model has a context length of 512 tokens and a maximum prefill length of 256 tokens.
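The limits above can be checked before a prompt is sent to the model. The following is a minimal sketch, not part of the repository scripts: `LIMITS`, `fits_prefill`, and `max_new_tokens` are hypothetical names, the 1K context is taken as 1024 tokens (matching the `CTX1024` directory name), and it assumes that prompt tokens and generated tokens share the same context window.

```python
# Context and prefill limits as stated in this README.
# LIMITS, fits_prefill and max_new_tokens are illustrative names only;
# they do not exist in the repository's inference scripts.
LIMITS = {
    "ax650": {"context": 1024, "max_prefill": 640},
    "ax620e": {"context": 512, "max_prefill": 256},  # AX620E / AX630C
}


def fits_prefill(chip: str, prompt_tokens: int) -> bool:
    """Return True if a prompt of `prompt_tokens` fits the chip's prefill limit."""
    return prompt_tokens <= LIMITS[chip]["max_prefill"]


def max_new_tokens(chip: str, prompt_tokens: int) -> int:
    """Tokens left for generation, assuming the prompt shares the context window."""
    if not fits_prefill(chip, prompt_tokens):
        raise ValueError(f"prompt of {prompt_tokens} tokens exceeds the prefill limit")
    return LIMITS[chip]["context"] - prompt_tokens


if __name__ == "__main__":
    print(fits_prefill("ax620e", 300))  # False, because 300 > 256
    print(max_new_tokens("ax650", 99))  # 925, i.e. 1024 - 99
```

The token count itself would come from the tokenizer; how image tokens count against the prefill budget is not stated in this README.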

Conversion tool links:

If you are interested in model conversion, you can quantize the model from the original repo and export it as an axmodel:

https://huggingface.co/apple/FastVLM-0.5B

How to convert an LLM from Hugging Face to axmodel [TODO]

Supported Platforms

- AX650
  - AX650N DEMO Board
  - M4N-Dock(爱芯派Pro)
  - M.2 Accelerator card
- AX630C

Chip	Image encoder latency	TTFT	Decode speed (w4a16)
AX650N	44.572 ms (512x512)	94.532 ms (99 tokens)	34.81 tokens/sec
AX630C	205.961 ms (512x512)	489.013 ms (99 tokens)	11.67 tokens/sec
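The figures in the table can be combined into a rough response-time estimate. The sketch below assumes that total latency is the image encoder time plus TTFT plus the decode time for the generated tokens, and that the TTFT measured with a 99-token prompt is representative. Both are simplifications, and `BENCH` and `estimate_response_ms` are hypothetical names used only for illustration.

```python
# Benchmark figures copied from the table above (latencies in milliseconds).
BENCH = {
    "AX650N": {"encoder_ms": 44.572, "ttft_ms": 94.532, "tokens_per_sec": 34.81},
    "AX630C": {"encoder_ms": 205.961, "ttft_ms": 489.013, "tokens_per_sec": 11.67},
}


def estimate_response_ms(chip: str, new_tokens: int) -> float:
    """Rough latency estimate: image encoder + TTFT + decode time.

    Assumes TTFT does not already include the image encoder time, which
    this README does not state either way.
    """
    b = BENCH[chip]
    decode_ms = new_tokens / b["tokens_per_sec"] * 1000.0
    return b["encoder_ms"] + b["ttft_ms"] + decode_ms


if __name__ == "__main__":
    for chip in BENCH:
        print(chip, round(estimate_response_ms(chip, 100), 1), "ms for 100 new tokens")
```

For long answers the decode term dominates, so the tokens/sec column matters more than the encoder or TTFT figures.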

How to use

Download all files from this repository to the device

$ tree -L 1
.
|-- README.md
|-- config.json
|-- embeds
|-- fastvlm_C128_CTX1024_P640_ax650
|-- fastvlm_C128_CTX512_P256_ax620e
|-- fastvlm_tokenizer
|-- images
|-- infer_axmodel_620e.py
|-- infer_axmodel_650.py
|-- requirements.txt
`-- utils

6 directories, 5 files
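Before running inference, it can help to confirm that the download is complete. The following sketch checks for the top-level entries shown in the listing above; `EXPECTED` and `missing_entries` are hypothetical names and are not part of the repository.

```python
from pathlib import Path
from typing import List

# Top-level entries copied from the `tree -L 1` listing above.
EXPECTED = [
    "README.md",
    "config.json",
    "embeds",
    "fastvlm_C128_CTX1024_P640_ax650",
    "fastvlm_C128_CTX512_P256_ax620e",
    "fastvlm_tokenizer",
    "images",
    "infer_axmodel_620e.py",
    "infer_axmodel_650.py",
    "requirements.txt",
    "utils",
]


def missing_entries(root: str = ".") -> List[str]:
    """Return the expected files or directories that are absent under `root`."""
    base = Path(root)
    return [name for name in EXPECTED if not (base / name).exists()]


if __name__ == "__main__":
    missing = missing_entries(".")
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All expected files and directories are present.")
```

Only one of the two `fastvlm_C128_*` model directories is needed for a given board, so a missing directory for the other chip is harmless.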

Install the Python dependencies

pip install -r requirements.txt

Inference with AX630C Host

Run the following command on the Axera board to start a chat conversation:

$ python3 infer_axmodel_620e.py -v ./fastvlm_C128_CTX512_P256_ax620e/image_encoder_512x512_ax620e.axmodel -m ./fastvlm_C128_CTX512_P256_ax620e -t fastvlm_tokenizer -i 512

output:

[INFO] Available providers:  ['AxEngineExecutionProvider']
Loading config, tokenizer and init model.
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Chip type: ChipType.MC20E
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Engine version: 2.7.2a
[INFO] Model type: 1 (full core)
[INFO] Compiler version: 5.1-patch1-dirty 0a5b164f-dirty
Detected prefixes: ['llava_qwen2'], chosen: llava_qwen2, layers: 24
Init InferenceSession:   0%|                                                                                                                          | 0/24 [00:00<?, ?it/s][INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 1 (full core)
[INFO] Compiler version: 5.1-patch1-dirty 0a5b164f-dirty
Init InferenceSession:   4%|████▊                                                                                                             | 1/24 [00:02<00:00,  9.25it/s]
...
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 1 (full core)
[INFO] Compiler version: 5.1-patch1-dirty 0a5b164f-dirty
Init InferenceSession: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [00:02<00:00,  9.12it/s]
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 1 (full core)
[INFO] Compiler version: 5.1-patch1-dirty 0a5b164f-dirty
Model loaded successfully!
[INFO]: Enter text to chat, enter an image path for image understanding, or enter q to quit.
prompt<<who are you
slice_indices: [0]
Slice prefill done: 0
answer >> I'm an AI language model, I don't have personal identity or a physical body. I exist solely as a digital creation created by Apple. I don't have a name or a personal identity. I'm designed to assist and provide information to users. Is there anything else I can help you with?

prompt<<./images/ssd_horse.jpg
slice_indices: [0]
Slice prefill done: 0
answer >> The image depicts a person riding a brown horse with a white blaze on its face. The rider is wearing a blue hoodie and blue jeans, and is holding the reins of the horse. The horse is standing on a dirt ground with some grass and trees in the background. The rider is also holding a rope in their left hand, which is attached to the horse's harness.

To the left of the horse, there is a brown dog standing on the ground, looking up at the rider. The dog appears to be in a begging or pleading position, with its front paws raised and its mouth open.

In the background, there is a gray pickup truck parked on the grass, and a wooden fence can be seen behind the horse and rider. There are also some people visible in the background, including a person in a red shirt and another person in a blue shirt. The overall scene appears to be taking place in an outdoor setting, possibly a ranch or a farm.

prompt<<./images/image_1.jpg
slice_indices: [0]
Slice prefill done: 0
answer >> The image depicts a panda bear in a naturalistic enclosure, likely within a zoo or wildlife sanctuary. The panda is sitting on its hind legs, with its front paws resting on a wooden structure that resembles a tree stump. The panda's distinctive black and white fur is clearly visible, with the black fur covering its ears, eyes, and the area around its nose and mouth, while the white fur covers the rest of its body. The panda's black nose and the black fur around its mouth are also visible.

The panda is surrounded by green foliage, including bamboo shoots and other plants, which are typical of a panda's natural habitat. The ground appears to be covered with dirt and small rocks, and there are some larger rocks and a tree stump in the background. The lighting in the image suggests that it was taken during the daytime, with natural light illuminating the scene. The overall setting appears to be a well-maintained and naturalistic enclosure designed to mimic the panda's natural environment.

prompt<<q
[INFO]: Conversation ended. Goodbye.

Inference with AX650 Host, such as M4N-Dock(爱芯派Pro) or AX650 DEMO Board

Run the following command on the Axera board to start a chat conversation:

$ python3 infer_axmodel_650.py -v ./fastvlm_C128_CTX1024_P640_ax650/image_encoder_512x512_0.5b_ax650.axmodel -m ./fastvlm_C128_CTX1024_P640_ax650 -t fastvlm_tokenizer -i 512

output:

[INFO] Available providers:  ['AxEngineExecutionProvider', 'AXCLRTExecutionProvider']
Loading config, tokenizer and init model.
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Chip type: ChipType.MC50
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Engine version: 2.12.0s
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1-dirty 0a5b164f-dirty
Detected prefixes: ['llava_qwen2'], chosen: llava_qwen2, layers: 24
Init InferenceSession:   0%|                                                                                                                          | 0/24 [00:00<?, ?it/s][INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1-dirty 0a5b164f-dirty
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1-dirty 0a5b164f-dirty
Init InferenceSession:   8%|█████████▌                                                                                                        | 2/24 [00:00<00:01, 17.39it/s][INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1-dirty 0a5b164f-dirty
[INFO] Using provider: AxEngineExecutionProvider
...
Init InferenceSession: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [00:00<00:00, 24.30it/s]
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1-dirty 0a5b164f-dirty
Model loaded successfully!
[INFO]: Enter text to chat, enter an image path for image understanding, or enter q to quit.
prompt<<who are you
slice_indices: [0]
Slice prefill done: 0
answer >> I'm an AI language model, I don't have personal identity or a physical body. I exist solely as a digital entity designed to assist and provide information to users. I don't have a name or a personal identity, but I can provide information and answer questions based on my training data and algorithms. Is there something specific you would like to know about me?

prompt<<./images/ssd_horse.jpg
slice_indices: [0]
Slice prefill done: 0
answer >> The image depicts a person riding a brown horse with a white blaze on its face. The rider is wearing a gray hoodie and blue jeans, and is holding the reins of the horse. The horse is standing on a dirt ground with some grass and trees in the background.

To the left of the horse, there is a brown dog sitting on the ground. The dog is looking up at the rider with its mouth open, as if it is begging or reacting to something.

In the background, there is a gray pickup truck parked on the grass, and a person wearing a red shirt and blue jeans is standing near the truck. There is also a wooden fence and some trees in the background.

The overall scene appears to be taking place in a rural or outdoor setting, possibly a farm or ranch.

prompt<<./images/image_1.jpg
slice_indices: [0]
Slice prefill done: 0
answer >> The image depicts a panda bear in a naturalistic enclosure, likely within a zoo or wildlife sanctuary. The panda is lying on its stomach with its head resting on its front paws, appearing relaxed and content. The panda's distinctive black and white fur is clearly visible, with the black fur covering its ears, eyes, and limbs, while the white fur covers its face, neck, and the underside of its body. The panda's black nose and mouth are also visible.

The panda is surrounded by green foliage, including bamboo shoots and other plants, which are typical of a panda's natural habitat. In the background, there is a wooden structure that resembles a tree stump or a small tree, adding to the naturalistic setting. The ground is covered with dirt and leaves, further emphasizing the natural environment.

The lighting in the image is natural, suggesting that the photo was taken during the day. The overall scene conveys a sense of tranquility and the panda's comfort in its environment.

prompt<<q
[INFO]: Conversation ended. Goodbye.
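The two launch commands differ only in the script name, the model directory, and the image encoder file. The sketch below assembles either command from a small lookup table. The script names, paths, and flag values are copied verbatim from the commands in this README; `COMMANDS` and `build_command` are hypothetical helper names and not part of the repository.

```python
from typing import List

# Per-chip values copied from the two commands shown in this README.
COMMANDS = {
    "ax620e": {  # AX620E / AX630C
        "script": "infer_axmodel_620e.py",
        "model_dir": "./fastvlm_C128_CTX512_P256_ax620e",
        "encoder": "image_encoder_512x512_ax620e.axmodel",
    },
    "ax650": {
        "script": "infer_axmodel_650.py",
        "model_dir": "./fastvlm_C128_CTX1024_P640_ax650",
        "encoder": "image_encoder_512x512_0.5b_ax650.axmodel",
    },
}


def build_command(chip: str) -> List[str]:
    """Return the argument list for launching the chat script on `chip`."""
    c = COMMANDS[chip]
    return [
        "python3", c["script"],
        "-v", f"{c['model_dir']}/{c['encoder']}",
        "-m", c["model_dir"],
        "-t", "fastvlm_tokenizer",
        "-i", "512",  # value copied verbatim from the README commands
    ]


if __name__ == "__main__":
    print(" ".join(build_command("ax650")))
```

The returned list can be passed to `subprocess.run` on the board, or joined with spaces and pasted into a shell as shown.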
