HuggingFaceTB/SmolVLM-256M-Instruct


andito HF Staff

Xenova HF Staff committed on Feb 3

Commit aaec0f3 (verified) · 1 Parent(s): ab233b7

Add ONNX sample code (#8)

- Add ONNX sample code (acc054744d60adea9d88214dab6e2219e6e512a7)

Co-authored-by: Joshua <[email protected]>

Files changed (1)

README.md +141 -0

README.md CHANGED

@@ -105,6 +105,147 @@ To summarize, the image captures a significant historical statue of liberty, sit
 """
 ```
+We also provide ONNX weights for the model, which you can run with ONNX Runtime as follows:
+<details>
+<summary>Click here to see the sample code</summary>
+```python
+from transformers import AutoConfig, AutoProcessor
+from transformers.image_utils import load_image
+import onnxruntime
+import numpy as np
+# 1. Load models
+## Load config and processor
+model_id = "HuggingFaceTB/SmolVLM-256M-Instruct"
+config = AutoConfig.from_pretrained(model_id)
+processor = AutoProcessor.from_pretrained(model_id)
+## Load sessions
+## !wget https://huggingface.co/HuggingFaceTB/SmolVLM-256M-Instruct/resolve/main/onnx/vision_encoder.onnx
+## !wget https://huggingface.co/HuggingFaceTB/SmolVLM-256M-Instruct/resolve/main/onnx/embed_tokens.onnx
+## !wget https://huggingface.co/HuggingFaceTB/SmolVLM-256M-Instruct/resolve/main/onnx/decoder_model_merged.onnx
+vision_session = onnxruntime.InferenceSession("vision_encoder.onnx")
+embed_session = onnxruntime.InferenceSession("embed_tokens.onnx")
+decoder_session = onnxruntime.InferenceSession("decoder_model_merged.onnx")
+## Set config values
+num_key_value_heads = config.text_config.num_key_value_heads
+head_dim = config.text_config.head_dim
+num_hidden_layers = config.text_config.num_hidden_layers
+eos_token_id = config.text_config.eos_token_id
+image_token_id = config.image_token_id
+# 2. Prepare inputs
+## Create input messages
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image"},
+            {"type": "text", "text": "Can you describe this image?"}
+        ]
+    },
+]
+## Load image and apply processor
+image = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
+prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
+inputs = processor(text=prompt, images=[image], return_tensors="np")
+## Prepare decoder inputs
+batch_size = inputs['input_ids'].shape[0]
+past_key_values = {
+    f'past_key_values.{layer}.{kv}': np.zeros([batch_size, num_key_value_heads, 0, head_dim], dtype=np.float32)
+    for layer in range(num_hidden_layers)
+    for kv in ('key', 'value')
+}
+image_features = None
+input_ids = inputs['input_ids']
+attention_mask = inputs['attention_mask']
+position_ids = np.cumsum(inputs['attention_mask'], axis=-1)
+# 3. Generation loop
+max_new_tokens = 1024
+generated_tokens = np.zeros((batch_size, 0), dtype=np.int64)
+for i in range(max_new_tokens):
+  inputs_embeds = embed_session.run(None, {'input_ids': input_ids})[0]
+  if image_features is None:
+    ## Only compute vision features if not already computed
+    image_features = vision_session.run(
+        ['image_features'],  # List of output names
+        {
+            'pixel_values': inputs['pixel_values'],
+            'pixel_attention_mask': inputs['pixel_attention_mask'].astype(np.bool_)
+        }
+    )[0]
+    ## Merge text and vision embeddings
+    inputs_embeds[inputs['input_ids'] == image_token_id] = image_features.reshape(-1, image_features.shape[-1])
+  logits, *present_key_values = decoder_session.run(None, dict(
+      inputs_embeds=inputs_embeds,
+      attention_mask=attention_mask,
+      position_ids=position_ids,
+      **past_key_values,
+  ))
+  ## Update values for next generation loop
+  input_ids = logits[:, -1].argmax(-1, keepdims=True)
+  attention_mask = np.ones_like(input_ids)
+  position_ids = position_ids[:, -1:] + 1
+  for j, key in enumerate(past_key_values):
+    past_key_values[key] = present_key_values[j]
+  generated_tokens = np.concatenate([generated_tokens, input_ids], axis=-1)
+  if (input_ids == eos_token_id).all():
+    break
+  ## (Optional) Streaming
+  print(processor.decode(input_ids[0]), end='', flush=True)
+print()
+# 4. Output result
+print(processor.batch_decode(generated_tokens))
+```
+Example output:
+```
+ The image depicts a large, historic statue of Liberty situated on a small island in a body of water. The statue is a green, cylindrical structure with a human figure at the top, which is the actual statue of Liberty. The statue is mounted on a pedestal that is supported by a cylindrical tower. The pedestal is rectangular and appears to be made of stone or a similar material. The statue is surrounded by a large, flat, rectangular area that is likely a base for the statue.
+In the background, there is a cityscape with a variety of buildings, including skyscrapers and high-rise buildings. The sky is clear with a gradient of colors, transitioning from a pale blue at the top to a deeper blue at the bottom. The buildings are mostly modern, with a mix of glass and concrete. The buildings are densely packed, with many skyscrapers and high-rise buildings visible.
+There are trees and greenery visible on the left side of the image, indicating that the statue is located near a park or a park area. The water in the foreground is calm, with small ripples indicating that the statue is in the water.
+The overall scene suggests a peaceful and serene environment, likely a public park or a park area in a city. The statue is likely a representation of liberty, representing the city's commitment to freedom and democracy.
+### Analysis and Description:
+#### Statue of Liberty:
+- **Location**: The statue is located on a small island in a body of water.
+- **Statue**: The statue is a green cylindrical structure with a human figure at the top, which is the actual statue of Liberty.
+- **Pedestal**: The pedestal is rectangular and supports the statue.
+- **Pedestrian**: The pedestal is surrounded by a flat rectangular area.
+- **Water**: The water is calm, with small ripples indicating that the statue is in the water.
+#### Cityscape:
+- **Buildings**: The buildings are modern, with a mix of glass and concrete.
+- **Sky**: The sky is clear with a gradient of colors, transitioning from a pale blue at the top to a deeper blue at the bottom.
+- **Trees**: There are trees and greenery visible on the left side of the image, indicating that the statue is located near a park or a park area.
+#### Environment:
+- **Water**: The water is calm, with small ripples indicating that the statue is in the water.
+- **Sky**: The sky is clear with a gradient of colors, transitioning from a pale blue at the top to a deeper blue at the bottom.
+### Conclusion:
+The image depicts a peaceful and serene public park or park area in a city, with the statue of Liberty prominently featured. The cityscape in the background includes modern buildings and a clear sky, suggesting a well-maintained public space.<end_of_utterance>
+```
+</details>
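Note on the sample above: three of its NumPy steps are easy to misread, namely the position IDs built with `np.cumsum`, the boolean-mask merge of image features into the text embeddings, and the key/value cache that starts at sequence length 0. The toy sketch below reproduces just that bookkeeping. It runs without any ONNX files and is not the model itself. The sizes, the `image_token_id` value of 99, and `fake_decoder_step` are all hypothetical, and it assumes the decoder returns the full cache (past plus new tokens), as the sample's replacement of `past_key_values` with `present_key_values` suggests.

```python
import numpy as np

# Toy sizes chosen only for illustration (not SmolVLM's real config values).
batch_size, num_kv_heads, head_dim, hidden = 1, 2, 4, 8
image_token_id = 99  # hypothetical placeholder ID

# Position IDs: the cumulative sum of the attention mask gives 1-based positions.
attention_mask = np.array([[1, 1, 1, 1, 1]])
position_ids = np.cumsum(attention_mask, axis=-1)

# Merge: rows of the text embeddings at image-token positions are overwritten,
# in order, with the flattened image features.
input_ids = np.array([[5, 99, 99, 7, 8]])
inputs_embeds = np.zeros((batch_size, input_ids.shape[1], hidden), dtype=np.float32)
image_features = np.ones((1, 2, hidden), dtype=np.float32)  # (images, patches, hidden)
inputs_embeds[input_ids == image_token_id] = image_features.reshape(-1, hidden)

# KV cache: starts with sequence length 0 and grows by the number of tokens
# fed at each step (the whole prompt first, then one token per step).
cache = np.zeros((batch_size, num_kv_heads, 0, head_dim), dtype=np.float32)

def fake_decoder_step(cache, num_new_tokens):
    """Hypothetical stand-in for decoder_session.run: appends empty slots to the cache."""
    new = np.zeros((batch_size, num_kv_heads, num_new_tokens, head_dim), dtype=np.float32)
    return np.concatenate([cache, new], axis=2)

cache = fake_decoder_step(cache, input_ids.shape[1])  # prefill with the 5 prompt tokens
position_ids = position_ids[:, -1:] + 1               # position of the next token
cache = fake_decoder_step(cache, 1)                   # one generated token

print(position_ids.tolist(), cache.shape)
```

After the prefill step only the newest token is fed to the decoder, so `position_ids` shrinks to a single column while the cache's sequence axis keeps growing.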
 ### Model optimizations