Errors with quantized model
I am using the following quantization method:
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True
)
model = AutoModelForVision2Seq.from_pretrained("ibm-granite/granite-vision-3.1-2b-preview", quantization_config=bnb_config)
During generation, I get an error:
/usr/local/lib/python3.11/dist-packages/torch/nn/functional.py in multi_head_attention_forward(query, key, value, embed_dim_to_check, num_heads, in_proj_weight, in_proj_bias, bias_k, bias_v, add_zero_attn, dropout_p, out_proj_weight, out_proj_bias, training, key_padding_mask, need_weights, attn_mask, use_separate_proj_weight, q_proj_weight, k_proj_weight, v_proj_weight, static_k, static_v, average_attn_weights, is_causal)
6249 attn_output.transpose(0, 1).contiguous().view(tgt_len * bsz, embed_dim)
6250 )
-> 6251 attn_output = linear(attn_output, out_proj_weight, out_proj_bias)
6252 attn_output = attn_output.view(tgt_len, bsz, attn_output.size(1))
6253
RuntimeError: self and mat2 must have the same dtype, but got Half and Byte
It works fine without quantization. However, quantization is useful during fine-tuning. Could you please suggest how to make it work?
Thank you
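The `Half and Byte` in the message suggests that a float16 activation met a weight still stored in the packed uint8 form that bitsandbytes uses for 4-bit layers. One way to narrow this down is to list which parameters of the loaded model ended up as uint8. The helper below is a minimal sketch: `find_packed_weights` is a hypothetical name, not part of transformers or bitsandbytes, and it assumes that vision-encoder parameter names contain `vision_tower`.

```python
from typing import Iterable, List, Tuple


def find_packed_weights(
    named_dtypes: Iterable[Tuple[str, object]],
    name_filter: str = "vision_tower",
) -> List[str]:
    """Return names of uint8-stored parameters whose name contains name_filter."""
    packed = []
    for name, dtype in named_dtypes:
        # Compare on the string form so the helper has no hard torch dependency.
        if name_filter in name and str(dtype) == "torch.uint8":
            packed.append(name)
    return packed


# Usage with a model that was loaded with a quantization_config:
#   pairs = ((name, p.dtype) for name, p in model.named_parameters())
#   for name in find_packed_weights(pairs):
#       print(name)
```

If the list is non-empty for the vision encoder, those layers were quantized too, which is consistent with the dtype mismatch in the traceback.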
Thank you for raising this issue.
We managed to reproduce the error and are currently investigating.
There's an issue with the quantization of the vision encoder.
Quantizing with the following config should work:
import torch
from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_skip_modules=["vision_tower", "lm_head"],  # Skip problematic modules
    llm_int8_enable_fp32_cpu_offload=True
)
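For completeness, here is a sketch of how the config above could be passed to `from_pretrained`. The settings are copied from the config; `load_quantized_model` and `QUANT_KWARGS` are hypothetical names, and `device_map="auto"` is an assumption that is not part of the original suggestion. It also assumes a transformers version that still provides `AutoModelForVision2Seq`, as used earlier in this thread.

```python
MODEL_ID = "ibm-granite/granite-vision-3.1-2b-preview"

# Settings copied from the config above. The compute dtype is given as a string,
# which BitsAndBytesConfig resolves to the matching torch dtype.
QUANT_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_compute_dtype": "bfloat16",
    "llm_int8_skip_modules": ["vision_tower", "lm_head"],
    "llm_int8_enable_fp32_cpu_offload": True,
}


def load_quantized_model(model_id: str = MODEL_ID):
    """Load the model in 4-bit while leaving the skipped modules unquantized."""
    # Imported lazily so the settings can be inspected without the GPU stack.
    from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(**QUANT_KWARGS)
    return AutoModelForVision2Seq.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",  # assumption: let accelerate place the weights
    )


# Usage (needs a CUDA GPU with bitsandbytes and accelerate installed):
#   model = load_quantized_model()
#   print(model.get_memory_footprint())
```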
Thank you, I will give it a try. I am basically trying to reduce the model size so that I can fine-tune it on an A100 GPU.
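As a rough guide to how much the weights shrink when `vision_tower` stays unquantized, the arithmetic can be sketched as follows. The parameter counts are hypothetical placeholders for illustration, not measured values for this model. The estimate covers weights only: it ignores the skipped `lm_head`, quantization constants, activations, gradients, and optimizer state.

```python
def weight_gib(num_params: float, bits_per_param: float) -> float:
    """Memory taken by the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Hypothetical parameter counts, for illustration only (not measured).
VISION_PARAMS = 0.4e9    # kept in 16-bit because "vision_tower" is skipped
LANGUAGE_PARAMS = 2.5e9  # quantized to 4-bit

full_16bit = weight_gib(VISION_PARAMS + LANGUAGE_PARAMS, 16)
mixed = weight_gib(VISION_PARAMS, 16) + weight_gib(LANGUAGE_PARAMS, 4)

print(f"all weights in 16-bit:   {full_16bit:.2f} GiB")
print(f"4-bit LM, 16-bit vision: {mixed:.2f} GiB")
```

During fine-tuning, activations and optimizer state often dominate, so the actual saving depends on the training setup as well.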
Check out the example here:
https://huggingface.co/learn/cookbook/en/fine_tuning_granite_vision_sft_trl
I still need to push the quantization fix there, but full fine-tuning works on an A100.
It works, thank you! This is very helpful.