---
language: en
license: apache-2.0
tags:
- vision
- vqa
- 16bit
- quantized
---

# PaliGemma-3b-ft-vizwizvqa-224 (16-bit Quantized)

This is a 16-bit quantized version of `google/paligemma-3b-ft-vizwizvqa-224`, a PaliGemma model fine-tuned for visual question answering on the VizWiz dataset.

## Usage

```python
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image

processor = AutoProcessor.from_pretrained("akazen/paligemma-3b-ft-vizwizvqa-16bit")
model = AutoModelForImageTextToText.from_pretrained(
    "akazen/paligemma-3b-ft-vizwizvqa-16bit",
    device_map="auto",
)

# Load an image and build the VQA prompt
image = Image.open("your_image.jpg").convert("RGB")
question = "What's in this image?"
prompt = f"<image>\nQuestion: {question}\nAnswer:"

# Run the model and decode the output
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
answer = processor.decode(outputs[0], skip_special_tokens=True)
```
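
Note that the sequence returned by `generate` typically includes the prompt tokens, so `answer` above contains the prompt text followed by the model's answer. If you only want the answer itself, a minimal sketch (continuing from the example above) is to drop the prompt tokens before decoding:

```python
# Continues from the example above: keep only the newly generated tokens.
input_len = inputs["input_ids"].shape[-1]
answer_only = processor.decode(outputs[0][input_len:], skip_special_tokens=True)
print(answer_only.strip())
```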