Multi-image?

#2
by pbarker - opened

Does this support multi-image inputs?

Google org

Hi @pbarker ,

Yes, it supports multi-image inputs. For more details, please refer to this documentation.
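
For quick reference, here is a minimal sketch of the two usual calling patterns: a batch of single-image prompts versus several images attached to one prompt. This is illustrative only and assumes a recent transformers release; the "caption en" task prompt and the nested-list image argument are illustrative choices rather than guaranteed API behaviour, so please double-check against the documentation above.

from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
from transformers.image_utils import load_image
import torch

model_id = "google/paligemma2-10b-pt-448"
processor = PaliGemmaProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

img_a = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg")
img_b = load_image("https://upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Motorboat_at_Kankaria_lake.JPG/1280px-Motorboat_at_Kankaria_lake.JPG")

# Pattern 1: a batch of single-image prompts (one prompt per image).
# Shown only for the calling convention; not used further below.
batch = processor(text=["caption en", "caption en"], images=[img_a, img_b],
                  padding="longest", return_tensors="pt").to(torch.bfloat16).to(model.device)

# Pattern 2: several images attached to a single prompt (nested list).
multi = processor(text="caption en", images=[[img_a, img_b]],
                  return_tensors="pt").to(torch.bfloat16).to(model.device)

with torch.inference_mode():
    out = model.generate(**multi, max_new_tokens=50, do_sample=False)
print(processor.decode(out[0][multi["input_ids"].shape[-1]:], skip_special_tokens=True))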

Thank you.

pbarker changed discussion status to closed

Actually sorry, this doesn't seem to work:

from transformers import (
    PaliGemmaProcessor,
    PaliGemmaForConditionalGeneration,
)
from transformers.image_utils import load_image
import torch

model_id = "google/paligemma2-10b-pt-448"

url1 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
image1 = load_image(url1)

url2 = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Motorboat_at_Kankaria_lake.JPG/1280px-Motorboat_at_Kankaria_lake.JPG"
image2 = load_image(url2)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto").eval()
processor = PaliGemmaProcessor.from_pretrained(model_id)

# Single text prompt covering both images
prompt = "Describe these images in detail"
model_inputs = processor(text=prompt, images=[[image1, image2]], return_tensors="pt").to(torch.bfloat16).to(model.device)
input_len = model_inputs["input_ids"].shape[-1]

# print("model_inputs: ", model_inputs)

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print("result: ", decoded)

This only outputs:

result:  Image: A boat in the water

Are there any other examples of multi-image? Maybe we are missing something?

pbarker changed discussion status to open
Google org

Hi @pbarker,

Apologies for the delay. The model must be fine-tuned on a dataset specifically designed for multi-image reasoning, such as the Natural Language for Visual Reasoning dataset. These fine-tuned checkpoints teach the model to compare and contrast images, understand relationships between them, or answer questions that require context from more than one image.
You can take the base PaliGemma model and fine-tune it on a multi-image dataset tailored to your specific use case. This gives you full control over the model's final capabilities.
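
For illustration only, here is a rough sketch of what such a fine-tuning loop could look like. The dataset iterable, its column names, and the hyperparameters below are placeholders, and in practice you would likely freeze parts of the model or use parameter-efficient methods rather than the plain full-parameter loop shown here.

import torch
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor

model_id = "google/paligemma2-10b-pt-448"
processor = PaliGemmaProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# `examples` stands in for a hypothetical multi-image dataset: each item has
# two PIL images, a question about the pair, and the expected answer.
for example in examples:
    inputs = processor(
        text=example["question"],
        images=[[example["image_1"], example["image_2"]]],  # both images attached to one prompt
        suffix=example["answer"],   # the suffix is tokenized into the labels used for the loss
        return_tensors="pt",
    ).to(torch.bfloat16).to(model.device)

    loss = model(**inputs).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()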

Kindly refer to this documentation, which clarifies that fine-tuning is required for specific tasks, including those that involve multiple images. It also provides details on the model's architecture and intended use cases.

Thank you.
