Multi-image?

#2
by pbarker - opened

Does this support multi-image inputs?

Google org

Hi @pbarker ,

Yes, it supports multi-image inputs. For more details, please refer to this documentation.
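
For quick reference, here is a minimal sketch of the two usual calling patterns: a batch of single-image prompts versus several images attached to one prompt. This is illustrative only and assumes a recent transformers release; the "caption en" task prompt and the nested-list image argument are illustrative choices rather than guaranteed API behaviour, so please double-check against the documentation above.

from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
from transformers.image_utils import load_image
import torch

model_id = "google/paligemma2-10b-pt-448"
processor = PaliGemmaProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

img_a = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg")
img_b = load_image("https://upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Motorboat_at_Kankaria_lake.JPG/1280px-Motorboat_at_Kankaria_lake.JPG")

# Pattern 1: a batch of single-image prompts (one prompt per image).
# Shown only for the calling convention; not used further below.
batch = processor(text=["caption en", "caption en"], images=[img_a, img_b],
                  padding="longest", return_tensors="pt").to(torch.bfloat16).to(model.device)

# Pattern 2: several images attached to a single prompt (nested list).
multi = processor(text="caption en", images=[[img_a, img_b]],
                  return_tensors="pt").to(torch.bfloat16).to(model.device)

with torch.inference_mode():
    out = model.generate(**multi, max_new_tokens=50, do_sample=False)
print(processor.decode(out[0][multi["input_ids"].shape[-1]:], skip_special_tokens=True))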

Thank you.

pbarker changed discussion status to closed

Actually sorry, this doesn't seem to work:

from transformers import (
    PaliGemmaProcessor,
    PaliGemmaForConditionalGeneration,
)
from transformers.image_utils import load_image
import torch

model_id = "google/paligemma2-10b-pt-448"

url1 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
image1 = load_image(url1)

url2 = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Motorboat_at_Kankaria_lake.JPG/1280px-Motorboat_at_Kankaria_lake.JPG"
image2 = load_image(url2)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto").eval()
processor = PaliGemmaProcessor.from_pretrained(model_id)

# Single text prompt covering both images
prompt = "Describe these images in detail"
model_inputs = processor(text=prompt, images=[[image1, image2]], return_tensors="pt").to(torch.bfloat16).to(model.device)
input_len = model_inputs["input_ids"].shape[-1]

# print("model_inputs: ", model_inputs)

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print("result: ", decoded)

This only outputs:

result:  Image: A boat in the water

Are there any other examples of multi-image? Maybe we are missing something?

pbarker changed discussion status to open
Google org

Hi @pbarker,

Apologies for the delay. The model must be fine-tuned on a dataset specifically designed for multi-image reasoning, such as the Natural Language for Visual Reasoning dataset. These fine-tuned checkpoints teach the model to compare and contrast images, understand relationships between them, or answer questions that require context from more than one image.
You can take the base PaliGemma model and fine-tune it on a multi-image dataset tailored to your specific use case. This gives you full control over the model's final capabilities.
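
For illustration only, here is a rough sketch of what such a fine-tuning loop could look like. The dataset iterable, its column names, and the hyperparameters below are placeholders, and in practice you would likely freeze parts of the model or use parameter-efficient methods rather than the plain full-parameter loop shown here.

import torch
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor

model_id = "google/paligemma2-10b-pt-448"
processor = PaliGemmaProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# `examples` stands in for a hypothetical multi-image dataset: each item has
# two PIL images, a question about the pair, and the expected answer.
for example in examples:
    inputs = processor(
        text=example["question"],
        images=[[example["image_1"], example["image_2"]]],  # both images attached to one prompt
        suffix=example["answer"],   # the suffix is tokenized into the labels used for the loss
        return_tensors="pt",
    ).to(torch.bfloat16).to(model.device)

    loss = model(**inputs).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()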

Kindly refer to this documentation, which clarifies that fine-tuning is required for specific tasks, including those that involve multiple images. It also provides details on the model's architecture and intended use cases.

Thank you.
