Libraries

How to use yali30/findingdory-qwen2.5-VL-3B-finetuned with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="yali30/findingdory-qwen2.5-VL-3B-finetuned")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("yali30/findingdory-qwen2.5-VL-3B-finetuned")
model = AutoModelForImageTextToText.from_pretrained("yali30/findingdory-qwen2.5-VL-3B-finetuned")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

vLLM

How to use yali30/findingdory-qwen2.5-VL-3B-finetuned with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "yali30/findingdory-qwen2.5-VL-3B-finetuned"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "yali30/findingdory-qwen2.5-VL-3B-finetuned",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
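
The curl call above sends a single image, while this checkpoint is meant to see a sequence of frames. Below is a minimal Python sketch of the same OpenAI-compatible request with one `image_url` entry per frame. The helper names (`build_payload`, `query_server`), the frames-before-instruction ordering, and the `max_tokens` default are assumptions for illustration, not the prompt format used in training, and the server may need to be configured to accept that many images per request.

```python
import json
import urllib.request

MODEL_ID = "yali30/findingdory-qwen2.5-VL-3B-finetuned"


def build_payload(frame_urls, instruction, max_tokens=64):
    """Build a chat-completions request: one image entry per frame, then the instruction.

    The ordering (frames first, text last) is an assumption, not a documented format.
    """
    content = [{"type": "image_url", "image_url": {"url": url}} for url in frame_urls]
    content.append({"type": "text", "text": instruction})
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }


def query_server(payload, base_url="http://localhost:8000"):
    """POST the payload to a running OpenAI-compatible server and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running, `query_server(build_payload(urls, "navigate to the object you interacted with immediately after the mug"))` would return the model's raw text reply, where `urls` is a hypothetical list of frame image URLs.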

Use Docker

docker model run hf.co/yali30/findingdory-qwen2.5-VL-3B-finetuned

SGLang

How to use yali30/findingdory-qwen2.5-VL-3B-finetuned with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "yali30/findingdory-qwen2.5-VL-3B-finetuned" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "yali30/findingdory-qwen2.5-VL-3B-finetuned",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "yali30/findingdory-qwen2.5-VL-3B-finetuned" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use yali30/findingdory-qwen2.5-VL-3B-finetuned with Docker Model Runner:
```
docker model run hf.co/yali30/findingdory-qwen2.5-VL-3B-finetuned
```

FindingDory: A Benchmark to Evaluate Memory in Embodied Agents

Karmesh Yadav*, Yusuf Ali*, Gunshi Gupta, Yarin Gal, Zsolt Kira

Current vision-language models (VLMs) struggle with long-term memory in embodied tasks. To address this, we introduce FindingDory, a benchmark in Habitat that evaluates memory-based reasoning across 60 long-horizon tasks.

In this repo, we release a Qwen2.5-VL-3B-Instruct checkpoint fine-tuned on the training split of FindingDory. The model takes as input image frames from a video that the agent collected earlier, subsampled to 96 frames. It outputs a frame index (or a set of indices) pointing to the image(s) in the agent’s history that satisfy the task instruction (e.g. “navigate to the object you interacted with immediately after the mug”).
At deployment, the image corresponding to the predicted index is fed into a low-level navigation policy to complete the embodied task.
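
The input and output handling described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: uniform (evenly spaced) subsampling, 0-based indices, and a free-text reply containing integers are all assumptions, and the function names are hypothetical.

```python
import re


def subsample_indices(num_frames: int, target: int = 96) -> list[int]:
    """Pick up to `target` evenly spaced frame indices from a video of `num_frames` frames.

    Uniform spacing is an assumption; the card only says the video is subsampled to 96 frames.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if num_frames <= target:
        return list(range(num_frames))
    # Evenly spaced positions that include the first and the last frame.
    step = (num_frames - 1) / (target - 1)
    return [round(i * step) for i in range(target)]


def parse_frame_indices(text: str, num_inputs: int) -> list[int]:
    """Extract integers from the model's reply and keep those that are valid input positions.

    Assumes 0-based positions into the subsampled frame list.
    """
    found = [int(token) for token in re.findall(r"\d+", text)]
    return [index for index in found if 0 <= index < num_inputs]


def to_original_frames(predicted: list[int], kept: list[int]) -> list[int]:
    """Map predicted positions in the subsampled list back to original video frame numbers."""
    return [kept[index] for index in predicted]
```

For a 1000-frame episode, `kept = subsample_indices(1000)` selects the frames shown to the model, and `to_original_frames(parse_frame_indices(reply, len(kept)), kept)` recovers which original frames a reply refers to. The selected image would then be handed to the navigation policy.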

🏋️ Training details

| Property | Value |
|---|---|
| Epochs | 5 (12,840 total training steps) |
| Effective batch | 32 |
| LR schedule | Cosine (LR = 5e-6, warmup ratio = 0.1) |
| Max pixels | 360 × 420 |
| Compute | 8 × A40 48 GB for ~84 hours |
| Input frames | 96 images (~10k tokens) |
| Optimiser | AdamW (β₁ = 0.9, β₂ = 0.95) |
| Best checkpoint | Step 8800 |
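
As a rough sanity check on these numbers, the sketch below derives the steps per epoch, the warmup length, and the learning rate at a given step. Linear warmup and cosine decay to zero are assumptions (the table only says "Cosine" with a warmup ratio), and the samples-per-epoch figure assumes every step consumes a full effective batch.

```python
import math

# Values taken from the training details table.
TOTAL_STEPS = 12840
EPOCHS = 5
EFFECTIVE_BATCH = 32
PEAK_LR = 5e-6
WARMUP_RATIO = 0.1
BEST_STEP = 8800

steps_per_epoch = TOTAL_STEPS // EPOCHS                 # 2568
samples_per_epoch = steps_per_epoch * EFFECTIVE_BATCH   # 82176, assuming full batches
warmup_steps = round(TOTAL_STEPS * WARMUP_RATIO)        # 1284
best_epoch = BEST_STEP / steps_per_epoch                # roughly 3.4 epochs in


def lr_at(step: int) -> float:
    """Learning rate at `step`, assuming linear warmup then cosine decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / (TOTAL_STEPS - warmup_steps)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions, the best checkpoint at step 8800 falls a little under halfway through the fourth epoch, well into the decay phase of the schedule.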

📊 Evaluation

We compare our fine-tuned FindingDory-Qwen2.5-VL-3B-SFT checkpoint against other models below:

| Model | High-level Success Rate | Notes |
|---|---|---|
| FindingDory-Qwen2.5-VL-3B-SFT | 52.4% | ours |
| Base Qwen2.5-VL-7B-Instruct | 15.1% | zero-shot |
| Gemma3-12B-it | 13.2% | zero-shot |
| GPT-4o | 27.3% | zero-shot |
| Gemini-2.0-Flash | 25.4% | zero-shot |

Check out Fig. 2 in the paper for more details.

📄 Citation

@article{yadav2025findingdory,
  title     = {FindingDory: A Benchmark to Evaluate Memory in Embodied Agents},
  author    = {Yadav, Karmesh and Ali, Yusuf and Gupta, Gunshi and Gal, Yarin and Kira, Zsolt},
  journal   = {arXiv preprint arXiv:2506.15635},
  year      = {2025}
}

Downloads last month: 105

Model size: 4B params (Safetensors, tensor type BF16)

Model tree for yali30/findingdory-qwen2.5-VL-3B-finetuned

Base model: Qwen/Qwen2.5-VL-3B-Instruct
Finetuned: this model

Paper for yali30/findingdory-qwen2.5-VL-3B-finetuned

FindingDory: A Benchmark to Evaluate Memory in Embodied Agents (arXiv 2506.15635, published Jun 18, 2025)