- Transformers
How to use yali30/findingdory-qwen2.5-VL-3B-finetuned with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="yali30/findingdory-qwen2.5-VL-3B-finetuned")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)
```

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("yali30/findingdory-qwen2.5-VL-3B-finetuned")
model = AutoModelForImageTextToText.from_pretrained("yali30/findingdory-qwen2.5-VL-3B-finetuned")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens (skip the prompt)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
- vLLM
How to use yali30/findingdory-qwen2.5-VL-3B-finetuned with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "yali30/findingdory-qwen2.5-VL-3B-finetuned"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "yali30/findingdory-qwen2.5-VL-3B-finetuned",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe this image in one sentence."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
            }
          }
        ]
      }
    ]
  }'
```

Use Docker
```shell
docker model run hf.co/yali30/findingdory-qwen2.5-VL-3B-finetuned
```
- SGLang
How to use yali30/findingdory-qwen2.5-VL-3B-finetuned with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "yali30/findingdory-qwen2.5-VL-3B-finetuned" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "yali30/findingdory-qwen2.5-VL-3B-finetuned",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe this image in one sentence."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
            }
          }
        ]
      }
    ]
  }'
```

Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "yali30/findingdory-qwen2.5-VL-3B-finetuned" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "yali30/findingdory-qwen2.5-VL-3B-finetuned",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe this image in one sentence."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
            }
          }
        ]
      }
    ]
  }'
```

- Docker Model Runner
How to use yali30/findingdory-qwen2.5-VL-3B-finetuned with Docker Model Runner:
```shell
docker model run hf.co/yali30/findingdory-qwen2.5-VL-3B-finetuned
```
FindingDory: A Benchmark to Evaluate Memory in Embodied Agents
Karmesh Yadav*, Yusuf Ali*, Gunshi Gupta, Yarin Gal, Zsolt Kira

Current vision-language models (VLMs) struggle with long-term memory in embodied tasks. To address this, we introduce FindingDory, a benchmark in Habitat that evaluates memory-based reasoning across 60 long-horizon tasks.
In this repo, we release a Qwen2.5-VL-3B-Instruct checkpoint fine-tuned on the training split of FindingDory. The model takes as input image frames from a video previously collected by the agent, subsampled to 96 frames. Its output is a frame index (or a set of indices) pointing to the image in the agent’s history that satisfies the task instruction (e.g. “navigate to the object you interacted with immediately after the mug”).
At deployment, the image corresponding to the predicted index is fed into a low-level navigation policy to complete the embodied task.
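The input/output contract described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the card does not say how the 96 frames are chosen or how indices are formatted in the generated text, so uniform subsampling and a plain "find all integers" parser are assumed here, and the helper names `subsample_indices`, `parse_frame_indices`, and `to_original_frames` are hypothetical.

```python
import re


def subsample_indices(num_frames: int, k: int = 96) -> list[int]:
    """Pick at most k frame indices spread evenly over the episode.

    Assumption: uniform spacing that keeps the first and last frame.
    The card only states that the video is subsampled to 96 frames.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if num_frames <= 0:
        return []
    if num_frames <= k:
        return list(range(num_frames))
    if k == 1:
        return [0]
    # Integer arithmetic with rounding to the nearest frame.
    return [(i * (num_frames - 1) + (k - 1) // 2) // (k - 1) for i in range(k)]


def parse_frame_indices(text: str, k: int = 96) -> list[int]:
    """Extract integers in [0, k) from generated text (assumed output format)."""
    return [int(m) for m in re.findall(r"\d+", text) if int(m) < k]


def to_original_frames(predicted: list[int], kept: list[int]) -> list[int]:
    """Map indices over the subsampled frames back to original frame numbers."""
    return [kept[i] for i in predicted if 0 <= i < len(kept)]


# Example: a 1000-frame episode and a hypothetical model answer.
kept = subsample_indices(1000)
predicted = parse_frame_indices("[12, 40]")
goal_frames = to_original_frames(predicted, kept)
```

The frames in `goal_frames` are the ones that would be handed to the low-level navigation policy; how that policy consumes them is outside the scope of this sketch.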
🏋️ Training details
| Property | Value |
|---|---|
| Epochs | 5 (12,840 total training steps) |
| Effective batch size | 32 |
| LR schedule | Cosine (LR = 5e-6, warmup ratio = 0.1) |
| Max pixels | 360 × 420 |
| Compute | 8 × A40 (48 GB) for ~84 hours |
| Input frames | 96 images (~10k tokens) |
| Optimiser | AdamW (β₁ = 0.9, β₂ = 0.95) |
| Best checkpoint | Step 8,800 |
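
To make the schedule numbers concrete, the sketch below computes the learning rate at a given step and locates the best checkpoint within training. It assumes a linear warmup to the listed LR followed by cosine decay to zero, which is a common convention but is not confirmed by the table; the function name `lr_at` is hypothetical.

```python
import math

# Values taken from the training table.
TOTAL_STEPS = 12840
EPOCHS = 5
PEAK_LR = 5e-6        # assumed to be the peak LR reached after warmup
WARMUP_RATIO = 0.1
BEST_STEP = 8800


def lr_at(step: int,
          total_steps: int = TOTAL_STEPS,
          peak_lr: float = PEAK_LR,
          warmup_ratio: float = WARMUP_RATIO) -> float:
    """Learning rate at `step`, assuming linear warmup then cosine decay to 0."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


warmup_steps = round(TOTAL_STEPS * WARMUP_RATIO)   # steps spent warming up
steps_per_epoch = TOTAL_STEPS // EPOCHS            # optimizer steps per epoch
best_epoch = BEST_STEP / steps_per_epoch           # where the best checkpoint falls
lr_at_best = lr_at(BEST_STEP)                      # LR when that checkpoint was saved
```

Under these assumptions the best checkpoint falls partway through the fourth epoch, when the learning rate has already decayed well below its peak.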
📊 Evaluation
We compare the performance of our finetuned FindingDory-Qwen2.5-VL-3B-SFT checkpoint against other models below:
| Model | High-level Success Rate | Notes |
|---|---|---|
| FindingDory-Qwen2.5-VL-3B-SFT | 52.4% | ours |
| Base Qwen2.5-VL-7B-Instruct | 15.1% | zero-shot |
| Gemma3-12B-it | 13.2% | zero-shot |
| GPT-4o | 27.3% | zero-shot |
| Gemini-2.0-Flash | 25.4% | zero-shot |
Check out Fig. 2 in the paper for more details.
📄 Citation
```bibtex
@article{yadav2025findingdory,
  title   = {FindingDory: A Benchmark to Evaluate Memory in Embodied Agents},
  author  = {Yadav, Karmesh and Ali, Yusuf and Gupta, Gunshi and Gal, Yarin and Kira, Zsolt},
  journal = {arXiv preprint arXiv:2506.15635},
  year    = {2025}
}
```