---
datasets:
- yali30/findingdory
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- habitat
- embodied-ai
- memory
---
FindingDory: A Benchmark to Evaluate Memory in Embodied Agents
Karmesh Yadav*,
Yusuf Ali*,
Gunshi Gupta,
Yarin Gal,
Zsolt Kira
Current vision-language models (VLMs) struggle with long-term memory in embodied tasks. To address this, we introduce **FindingDory**, a benchmark in Habitat that evaluates memory-based reasoning across 60 long-horizon tasks.
In this repo, we release a **Qwen2.5-VL-3B-Instruct** checkpoint fine-tuned on the training split of **FindingDory**. The model takes as input image frames from a video the agent collected earlier, subsampled to 96 frames. It outputs a **frame index** (or a set of indices) pointing to the image(s) in the agent’s history that satisfy the task instruction (e.g. “navigate to the object you interacted with _immediately after_ the mug”).
At deployment, the image corresponding to the predicted index is fed into a low-level navigation policy to complete the embodied task.
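The input/output contract described above can be sketched in plain Python. This is a minimal illustration, not the released preprocessing code: it assumes uniform temporal subsampling, 0-based frame indices, and a free-text model response containing integers. The helper names (`subsample_indices`, `parse_frame_indices`, `to_original_frames`) are hypothetical.

```python
import re

NUM_FRAMES = 96  # number of frames given to the model, per the description above


def subsample_indices(total_frames, num_frames=NUM_FRAMES):
    """Return evenly spaced frame indices (assumption: uniform subsampling)."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if total_frames <= num_frames:
        return list(range(total_frames))
    # Spread num_frames picks evenly from the first to the last frame.
    step = (total_frames - 1) / (num_frames - 1)
    return [round(i * step) for i in range(num_frames)]


def parse_frame_indices(model_output, num_frames=NUM_FRAMES):
    """Pull integers out of the model's text and keep those in range.

    Assumption: the response is free text with 0-based indices; the actual
    output format of the checkpoint may differ.
    """
    found = [int(tok) for tok in re.findall(r"\d+", model_output)]
    return [i for i in found if 0 <= i < num_frames]


def to_original_frames(predicted, kept):
    """Map indices over the subsampled clip back to original video frames."""
    return [kept[i] for i in predicted if i < len(kept)]


kept = subsample_indices(total_frames=1000)
predicted = parse_frame_indices("Frames: 12, 40 and 200")
goal_frames = to_original_frames(predicted, kept)
```

Here `goal_frames` holds the original-video frame numbers whose images would be handed to the navigation policy. The out-of-range value `200` is dropped rather than clamped, so a malformed prediction does not silently select the wrong frame.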
🏋️ Training details
| Property | Value |
| -------- | ----- |
| Epochs | 5 (Total training steps 12840) |
| Effective batch | 32 |
| LR schedule | Cosine (LR=5e-6, Warmup ratio=0.1) |
| Max pixels | 360 × 420 |
| Compute | 8 × A40 48 GB for ~84 hours |
| Input frames | 96 Images (~10k tokens) |
| Optimiser | AdamW (β₁ = 0.9, β₂ = 0.95) |
| Best checkpoint | 8800 Steps |
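
To make the schedule in the table concrete, the sketch below computes the learning rate at any optimizer step. It assumes linear warmup from zero followed by a half-cosine decay to zero, which is a common form of "cosine with warmup"; the table does not state the warmup shape or the final learning rate, so treat both as assumptions. The function name `lr_at` is hypothetical.

```python
import math

PEAK_LR = 5e-6       # from the table
TOTAL_STEPS = 12840  # from the table
WARMUP_RATIO = 0.1   # from the table
EPOCHS = 5           # from the table

WARMUP_STEPS = round(TOTAL_STEPS * WARMUP_RATIO)  # 1284 steps
STEPS_PER_EPOCH = TOTAL_STEPS // EPOCHS           # 2568 steps


def lr_at(step):
    """Learning rate at a given optimizer step.

    Assumptions: linear warmup from 0, then cosine decay to 0.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))


# Learning rate at the step of the best checkpoint reported in the table.
best_checkpoint_lr = lr_at(8800)
```

Under these assumptions, and if steps are split evenly across epochs, the best checkpoint (step 8800) falls in the fourth epoch, well into the decay phase of the schedule.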
📊 Evaluation
We compare our fine-tuned `FindingDory-Qwen2.5-VL-3B-SFT` checkpoint against other models below:
| Model | High-level Success Rate | Notes |
| ----- | ----------------------- | ----- |
| FindingDory-Qwen2.5-VL-3B-SFT | 52.4% | ours |
| Base Qwen2.5-VL-7B-Instruct | 15.1% | zero-shot |
| Gemma3-12B-it | 13.2% | zero-shot |
| GPT-4o | 27.3% | zero-shot |
| Gemini-2.0-Flash | 25.4% | zero-shot |
Check out Fig. 2 in the paper for more details.
📄 Citation
```bibtex
@article{yadav2025findingdory,
title = {FindingDory: A Benchmark to Evaluate Memory in Embodied Agents},
author = {Yadav, Karmesh and Ali, Yusuf and Gupta, Gunshi and Gal, Yarin and Kira, Zsolt},
journal = {arXiv preprint arXiv:2506.15635},
year = {2025}
}
```