---
datasets:
- yali30/findingdory
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- habitat
- embodied-ai
- memory
---

# FindingDory: A Benchmark to Evaluate Memory in Embodied Agents

Karmesh Yadav*, Yusuf Ali*, Gunshi Gupta, Yarin Gal, Zsolt Kira

Current vision-language models (VLMs) struggle with long-term memory in embodied tasks. To address this, we introduce **FindingDory**, a benchmark in Habitat that evaluates memory-based reasoning across 60 long-horizon tasks.

In this repo, we release a **Qwen2.5-VL-3B-Instruct** checkpoint trained on the training split of **FindingDory**. The model takes as input image frames from a video previously collected by the agent, subsampled to 96 frames. Its output is a **frame index** (or a set of indices) pointing to the image in the agent’s history that satisfies the task instruction (e.g. “navigate to the object you interacted with _immediately after_ the mug”). At deployment, the image corresponding to the index is fed into a low-level navigation policy to complete the embodied task.

## 🏋️ Training details

| Property | Value |
| -------- | ----- |
| Epochs | 5 (12840 total training steps) |
| Effective batch size | 32 |
| LR schedule | Cosine (LR = 5e-6, warmup ratio = 0.1) |
| Max pixels | 360 x 420 |
| Compute | 8 × A40 48 GB for ~84 hours |
| Input frames | 96 images (~10k tokens) |
| Optimiser | AdamW (β₁ = 0.9, β₂ = 0.95) |
| Best checkpoint | 8800 steps |

## 📊 Evaluation

We compare the performance of our finetuned `FindingDory-Qwen2.5-VL-3B-SFT` checkpoint against other models below:

| Model | High-level Success Rate | Notes |
| ----- | ----------------------- | ----- |
| FindingDory-Qwen2.5-VL-3B-SFT | 52.4% | ours |
| Base Qwen2.5-VL-7B-Instruct | 15.1% | zero-shot |
| Gemma3-12B-it | 13.2% | zero-shot |
| GPT-4o | 27.3% | zero-shot |
| Gemini-2.0-Flash | 25.4% | zero-shot |

See Fig. 2 in the paper for more details.

## 📄 Citation

```
@article{yadav2025findingdory,
  title   = {FindingDory: A Benchmark to Evaluate Memory in Embodied Agents},
  author  = {Yadav, Karmesh and Ali, Yusuf and Gupta, Gunshi and Gal, Yarin and Kira, Zsolt},
  journal = {arXiv preprint arXiv:2506.15635},
  year    = {2025}
}
```
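
As a rough illustration of the input and output handling described in this card (a video subsampled to 96 frames, and a predicted frame index that points back into the agent's history), the sketch below shows one way the bookkeeping could look. It is not the released pipeline: uniform spacing is an assumption (the card only states the frame count), the free-text output format handled by the parser is an assumption, and the helper names `subsample_indices`, `parse_frame_indices`, and `to_original_frames` are hypothetical.

```python
import re
from typing import List

NUM_FRAMES = 96  # from the model card: the history video is subsampled to 96 frames


def subsample_indices(video_len: int, num_frames: int = NUM_FRAMES) -> List[int]:
    """Pick evenly spaced frame positions from a video of `video_len` frames.

    Uniform spacing is an assumption; the card does not say how frames are chosen.
    """
    if video_len <= 0:
        raise ValueError("video_len must be positive")
    if video_len <= num_frames:
        # Short videos are kept whole.
        return list(range(video_len))
    # Spread the picks across [0, video_len - 1], endpoints included.
    step = (video_len - 1) / (num_frames - 1)
    return [round(i * step) for i in range(num_frames)]


def parse_frame_indices(model_output: str, num_frames: int) -> List[int]:
    """Extract integer frame indices from the model's text output.

    The output format is assumed to be free text containing integers;
    values outside [0, num_frames) are dropped.
    """
    found = [int(tok) for tok in re.findall(r"\d+", model_output)]
    return [i for i in found if 0 <= i < num_frames]


def to_original_frames(predicted: List[int], kept: List[int]) -> List[int]:
    """Map indices over the subsampled frames back to original video positions."""
    return [kept[i] for i in predicted]


if __name__ == "__main__":
    kept = subsample_indices(video_len=1000)
    predicted = parse_frame_indices("Frames: 12, 40", num_frames=len(kept))
    # The original-video frames that would be handed to the navigation policy.
    print(to_original_frames(predicted, kept))
```

Mapping the predicted index back to the original video position keeps the goal image handed to the low-level navigation policy independent of how aggressively the history was subsampled.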