3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding
Abstract
3D-R1 enhances 3D scene understanding through a high-quality synthetic dataset, reinforcement learning with GRPO, and dynamic view selection, achieving significant improvements in reasoning and generalization.
Large vision-language models (VLMs) have made significant strides in 2D visual understanding tasks, sparking interest in extending these capabilities to 3D scene understanding. However, current 3D VLMs often struggle with robust reasoning and generalization due to limitations in high-quality spatial data and the static nature of viewpoint assumptions. To address these challenges, we propose 3D-R1, a foundation model that enhances the reasoning capabilities of 3D VLMs. Specifically, we first construct a high-quality synthetic dataset with CoT, named Scene-30K, leveraging existing 3D-VL datasets and a data engine based on Gemini 2.5 Pro. It serves as cold-start initialization data for 3D-R1. Moreover, we leverage RLHF policy such as GRPO in the reinforcement learning training process to enhance reasoning capabilities and introduce three reward functions: a perception reward, a semantic similarity reward and a format reward to maintain detection accuracy and answer semantic precision. Furthermore, we introduce a dynamic view selection strategy that adaptively chooses the most informative perspectives for 3D scene understanding. Extensive experiments demonstrate that 3D-R1 delivers an average improvement of 10% across various 3D scene benchmarks, highlighting its effectiveness in enhancing reasoning and generalization in 3D scene understanding. Code: https://github.com/AIGeeksGroup/3D-R1. Website: https://aigeeksgroup.github.io/3D-R1.
Community
๐ Just released our brand-new open-source 3D foundation model 3D-R1!
3D-R1 is designed to comprehensively enhance scene understanding through reinforcement learning. The entire project is now fully open-sourced.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Scene-R1: Video-Grounded Large Language Models for 3D Scene Reasoning without 3D Annotations (2025)
- Descrip3D: Enhancing Large Language Model-based 3D Scene Understanding with Object-Level Text Descriptions (2025)
- GaussianVLM: Scene-centric 3D Vision-Language Models using Language-aligned Gaussian Splats for Embodied Reasoning and Beyond (2025)
- Argus: Leveraging Multiview Images for Improved 3-D Scene Understanding With Large Language Models. (2025)
- Spatial 3D-LLM: Exploring Spatial Awareness in 3D Vision-Language Models (2025)
- Enhancing Spatial Reasoning in Vision-Language Models via Chain-of-Thought Prompting and Reinforcement Learning (2025)
- 3D-Aware Vision-Language Models Fine-Tuning with Geometric Distillation (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper