3DGraphLLM

3DGraphLLM is a model that combines semantic graphs and large language models for 3D scene understanding. It aims to improve 3D vision-language tasks by explicitly incorporating semantic relationships into a learnable representation of a 3D scene graph, which is then used as input to LLMs.
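To make the core idea concrete, here is a minimal conceptual sketch (not the authors' implementation) of how a scene graph with object nodes and semantic relation edges can be flattened into a sequence of learnable tokens for an LLM. The class name, feature dimensions, and the number of relations kept per object are illustrative assumptions.

```python
# Conceptual sketch: flatten a 3D scene graph into LLM input tokens.
# All names and dimensions are illustrative assumptions, not the official code.
import torch
import torch.nn as nn

class SceneGraphTokenizer(nn.Module):
    def __init__(self, obj_dim=1024, rel_dim=512, llm_dim=4096, k=2):
        super().__init__()
        self.k = k                                    # relations kept per object (nearest neighbors)
        self.obj_proj = nn.Linear(obj_dim, llm_dim)   # object embedding -> LLM token
        self.rel_proj = nn.Linear(rel_dim, llm_dim)   # relation embedding -> LLM token

    def forward(self, obj_feats, rel_feats):
        # obj_feats: (N, obj_dim)    per-object features
        # rel_feats: (N, k, rel_dim) features of relations to k nearest neighbors
        tokens = []
        for i in range(obj_feats.shape[0]):
            tokens.append(self.obj_proj(obj_feats[i]))         # the object itself
            for j in range(self.k):
                tokens.append(self.rel_proj(rel_feats[i, j]))  # its semantic relations
        # (N * (1 + k), llm_dim): a sequence that can be spliced into the LLM prompt
        return torch.stack(tokens)

tokenizer = SceneGraphTokenizer()
seq = tokenizer(torch.randn(5, 1024), torch.randn(5, 2, 512))
print(seq.shape)  # torch.Size([15, 4096])
```

The point this sketch illustrates is that semantic relations become part of the learnable scene representation fed to the LLM, rather than being discarded after the scene graph is built.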

This model was presented in the paper "3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding".

The official code is publicly available at: https://github.com/CognitiveAISystems/3DGraphLLM

Abstract

A 3D scene graph represents a compact scene model by capturing both the objects present and the semantic relationships between them, making it a promising structure for robotic applications. To effectively interact with users, an embodied intelligent agent should be able to answer a wide range of natural language queries about the surrounding 3D environment. Large Language Models (LLMs) are beneficial solutions for user-robot interaction due to their natural language understanding and reasoning abilities. Recent methods for learning scene representations have shown that adapting these representations to the 3D world can significantly improve the quality of LLM responses. However, existing methods typically rely only on geometric information, such as object coordinates, and overlook the rich semantic relationships between objects. In this work, we propose 3DGraphLLM, a method for constructing a learnable representation of a 3D scene graph that explicitly incorporates semantic relationships. This representation is used as input to LLMs for performing 3D vision-language tasks. In our experiments on the popular ScanRefer, Multi3DRefer, ScanQA, SQA3D, and Scan2Cap datasets, we demonstrate that our approach outperforms baselines that do not leverage semantic relationships between objects.

Model Details

We provide our best checkpoint, which uses Llama-3-8B-Instruct as the LLM, Mask3D 3D instance segmentation to obtain scene graph nodes, VL-SAT to encode semantic relations, Uni3D as the 3D object encoder, and DINOv2 as the 2D object encoder.
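As a rough illustration of how these components could fit together, the hedged sketch below fuses per-object 3D (Uni3D) and 2D (DINOv2) features and projects VL-SAT relation embeddings into the LLM embedding space; the fused outputs would play the role of `obj_feats` and `rel_feats` in the sketch from the introduction. Feature dimensions, the neighbor count, and the fusion scheme are assumptions made for illustration; see the official repository for the actual pipeline.

```python
# Hedged sketch of wiring the named encoders together (illustrative only).
import torch
import torch.nn as nn

llm_dim = 4096                                    # Llama-3-8B hidden size
num_objects = 10                                  # e.g., instances proposed by Mask3D
uni3d_feats  = torch.randn(num_objects, 1024)     # assumed per-object 3D features (Uni3D)
dinov2_feats = torch.randn(num_objects, 1024)     # assumed per-object 2D features (DINOv2)
vlsat_rels   = torch.randn(num_objects, 2, 512)   # assumed relation embeddings (VL-SAT), k=2 neighbors

obj_proj = nn.Linear(1024 + 1024, llm_dim)        # fuse 3D + 2D object features into one token
rel_proj = nn.Linear(512, llm_dim)                # map each semantic relation into the LLM space

obj_tokens = obj_proj(torch.cat([uni3d_feats, dinov2_feats], dim=-1))  # (10, llm_dim)
rel_tokens = rel_proj(vlsat_rels)                                      # (10, 2, llm_dim)
```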

Performance

Semantic relations boost LLM performance on 3D Referred Object Grounding and Dense Scene Captioning tasks.

| | ScanRefer | | Multi3DRefer | | Scan2Cap | | ScanQA | | SQA3D |
|:----: |:---------: |:-------: |:------: |:------: |:---------: |:----------: |:------------: |:------: |:-----: |
| | Acc@0.25 | Acc@0.5 | F1@0.25 | F1@0.5 | CIDEr@0.5 | B-4@0.5 | CIDEr | B-4 | EM |
| Chat-Scene | 55.5 | 50.2 | 57.1 | 52.3 | 77.1 | 36.3 | 87.7 | 14.3 | 54.6 |
| 3DGraphLLM Vicuna-1.5 | 58.6 | 53.0 | 61.9 | 57.3 | 79.2 | 34.7 | 91.2 | 13.7 | 55.1 |
| 3DGraphLLM LLAMA3-8B | 62.4 | 56.6 | 64.7 | 59.9 | 81.0 | 36.5 | 88.8 | 15.9 | 55.9 |

Usage

For detailed instructions on environment preparation, downloading LLM backbones, data preprocessing, training, and inference, please refer to the official GitHub repository.

You can run the interactive demo by following the instructions in the GitHub repository, or try the simplified command below:

```bash
bash demo/run_demo.sh
```

This will prompt you to ask different queries about Scene 435 of ScanNet.

Citation

If you find 3DGraphLLM helpful, please consider citing our work as:

@misc{zemskova20243dgraphllm,
      title={3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding}, 
      author={Tatiana Zemskova and Dmitry Yudin},
      year={2024},
      eprint={2412.18450},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.18450}, 
}

Acknowledgement

Thanks to the following open-source projects:

Chat-Scene

Contact

If you have any questions about the project, please open an issue in the GitHub repository or send an email to Tatiana Zemskova.
