SpatialLM: Training Large Language Models for Structured Indoor Modeling
Abstract
SpatialLM, a multimodal large language model, processes 3D point cloud data to generate structured scene understanding outputs, achieving state-of-the-art performance in layout estimation and competitive results in 3D object detection.
SpatialLM is a large language model designed to process 3D point cloud data and generate structured 3D scene understanding outputs. These outputs include architectural elements such as walls, doors, and windows, as well as oriented object boxes with their semantic categories. Unlike previous methods that rely on task-specific network designs, our model adheres to the standard multimodal LLM architecture and is fine-tuned directly from open-source LLMs. To train SpatialLM, we collect a large-scale, high-quality synthetic dataset consisting of the point clouds of 12,328 indoor scenes (54,778 rooms) with ground-truth 3D annotations, and conduct a careful study of various modeling and training decisions. On public benchmarks, our model achieves state-of-the-art performance in layout estimation and competitive results in 3D object detection. These results show a feasible path for enhancing the spatial understanding capabilities of modern LLMs for applications in augmented reality, embodied robotics, and more.
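To make the idea of "structured outputs" concrete, here is a minimal sketch of how a scene made of walls, doors, windows, and oriented object boxes could be represented and flattened into text that a language model might emit. Everything in it is an assumption for illustration: the class names, the fields, and the line-per-entity text format are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class Wall:
    start: Point3D   # one end of the wall's base line (hypothetical field)
    end: Point3D     # other end of the wall's base line
    height: float


@dataclass
class Opening:
    kind: str        # "door" or "window"
    wall_index: int  # index of the wall this opening is attached to
    center: Point3D
    width: float
    height: float


@dataclass
class OrientedBox:
    category: str    # semantic category, e.g. "sofa"
    center: Point3D
    yaw: float       # rotation about the vertical axis, in radians
    size: Point3D    # extents along the box's own axes


@dataclass
class Scene:
    walls: List[Wall]
    openings: List[Opening]
    boxes: List[OrientedBox]


def _fmt(values) -> str:
    """Format numbers with two decimals, comma-separated."""
    return ",".join(f"{v:.2f}" for v in values)


def serialize(scene: Scene) -> str:
    """Flatten a scene into one text line per entity (hypothetical format)."""
    lines = []
    for i, w in enumerate(scene.walls):
        lines.append(f"wall_{i}=Wall({_fmt(w.start + w.end + (w.height,))})")
    for i, o in enumerate(scene.openings):
        params = _fmt(o.center + (o.width, o.height))
        lines.append(f"{o.kind}_{i}={o.kind.capitalize()}(wall_{o.wall_index},{params})")
    for i, b in enumerate(scene.boxes):
        params = _fmt(b.center + (b.yaw,) + b.size)
        lines.append(f"bbox_{i}=Bbox({b.category},{params})")
    return "\n".join(lines)


example = Scene(
    walls=[Wall((0.0, 0.0, 0.0), (4.0, 0.0, 0.0), 2.8)],
    openings=[Opening("door", 0, (1.0, 0.0, 1.0), 0.9, 2.0)],
    boxes=[OrientedBox("sofa", (2.0, 1.5, 0.4), 0.0, (1.8, 0.9, 0.8))],
)
text = serialize(example)
print(text)
```

Representing each entity as a short text line is one plausible way to let an unmodified LLM decoder produce layouts and boxes as ordinary tokens; the actual output format used by SpatialLM may differ.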
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors (2025)
- Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence (2025)
- Masked Point-Entity Contrast for Open-Vocabulary 3D Scene Understanding (2025)
- Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving (2025)
- Struct2D: A Perception-Guided Framework for Spatial Reasoning in Large Multimodal Models (2025)
- Locate 3D: Real-World Object Localization via Self-Supervised Learning in 3D (2025)
- SAB3R: Semantic-Augmented Backbone in 3D Reconstruction (2025)