SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding
Abstract
A data-generation framework that leverages 3D simulators to improve spatial reasoning in multimodal language models through efficient training on simulated data.
Despite impressive high-level video comprehension, multimodal language models struggle with spatial reasoning across time and space. Current spatial training approaches rely on real-world video data, but obtaining diverse footage with precise spatial annotations remains a bottleneck. To alleviate this bottleneck, we present SIMS-V -- a systematic data-generation framework that leverages the privileged information of 3D simulators to create spatially rich video training data for multimodal language models. Using this framework, we investigate which properties of simulated data drive effective real-world transfer through systematic ablations of question types, mixes, and scales. We identify a minimal set of three question categories (metric measurement, perspective-dependent reasoning, and temporal tracking) that proves most effective for developing transferable spatial intelligence, outperforming comprehensive coverage despite using fewer question types. These insights enable highly efficient training: our 7B-parameter video LLM fine-tuned on just 25K simulated examples outperforms its larger 72B baseline and achieves competitive performance with proprietary models on rigorous real-world spatial reasoning benchmarks. Our approach generalizes robustly, maintaining performance on general video understanding while showing substantial improvements on embodied and real-world spatial tasks.
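To make the data-generation idea concrete, below is a minimal, hypothetical sketch of how privileged simulator state (ground-truth object positions, camera poses, and per-frame visibility) could be turned into the three question categories highlighted above. The scene layout, field names, and question templates are illustrative assumptions, not the paper's actual pipeline.

```python
# Minimal sketch (not the authors' code) of generating the three question
# categories from privileged 3D-simulator metadata: metric measurement,
# perspective-dependent reasoning, and temporal tracking.
# All data structures and field names are illustrative assumptions.
import math

# Hypothetical privileged metadata exported alongside a rendered video clip.
scene = {
    "objects": {
        "chair": {"position": (1.2, 0.0, 3.4), "first_visible_frame": 12},
        "lamp":  {"position": (-0.8, 0.0, 2.1), "first_visible_frame": 47},
    },
    # Per-frame camera pose: (x, z) position and yaw in radians.
    "camera": [{"position": (0.0, 0.0), "yaw": 0.0},
               {"position": (0.2, 0.5), "yaw": 0.1}],
}

def metric_question(scene, a, b):
    """Metric measurement: ground-truth Euclidean distance between two objects."""
    pa, pb = scene["objects"][a]["position"], scene["objects"][b]["position"]
    return {"question": f"How far apart are the {a} and the {b}, in meters?",
            "answer": round(math.dist(pa, pb), 1)}

def perspective_question(scene, a, frame):
    """Perspective-dependent reasoning: is the object to the camera's left or right?"""
    cam = scene["camera"][frame]
    ox, _, oz = scene["objects"][a]["position"]
    cx, cz = cam["position"]
    # Rotate the offset into the camera frame; negative x means "to the left".
    dx, dz = ox - cx, oz - cz
    local_x = dx * math.cos(-cam["yaw"]) - dz * math.sin(-cam["yaw"])
    return {"question": f"At frame {frame}, is the {a} to the camera's left or right?",
            "answer": "left" if local_x < 0 else "right"}

def temporal_question(scene, a, b):
    """Temporal tracking: which object enters the camera's view first."""
    fa = scene["objects"][a]["first_visible_frame"]
    fb = scene["objects"][b]["first_visible_frame"]
    return {"question": f"Which appears in the video first, the {a} or the {b}?",
            "answer": a if fa < fb else b}

if __name__ == "__main__":
    for item in (metric_question(scene, "chair", "lamp"),
                 perspective_question(scene, "chair", frame=0),
                 temporal_question(scene, "chair", "lamp")):
        print(item)
```

Because every answer is derived from simulator ground truth rather than human annotation, question generation of this kind scales to arbitrary numbers of clips at negligible labeling cost, which is the bottleneck the abstract identifies for real-world footage.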
Community
Librarian Bot (automated): the following similar papers were recommended by the Semantic Scholar API:
- SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models (2025)
- TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility (2025)
- Fleming-VL: Towards Universal Medical Visual Reasoning with Multimodal LLMs (2025)
- See&Trek: Training-Free Spatial Prompting for Multimodal Large Language Model (2025)
- VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception (2025)
- Video-STR: Reinforcing MLLMs in Video Spatio-Temporal Reasoning with Relation Graph (2025)
- SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models (2025)