Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations
Abstract
Concerto, a minimalist model combining 3D self-distillation and 2D-3D joint embedding, achieves superior spatial feature learning and outperforms existing models in scene understanding and open-world perception.
Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition, combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. It outperforms both standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation, in linear probing for 3D scene perception. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP's language space, enabling open-world perception. These results highlight that Concerto emerges spatial representations with superior fine-grained geometric and semantic consistency.
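The linear-probing protocol mentioned above (fitting only a linear classifier on frozen pre-trained features) can be sketched in a few lines. This is an illustrative stand-in, not the authors' pipeline: the per-point features here are synthetic Gaussian clusters rather than real Concerto outputs, and the probe is fit in closed form with ridge regression instead of SGD.

```python
# Minimal sketch of linear probing on frozen features.
# Assumption: synthetic class-dependent features stand in for the
# per-point representations a pre-trained backbone would produce.
import numpy as np

rng = np.random.default_rng(0)
num_points, feat_dim, num_classes = 1000, 64, 5

# Synthetic "frozen" features: one Gaussian cluster per class.
labels = rng.integers(0, num_classes, size=num_points)
centers = rng.normal(size=(num_classes, feat_dim))
features = centers[labels] + 0.3 * rng.normal(size=(num_points, feat_dim))

# Linear probe: ridge regression to one-hot targets (closed form),
# i.e. a single linear layer fit with MSE loss; the backbone stays frozen.
onehot = np.eye(num_classes)[labels]
lam = 1e-2
W = np.linalg.solve(
    features.T @ features + lam * np.eye(feat_dim),
    features.T @ onehot,
)

pred = (features @ W).argmax(axis=1)
accuracy = (pred == labels).mean()
print(f"linear-probe accuracy: {accuracy:.3f}")
```

Because the probe has no nonlinearity, its accuracy directly measures how linearly separable the frozen features are, which is why the paper uses it to compare 2D, 3D, and joint representations.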
Community
TL;DR: Concerto provides a Point Transformer V3 pre-trained with joint 2D-3D self-supervision for 3D point cloud downstream tasks, modified from Sonata.
Homepage: https://pointcept.github.io/Concerto/
Gradio Demo: https://huggingface.co/spaces/Pointcept/Concerto
Inference Code: https://github.com/Pointcept/Concerto
Training Code: https://github.com/Pointcept/Pointcept
Thanks for the great work!
Very cool paper. With world models, splatting, and really anything that operates in 3D space, I feel there is likely a lot of information we can distill from 2D models that pick up some sort of depth information, as evidenced by DINO and other SSL models already being decent depth estimators on their own.
Here is a bite-sized podcast about the work in case anyone wants to listen to an AI overview: https://spotifycreators-web.app.link/e/qfVxqCBJRXb
Thank you
Xiaoyang Wu