FlashVGGT: Efficient and Scalable Visual Geometry Transformers with Compressed Descriptor Attention
Abstract
FlashVGGT uses descriptor-based attention to efficiently perform 3D reconstruction from multi-view images, significantly reducing inference time and improving scalability compared to VGGT.
3D reconstruction from multi-view images is a core challenge in computer vision. Recently, feed-forward methods have emerged as efficient and robust alternatives to traditional per-scene optimization techniques. Among them, state-of-the-art models like the Visual Geometry Grounded Transformer (VGGT) leverage full self-attention over all image tokens to capture global relationships. However, this approach suffers from poor scalability due to the quadratic complexity of self-attention and the large number of tokens generated in long image sequences. In this work, we introduce FlashVGGT, an efficient alternative that addresses this bottleneck through a descriptor-based attention mechanism. Instead of applying dense global attention across all tokens, FlashVGGT compresses spatial information from each frame into a compact set of descriptor tokens. Global attention is then computed as cross-attention between the full set of image tokens and this smaller descriptor set, significantly reducing computational overhead. Moreover, the compactness of the descriptors enables online inference over long sequences via a chunk-recursive mechanism that reuses cached descriptors from previous chunks. Experimental results show that FlashVGGT achieves reconstruction accuracy competitive with VGGT while reducing inference time to just 9.3% of VGGT's for 1,000 images, and scaling efficiently to sequences exceeding 3,000 images. Our project page is available at https://wzpscott.github.io/flashvggt_page/.
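The abstract only outlines the mechanism, so the following PyTorch sketch is a hypothetical illustration, not the authors' implementation. The class name `DescriptorAttention`, the learnable-query pooling used to build per-frame descriptors, the value of `num_descriptors`, and the use of `nn.MultiheadAttention` are all assumptions; the abstract states only that each frame is compressed into a small descriptor set, that global attention becomes cross-attention from all image tokens to that set, and that cached descriptors can be reused across chunks.

```python
# Minimal sketch (assumed, not the authors' code) of descriptor-based global attention.
from typing import Optional

import torch
import torch.nn as nn


class DescriptorAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, num_descriptors: int = 16):
        super().__init__()
        self.num_descriptors = num_descriptors
        # Learnable queries that compress each frame into a few descriptor tokens
        # (one plausible compression scheme; a hypothetical choice here).
        self.desc_queries = nn.Parameter(torch.randn(num_descriptors, dim) * 0.02)
        self.compress = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def compress_frames(self, tokens: torch.Tensor) -> torch.Tensor:
        """tokens: (F, N, D) per-frame image tokens -> (F, K, D) descriptor tokens."""
        F, N, D = tokens.shape
        q = self.desc_queries.unsqueeze(0).expand(F, -1, -1)  # (F, K, D)
        desc, _ = self.compress(q, tokens, tokens)             # pool within each frame
        return desc

    def forward(self, tokens: torch.Tensor,
                cached_desc: Optional[torch.Tensor] = None):
        """tokens: (F, N, D) image tokens for the current chunk.
        cached_desc: (M, D) descriptors carried over from earlier chunks, or None."""
        F, N, D = tokens.shape
        desc = self.compress_frames(tokens)                         # (F, K, D)
        desc_flat = desc.reshape(1, F * self.num_descriptors, D)    # (1, F*K, D)
        if cached_desc is not None:
            # Chunk-recursive inference: reuse descriptors cached from previous
            # chunks so global context grows while attention cost stays small.
            desc_flat = torch.cat([cached_desc.unsqueeze(0), desc_flat], dim=1)
        # Cross-attention: every image token attends only to the compact descriptor
        # set instead of to all F*N tokens, avoiding quadratic cost in sequence length.
        q = tokens.reshape(1, F * N, D)
        out, _ = self.global_attn(q, desc_flat, desc_flat)
        return out.reshape(F, N, D), desc_flat.squeeze(0)
```

A chunked driver could loop over the image sequence in blocks and feed the returned descriptors back in as `cached_desc`, mirroring the cache-and-reuse behaviour described in the abstract; the real model's compression, chunk size, and caching details may differ.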
Community
TLDR: Accelerate VGGT with more efficient global attention for ~10x faster inference on 1K images and scaling to 3K+ images.
The following papers were recommended by the Semantic Scholar API
- SwiftVGGT: A Scalable Visual Geometry Grounded Transformer for Large-Scale Scenes (2025)
- MoRE: 3D Visual Geometry Reconstruction Meets Mixture-of-Experts (2025)
- How Many Tokens Do 3D Point Cloud Transformer Architectures Really Need? (2025)
- HTTM: Head-wise Temporal Token Merging for Faster VGGT (2025)
- Human3R: Everyone Everywhere All at Once (2025)
- Fin3R: Fine-tuning Feed-forward 3D Reconstruction Models via Monocular Knowledge Distillation (2025)
- VGGT4D: Mining Motion Cues in Visual Geometry Transformers for 4D Scene Reconstruction (2025)