MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation
Abstract
Mixture-of-Groups Attention (MoGA) enables efficient long video generation by addressing the quadratic scaling issue of full attention in Diffusion Transformers.
Long video generation with Diffusion Transformers (DiTs) is bottlenecked by the quadratic scaling of full attention with sequence length. Since attention is highly redundant, outputs are dominated by a small subset of query-key pairs. Existing sparse methods rely on blockwise coarse estimation, whose accuracy-efficiency trade-offs are constrained by block size. This paper introduces Mixture-of-Groups Attention (MoGA), an efficient sparse attention that uses a lightweight, learnable token router to precisely match tokens without blockwise estimation. Through semantic-aware routing, MoGA enables effective long-range interactions. As a kernel-free method, MoGA integrates seamlessly with modern attention stacks, including FlashAttention and sequence parallelism. Building on MoGA, we develop an efficient long video generation model that end-to-end produces minute-level, multi-shot, 480p videos at 24 fps, with a context length of approximately 580k. Comprehensive experiments on various video generation tasks validate the effectiveness of our approach.
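The abstract does not give implementation details, so the following PyTorch sketch only illustrates the general idea it describes: a lightweight learnable router assigns tokens to groups, and attention is restricted to tokens within the same group using a standard (Flash-compatible) attention call. The class name MoGASketch, the top-1 routing rule, and all shapes and hyperparameters are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of group-routed sparse attention in the spirit of MoGA.
# All names and the routing rule are assumptions; this is not the official code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoGASketch(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, num_groups: int = 4):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.num_groups = num_groups
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Lightweight learnable router: scores each token against G groups.
        self.router = nn.Linear(dim, num_groups)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Top-1 routing: each token attends only within its assigned group,
        # so cost scales with group sizes rather than the full sequence length.
        group_id = self.router(x).argmax(dim=-1)  # (b, n)

        out = torch.zeros_like(x)
        for bi in range(b):
            for g in range(self.num_groups):
                idx = (group_id[bi] == g).nonzero(as_tuple=True)[0]
                if idx.numel() == 0:
                    continue
                # Dense attention restricted to the tokens of one group.
                qg = q[bi, idx].view(1, -1, self.num_heads, self.head_dim).transpose(1, 2)
                kg = k[bi, idx].view(1, -1, self.num_heads, self.head_dim).transpose(1, 2)
                vg = v[bi, idx].view(1, -1, self.num_heads, self.head_dim).transpose(1, 2)
                og = F.scaled_dot_product_attention(qg, kg, vg)
                out[bi, idx] = og.transpose(1, 2).reshape(-1, d)

        return self.proj(out)


if __name__ == "__main__":
    x = torch.randn(2, 256, 512)
    y = MoGASketch(dim=512, num_heads=8, num_groups=4)(x)
    print(y.shape)  # torch.Size([2, 256, 512])
```

Because each group is processed with an ordinary attention call rather than a custom sparse kernel, this kind of routing is what the abstract means by "kernel-free": it composes with FlashAttention and sequence parallelism without modification.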
Community
Thank you for the report. The official git repository of MoGA is available here: https://github.com/bytedance-fanqie-ai/MoGA
Is the GitHub repo still unavailable? It only contains the README file and 3 png files. Any plans to update it today? Appreciated.
Thanks for your interest. The code is still undergoing the company's review process and may take some time before it is uploaded to the repository.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Mixture of Contexts for Long Video Generation (2025)
- Bidirectional Sparse Attention for Faster Video Diffusion Training (2025)
- Pack and Force Your Memory: Long-form and Consistent Video Generation (2025)
- VideoNSA: Native Sparse Attention Scales Video Understanding (2025)
- MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models (2025)
- LongLive: Real-time Interactive Long Video Generation (2025)
- Dense Video Understanding with Gated Residual Tokenization (2025)
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend