Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
Abstract
Self Forcing, a novel training method for autoregressive video diffusion models, reduces exposure bias and improves generation quality through holistic video-level supervision and efficient caching mechanisms.
We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models. It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must generate sequences conditioned on their own imperfect outputs during inference. Unlike prior methods that denoise future frames based on ground-truth context frames, Self Forcing conditions each frame's generation on previously self-generated outputs by performing autoregressive rollout with key-value (KV) caching during training. This strategy enables supervision through a holistic loss at the video level that directly evaluates the quality of the entire generated sequence, rather than relying solely on traditional frame-wise objectives. To ensure training efficiency, we employ a few-step diffusion model along with a stochastic gradient truncation strategy, effectively balancing computational cost and performance. We further introduce a rolling KV cache mechanism that enables efficient autoregressive video extrapolation. Extensive experiments demonstrate that our approach achieves real-time streaming video generation with sub-second latency on a single GPU, while matching or even surpassing the generation quality of significantly slower and non-causal diffusion models. Project website: http://self-forcing.github.io/
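To make the training procedure described in the abstract concrete, below is a minimal sketch of one Self Forcing training step: an autoregressive rollout with KV caching, a few-step diffusion denoiser, stochastic gradient truncation, and a holistic video-level loss on the self-generated clip. The interfaces used here (`init_kv_cache`, `denoise`, `update_kv_cache`, `video_loss`) are hypothetical placeholders under these assumptions, not the authors' actual API.

```python
import random
import torch

def self_forcing_step(model, video_loss, optimizer, prompt,
                      num_frames=16, denoise_steps=4):
    """One hypothetical Self Forcing training step (sketch, not the official code)."""
    kv_cache = model.init_kv_cache()            # reused across frames, as at inference
    # Stochastic gradient truncation: backpropagate through the denoising chain
    # of a single randomly chosen frame to keep compute and memory bounded.
    grad_frame = random.randrange(num_frames)
    frames = []
    for t in range(num_frames):
        grad_ctx = torch.enable_grad() if t == grad_frame else torch.no_grad()
        with grad_ctx:
            x = model.sample_noise()            # pure-noise initialization for frame t
            for _ in range(denoise_steps):      # few-step denoiser
                x = model.denoise(x, prompt, kv_cache)
            model.update_kv_cache(kv_cache, x)  # later frames condition on own outputs
        frames.append(x)
    video = torch.stack(frames, dim=1)          # (batch, time, channels, height, width)
    loss = video_loss(video, prompt)            # holistic loss on the entire rollout
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference, the same rollout runs without gradients, and the rolling (fixed-length) KV cache mentioned in the abstract lets generation extend beyond the training horizon.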
Community
Introducing Self-Forcing, a new paradigm for training autoregressive diffusion models that enables high-quality, real-time video generation!
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Playing with Transformer at 30+ FPS via Next-Frame Diffusion (2025)
- Long-Context State-Space Video World Models (2025)
- TalkingMachines: Real-Time Audio-Driven FaceTime-Style Video via Autoregressive Diffusion Models (2025)
- Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion (2025)
- Generative Pre-trained Autoregressive Diffusion Transformer (2025)
- MAGI-1: Autoregressive Video Generation at Scale (2025)
- Minute-Long Videos with Dual Parallelisms (2025)