Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities • Paper • 2503.03983 • Published Mar 6 • 25
Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control • Paper • 2503.14492 • Published Mar 18 • 20
Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion • Paper • 2506.08009 • Published Jun 9 • 26
Intuitive physics understanding emerges from self-supervised pretraining on natural videos • Paper • 2502.11831 • Published Feb 17 • 20
Perception Encoder: The best visual embeddings are not at the output of the network • Paper • 2504.13181 • Published Apr 17 • 35
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features • Paper • 2502.14786 • Published Feb 20 • 146
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model • Paper • 2502.10248 • Published Feb 14 • 56