MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining Paper • 2505.07608 • Published May 12, 2025 • 81
Next Block Prediction: Video Generation via Semi-Autoregressive Modeling Paper • 2502.07737 • Published Feb 11, 2025 • 9
Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation Paper • 2503.16430 • Published Mar 20, 2025 • 34
Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey Paper • 2412.18619 • Published Dec 16, 2024 • 59
VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models Paper • 2311.17404 • Published Nov 29, 2023 • 1
Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition Paper • 2304.04704 • Published Apr 10, 2023
Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond Paper • 2310.02071 • Published Oct 3, 2023 • 4
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding Paper • 2310.19060 • Published Oct 29, 2023
DCA: Diversified Co-Attention towards Informative Live Video Commenting Paper • 1911.02739 • Published Nov 7, 2019
PCA-Bench: Evaluating Multimodal Large Language Models in Perception-Cognition-Action Chain Paper • 2402.15527 • Published Feb 21, 2024
LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? Paper • 2404.10763 • Published Apr 16, 2024
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis Paper • 2405.21075 • Published May 31, 2024 • 25
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding Paper • 2312.02051 • Published Dec 4, 2023 • 1
M³IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning Paper • 2306.04387 • Published Jun 7, 2023 • 8