Pixels, Patterns, but No Poetry: To See The World like Humans • Paper • 2507.16863 • Published 9 days ago • 62
VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning? • Paper • 2505.23359 • Published May 29 • 40
VideoMamba: State Space Model for Efficient Video Understanding • Paper • 2403.06977 • Published Mar 11, 2024 • 31
TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning • Paper • 2410.19702 • Published Oct 25, 2024
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling • Paper • 2501.00574 • Published Dec 31, 2024 • 6
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling • Paper • 2501.12386 • Published Jan 21 • 1
Online Video Understanding: A Comprehensive Benchmark and Memory-Augmented Method • Paper • 2501.00584 • Published Dec 31, 2024
Fine-grained Video-Text Retrieval: A New Benchmark and Method • Paper • 2501.00513 • Published Dec 31, 2024
VideoEval: Comprehensive Benchmark Suite for Low-Cost Evaluation of Video Foundation Model • Paper • 2407.06491 • Published Jul 9, 2024
OpenGVLab/VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B • Video-Text-to-Text • 9B • Updated May 16 • 81 • 5