LongVLM: Efficient Long Video Understanding via Large Language Models Paper • 2404.03384 • Published Apr 4, 2024
MALMM: Multi-Agent Large Language Models for Zero-Shot Robotics Manipulation Paper • 2411.17636 • Published Nov 26, 2024
RoomTour3D: Geometry-Aware Video-Instruction Tuning for Embodied Navigation Paper • 2412.08591 • Published Dec 11, 2024
WorldWeaver: Generating Long-Horizon Video Worlds via Rich Perception Paper • 2508.15720 • Published Aug 21, 2025
Self-Consistency as a Free Lunch: Reducing Hallucinations in Vision-Language Models via Self-Reflection Paper • 2509.23236 • Published Sep 27, 2025
Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos Paper • 2312.10300 • Published Dec 16, 2023