Submitted by Hennara 83 Baseer: A Vision-Language Model for Arabic Document-to-Markdown OCR · 7 authors 1
Submitted by taesiri 32 MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe · 34 authors 22k 1
Submitted by taesiri 17 Hyper-Bagel: A Unified Acceleration Framework for Multimodal Understanding and Generation · 7 authors 1
Submitted by lhmd 14 VolSplat: Rethinking Feed-Forward 3D Gaussian Splatting with Voxel-Aligned Prediction · 10 authors 40 3
Submitted by taesiri 11 Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation · 13 authors 135 3
Submitted by Yunzhen 11 What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT · 5 authors 1
Submitted by MinhDucBui 5 Large Language Models Discriminate Against Speakers of German Dialects · 5 authors 1
Submitted by ultra7chen 3 CAR-Flow: Condition-Aware Reparameterization Aligns Source and Target for Better Flow Matching · 10 authors 1
Submitted by emilia-wisnios 3 OpenGVL - Benchmarking Visual Temporal Progress for Data Curation · 6 authors 1
Submitted by ZipW 3 HyRF: Hybrid Radiance Fields for Memory-efficient and High-quality Novel View Synthesis · 2 authors 26 1
Submitted by Silin-Chen 3 SWE-QA: Can Language Models Answer Repository-level Code Questions? · 6 authors 16 1
Submitted by Two-hot 2 How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective · 18 authors 1
Submitted by spapi 2 Better Late Than Never: Evaluation of Latency Metrics for Simultaneous Speech-to-Text Translation · 4 authors 1
Submitted by taesiri 1 Zero-Shot Multi-Spectral Learning: Reimagining a Generalist Multimodal Gemini 2.5 Model for Remote Sensing Applications · 7 authors 1
Submitted by conan1024hao 1 VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction · 14 authors 1 1
Submitted by Fictionary 1 GeoSVR: Taming Sparse Voxels for Geometrically Accurate Surface Reconstruction · 7 authors 29 1
Submitted by abhilekhborah - DRISHTIKON: A Multimodal Multilingual Benchmark for Testing Language Models' Understanding on Indian Culture · 9 authors 1
Submitted by jesbu1 - PEEK: Guiding and Minimal Image Representations for Zero-Shot Generalization of Robot Manipulation Policies · 9 authors 1 1