Submitted by huangsiteng 124 VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model · 16 authors 3
Submitted by TianxiangMa 94 HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning · 10 authors 202 2
Submitted by Haozhan72 60 SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning · 21 authors 519 2
Submitted by Yoohao 54 EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs · 7 authors 14 3
Submitted by taesiri 40 Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis · 14 authors 2
Submitted by Jarvis1111 35 Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents · 10 authors 2
Submitted by taesiri 30 FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark · 10 authors 46 2
Submitted by LanguageBind 26 Can Understanding and Generation Truly Benefit Together -- or Just Coexist? · 14 authors 2
Submitted by HaoyuDong 22 MachineLearningLM: Continued Pretraining Language Models on Millions of Synthetic Tabular Prediction Tasks Scales In-Context ML · 5 authors 11 3
Submitted by taesiri 18 SpatialVID: A Large-Scale Video Dataset with Spatial Annotations · 15 authors 164 2
Submitted by amant555 18 AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs · 8 authors 35 3
Submitted by ManTle 8 Visual Programmability: A Guide for Code-as-Thought in Chart Understanding · 9 authors 12 2
Submitted by orionweller 7 mmBERT: A Modern Multilingual Encoder with Annealed Language Learning · 6 authors 2
Submitted by moak7 7 Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes · 5 authors 8 2
Submitted by Kaichengalex 6 Gradient-Attention Guided Dual-Masking Synergetic Framework for Robust Text-based Person Retrieval · 6 authors 10 2
Submitted by learn12138 5 2D Gaussian Splatting with Semantic Alignment for Image Inpainting · 4 authors 2
Submitted by taesiri 3 LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering · 17 authors 6 2
Submitted by taesiri 3 OmniEVA: Embodied Versatile Planner via Task-Adaptive 3D-Grounded and Embodiment-aware Reasoning · 13 authors 2
Submitted by Bryceee 3 Towards Better Dental AI: A Multimodal Benchmark and Instruction Dataset for Panoramic X-ray Analysis · 10 authors 11 2
Submitted by oravus 2 ObjectReact: Learning Object-Relative Control for Visual Navigation · 8 authors 4 1
Submitted by weipang142857 2 The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward · 10 authors 2
Submitted by mmock 1 Cross-Domain Evaluation of Transformer-Based Vulnerability Detection on Open & Industry Data · 3 authors 2
Submitted by renkelin 1 Modality Alignment with Multi-scale Bilateral Attention for Multimodal Recommendation · 3 authors 1 2
Submitted by Kitxuu 1 All You Need Is A Fuzzing Brain: An LLM-Powered System for Automated Vulnerability Detection and Patching · 8 authors 29 2
Submitted by iliashum 1 Reasoning Introduces New Poisoning Attacks Yet Makes Them More Complicated · 7 authors 3