Submitted by Luo2003 89 RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation · 14 authors
Submitted by taesiri 37 MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer · 27 authors
Submitted by fjxmlzn 28 Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification · 6 authors
Submitted by yifanzhang114 17 BaseReward: A Strong Baseline for Multimodal Reward Model · 15 authors
Submitted by taesiri 8 A Vision-Language-Action-Critic Model for Robotic Real-World Reinforcement Learning · 10 authors
Submitted by fangli3 3 RGB-Only Supervised Camera Parameter Optimization in Dynamic Scenes · 3 authors
Submitted by dlion168 2 Do You Hear What I Mean? Quantifying the Instruction-Perception Gap in Instruction-Guided Expressive Text-To-Speech Systems · 5 authors
Submitted by liuzhan22 1 Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing · 6 authors
Submitted by TAESOO98 1 Towards Human-like Multimodal Conversational Agent by Generating Engaging Speech · 4 authors
Submitted by taesiri 1 Video2Roleplay: A Multimodal Dataset and Framework for Video-Guided Role-playing Agents · 7 authors
Submitted by tetrisd 1 WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers · 3 authors
Submitted by leolin9248 - Ask-to-Clarify: Resolving Instruction Ambiguity through Multi-turn Dialogue · 8 authors