CooperBench: Why Coding Agents Cannot be Your Teammates Yet Paper • 2601.13295 • Published Jan 19, 2026 • 3
Towards Comprehensive Semantic Speech Embeddings for Chinese Dialects Paper • 2601.07274 • Published Jan 12, 2026 • 1
Imprecise Label Learning: A Unified Framework for Learning with Various Imprecise Label Configurations Paper • 2305.12715 • Published May 22, 2023
Measuring Sycophancy of Language Models in Multi-turn Dialogues Paper • 2505.23840 • Published May 28, 2025 • 2
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning Paper • 2507.00432 • Published Jul 1, 2025 • 79
OptimalThinkingBench: Evaluating Over and Underthinking in LLMs Paper • 2508.13141 • Published Aug 18, 2025
VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding Paper • 2509.21451 • Published Sep 25, 2025
SPICE: Self-Play In Corpus Environments Improves Reasoning Paper • 2510.24684 • Published Oct 28, 2025 • 18
DesignPref: Capturing Personal Preferences in Visual Design Generation Paper • 2511.20513 • Published Nov 25, 2025
PWESuite: Phonetic Word Embeddings and Tasks They Facilitate Paper • 2304.02541 • Published Apr 5, 2023 • 2
Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks Paper • 2411.05361 • Published Nov 8, 2024 • 5
POWSM: A Phonetic Open Whisper-Style Speech Foundation Model Paper • 2510.24992 • Published Oct 28, 2025 • 4
RefineBench: Evaluating Refinement Capability of Language Models via Checklists Paper • 2511.22173 • Published Nov 27, 2025 • 15
TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action Paper • 2505.01583 • Published May 2, 2025 • 8