Benchmarks and challenges
PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models
Paper
• 2502.01584
• Published
• 9
CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging
Paper
• 2502.05664
• Published
• 24
Craw4LLM: Efficient Web Crawling for LLM Pretraining
Paper
• 2502.13347
• Published
• 30
Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark
Paper
• 2504.16427
• Published
• 18
PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models
Paper
• 2504.16074
• Published
• 36
V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models
Paper
• 2504.06148
• Published
• 13
DiaTool-DPO: Multi-Turn Direct Preference Optimization for Tool-Augmented Large Language Models
Paper
• 2504.02882
• Published
• 7
Pixels, Patterns, but No Poetry: To See The World like Humans
Paper
• 2507.16863
• Published
• 69
DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks
Paper
• 2509.01396
• Published
• 58
Symbolic Graphics Programming with Large Language Models
Paper
• 2509.05208
• Published
• 47