From KMMLU-Redux to KMMLU-Pro: A Professional Korean Benchmark Suite for LLM Evaluation • Paper • 2507.08924 • Published Jul 2025
BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation • Paper • 2506.00482 • Published May 31, 2025
When AI Co-Scientists Fail: SPOT-a Benchmark for Automated Verification of Scientific Research • Paper • 2505.11855 • Published May 17, 2025
ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition • Paper • 2503.21248 • Published Mar 27, 2025
PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning • Paper • 2502.12054 • Published Feb 17, 2025
Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning • Paper • 2502.17407 • Published Feb 24, 2025
Navigating Korean LLM Research #2: Evaluation Tools • Article by amphora • Published Oct 23, 2024
HAE-RAE Bench: Evaluation of Korean Knowledge in Language Models • Paper • 2309.02706 • Published Sep 6, 2023