From KMMLU-Redux to KMMLU-Pro: A Professional Korean Benchmark Suite for LLM Evaluation Paper • 2507.08924 • Published 16 days ago • 17
REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards Paper • 2505.24760 • Published May 30 • 66
Healthy LLMs? Benchmarking LLM Knowledge of UK Government Public Health Information Paper • 2505.06046 • Published May 9 • 15
LLMs Do Not Think Step-by-step In Implicit Reasoning Paper • 2411.15862 • Published Nov 24, 2024 • 10
GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models Paper • 2411.05830 • Published Nov 5, 2024 • 22
Attention Is All You Need But You Don't Need All Of It For Inference of Large Language Models Paper • 2407.15516 • Published Jul 22, 2024 • 1
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions Paper • 2406.15877 • Published Jun 22, 2024 • 48
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing Paper • 2406.08464 • Published Jun 12, 2024 • 70