SAND-Math: Using LLMs to Generate Novel, Difficult and Useful Mathematics Questions and Answers • Paper 2507.20527 • Published 3 days ago
TTT-Bench: A Benchmark for Evaluating Reasoning Ability with Simple and Novel Tic-Tac-Toe-style Games • Paper 2506.10209 • Published Jun 11