ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering
Abstract
ALE-Bench evaluates AI systems on score-based algorithmic programming contests drawn from AtCoder Heuristic Contests, focusing on long-horizon iterative problem-solving in domains such as package-delivery routing, crew scheduling, factory production planning, and power-grid balancing.
How well do AI systems perform in algorithm engineering for hard optimization problems in domains such as package-delivery routing, crew scheduling, factory production planning, and power-grid balancing? We introduce ALE-Bench, a new benchmark for evaluating AI systems on score-based algorithmic programming contests. Drawing on real tasks from the AtCoder Heuristic Contests, ALE-Bench presents optimization problems that are computationally hard and admit no known exact solution. Unlike short-duration, pass/fail coding benchmarks, ALE-Bench encourages iterative solution refinement over long time horizons. Our software framework supports interactive agent architectures that leverage test-run feedback and visualizations. Our evaluation of frontier LLMs revealed that while they achieve high performance on specific problems, a notable gap remains relative to humans in consistency across problems and in long-horizon problem-solving, underscoring the need for benchmarks like ALE-Bench to drive future AI advances.
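To illustrate the kind of interactive loop such a framework enables, below is a minimal sketch of an agent that repeatedly proposes a solution program, runs it against public test cases, and uses the resulting scores as feedback for the next revision. All names here (`refine_solution`, `propose_solution`, `run_test_cases`) are hypothetical placeholders for illustration, not the actual ALE-Bench API.

```python
# Minimal sketch of a score-feedback refinement loop (hypothetical names,
# not the actual ALE-Bench interface). The agent proposes code, evaluates
# it on public test cases, and revises it based on the scores it observes.

def refine_solution(agent, problem_statement, run_test_cases, budget=20):
    """Iteratively improve a candidate program under a fixed revision budget."""
    best_code, best_score = None, float("-inf")
    feedback = ""  # accumulated test-run feedback shown to the agent

    for step in range(budget):
        # The agent (e.g., an LLM wrapper) drafts or revises a program.
        code = agent.propose_solution(problem_statement,
                                      previous=best_code,
                                      feedback=feedback)

        # Evaluate on public test cases; each result carries a score and logs.
        results = run_test_cases(code)
        score = sum(r.score for r in results) / len(results)

        if score > best_score:
            best_code, best_score = code, score

        # Summarize per-case scores and errors as feedback for the next step.
        feedback = "\n".join(
            f"case {r.case_id}: score={r.score} stderr={r.stderr[:200]}"
            for r in results
        )

    return best_code, best_score
```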
Community
ALE-Bench (ALgorithm Engineering Benchmark) is a next-generation LLM benchmark for algorithmic coding, designed to test long-horizon reasoning on complex problems through trial and error.
This is the first benchmark of its kind, built on past problems from AtCoder Heuristic Contests (AHC). Unlike conventional competition coding benchmarks, it features hard optimization problems whose true optima are computationally out of reach (e.g., NP-hard problems). Human participants spend weeks iteratively refining their programs to push their scores higher. ALE-Bench simulates an AI's participation in AHC to test whether the AI can, like top human experts, discover creative high-scoring solutions that are often unforeseen even by the organizers.
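For context, the programs contestants iterate on during such contests are typically heuristic searches such as hill climbing or simulated annealing. The sketch below shows the generic shape of one such solver; `random_solution`, `neighbor`, and `score` are problem-specific placeholders a reader would supply for a concrete task, and none of this is part of ALE-Bench itself.

```python
# Generic simulated-annealing skeleton of the kind contestants refine over
# a contest. random_solution, neighbor, and score are problem-specific
# placeholders, not part of ALE-Bench.

import math
import random
import time

def anneal(random_solution, neighbor, score, time_limit=2.0,
           t_start=2000.0, t_end=10.0):
    """Maximize score(solution) by simulated annealing within a time limit."""
    start = time.time()
    current = random_solution()
    current_score = score(current)
    best, best_score = current, current_score

    while True:
        elapsed = time.time() - start
        if elapsed > time_limit:
            break
        # Temperature decays geometrically from t_start to t_end.
        t = t_start * (t_end / t_start) ** (elapsed / time_limit)

        candidate = neighbor(current)
        candidate_score = score(candidate)
        delta = candidate_score - current_score

        # Always accept improvements; accept worsening moves with
        # probability exp(delta / t), which shrinks as t cools.
        if delta >= 0 or random.random() < math.exp(delta / t):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score

    return best, best_score
```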
Librarian Bot: The following similar papers were recommended by the Semantic Scholar API.
- ICPC-Eval: Probing the Frontiers of LLM Reasoning with Competitive Programming Contests (2025)
- OPT-BENCH: Evaluating LLM Agent on Large-Scale Search Spaces Optimization Problems (2025)
- MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering (2025)
- GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents (2025)
- Afterburner: Reinforcement Learning Facilitates Self-Improving Code Efficiency Optimization (2025)
- SwingArena: Competitive Programming Arena for Long-context GitHub Issue Solving (2025)
- ModelingAgent: Bridging LLMs and Mathematical Modeling for Real-World Challenges (2025)