arxiv:2506.09050

ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering

Published on Jun 10 · Submitted by iwiwi on Jun 17
Abstract

AI-generated summary: ALE-Bench evaluates AI systems on score-based algorithmic programming contests drawn from AtCoder, focusing on long-term iterative problem-solving in domains like package-delivery routing, crew scheduling, factory production, and power-grid balancing.

How well do AI systems perform in algorithm engineering for hard optimization problems in domains such as package-delivery routing, crew scheduling, factory production planning, and power-grid balancing? We introduce ALE-Bench, a new benchmark for evaluating AI systems on score-based algorithmic programming contests. Drawing on real tasks from the AtCoder Heuristic Contests, ALE-Bench presents optimization problems that are computationally hard and admit no known exact solution. Unlike short-duration, pass/fail coding benchmarks, ALE-Bench encourages iterative solution refinement over long time horizons. Our software framework supports interactive agent architectures that leverage test-run feedback and visualizations. Our evaluation of frontier LLMs revealed that while they demonstrate high performance on specific problems, a notable gap remains compared to humans in terms of consistency across problems and long-horizon problem-solving capabilities. This highlights the need for this benchmark to foster future AI advancements.
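To make the "iterative refinement with test-run feedback" idea concrete, here is a minimal sketch of such a loop. The callables `draft`, `revise`, and `evaluate` are hypothetical placeholders for an LLM agent and a scoring harness; they are not the actual ALE-Bench API.

```python
import time
from typing import Callable, Tuple


def refine_solution(
    statement: str,
    draft: Callable[[str], str],                   # problem statement -> initial program source
    revise: Callable[[str, str], str],             # (program, feedback) -> revised program
    evaluate: Callable[[str], Tuple[float, str]],  # program -> (score, textual feedback)
    budget_seconds: float = 4 * 3600,
) -> Tuple[str, float]:
    """Iteratively improve a program using score feedback, keeping the best version seen."""
    deadline = time.time() + budget_seconds
    code = draft(statement)
    best_code, best_score = code, float("-inf")

    while time.time() < deadline:
        score, feedback = evaluate(code)           # e.g. run on public test cases, sum per-case scores
        if score > best_score:
            best_code, best_score = code, score
        code = revise(code, feedback)              # next attempt, informed by the test-run results

    return best_code, best_score
```

The point of the sketch is the long horizon: unlike pass/fail benchmarks, the score is unbounded in practice, so the loop keeps running until the time budget is spent rather than stopping at the first accepted solution.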

Community

Paper author · Paper submitter · edited Jun 17

ALE-Bench (ALgorithm Engineering Benchmark) is a next-generation LLM benchmark for algorithmic coding, designed to test long-horizon reasoning on complex problems through trial and error.

This is the first benchmark of its kind, built on past problems from AtCoder Heuristic Contests (AHC). Unlike conventional competition coding benchmarks, it features hard optimization problems whose true optima are computationally out of reach (e.g., NP-hard problems). Human participants spend weeks iteratively refining their programs to push their scores higher. ALE-Bench simulates an AI's participation in AHC to test whether the AI can, like top human experts, discover creative high-scoring solutions that are often unforeseen even by the organizers.
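For readers unfamiliar with score-based contests: participants typically start from a simple feasible solution and spend their time budget on local-search refinement such as simulated annealing. The toy example below is a generic 2-opt annealer on random 2-D points; it is not taken from the paper or from any AHC task, but it shows the shape of the loop that contestants iterate on and tune for weeks.

```python
import math
import random
import time


def tour_length(points, order):
    # Total length of the closed tour visiting points in the given order.
    return sum(
        math.dist(points[order[i]], points[order[(i + 1) % len(order)]])
        for i in range(len(order))
    )


def anneal(points, time_limit=1.0, t_start=1.0, t_end=0.01):
    # Start from the identity permutation and refine it with segment-reversal (2-opt) moves.
    order = list(range(len(points)))
    cur = best = tour_length(points, order)
    start = time.time()

    while (elapsed := time.time() - start) < time_limit:
        temp = t_start * (t_end / t_start) ** (elapsed / time_limit)  # geometric cooling
        i, j = sorted(random.sample(range(len(points)), 2))
        order[i:j + 1] = reversed(order[i:j + 1])        # propose reversing a segment
        new = tour_length(points, order)
        if new < cur or random.random() < math.exp((cur - new) / temp):
            cur = new                                     # accept (always if it improves the score)
            best = min(best, cur)
        else:
            order[i:j + 1] = reversed(order[i:j + 1])     # reject: undo the reversal
    return best


if __name__ == "__main__":
    random.seed(0)
    pts = [(random.random(), random.random()) for _ in range(100)]
    print(f"best tour length found: {anneal(pts):.3f}")
```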



Models citing this paper 0


Datasets citing this paper 1

Spaces citing this paper 0


Collections including this paper 1