arxiv:2602.02905

FIRE-Bench: Evaluating Agents on the Rediscovery of Scientific Insights

Published on Feb 2 · Submitted by Zhen Wang on Feb 4

Abstract

Researchers developed FIRE-Bench, a comprehensive evaluation framework that challenges autonomous agents to rediscover established scientific findings through complete research cycles involving hypothesis generation, experimentation, coding, and evidence-based conclusion drawing.

AI-generated summary

Autonomous agents powered by large language models (LLMs) promise to accelerate scientific discovery end-to-end, but rigorously evaluating their capacity for verifiable discovery remains a central challenge. Existing benchmarks face a trade-off: they either heavily rely on LLM-as-judge evaluations of automatically generated research outputs or optimize convenient yet isolated performance metrics that provide coarse proxies for scientific insight. To address this gap, we introduce FIRE-Bench (Full-cycle Insight Rediscovery Evaluation), a benchmark that evaluates agents through the rediscovery of established findings from recent, high-impact machine learning research. Agents are given only a high-level research question extracted from a published, verified study and must autonomously explore ideas, design experiments, implement code, execute their plans, and derive conclusions supported by empirical evidence. We evaluate a range of state-of-the-art agents with frontier LLM backbones such as gpt-5 on FIRE-Bench. Our results show that full-cycle scientific research remains challenging for current agent systems: even the strongest agents achieve limited rediscovery success (<50 F1), exhibit high variance across runs, and display recurring failure modes in experimental design, execution, and evidence-based reasoning. FIRE-Bench provides a rigorous and diagnostic framework for measuring progress toward reliable agent-driven scientific discovery.
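For readers unfamiliar with claim-level scoring, the sketch below illustrates one way a rediscovery F1 could be computed: the agent's concluding claims are matched against the ground-truth claims of the original study, and precision and recall are taken over the matched sets. This is a hypothetical illustration, not the benchmark's actual scorer; the matching predicate (e.g., an LLM judge or an embedding-similarity check) is assumed to be supplied by the evaluator.

```python
# Hypothetical sketch of claim-level rediscovery F1 (not FIRE-Bench's actual scoring code).
from typing import Callable, List


def rediscovery_f1(
    agent_claims: List[str],
    ground_truth_claims: List[str],
    claims_match: Callable[[str, str], bool],
) -> float:
    """F1 over claims: precision = matched agent claims / all agent claims,
    recall = recovered ground-truth claims / all ground-truth claims."""
    if not agent_claims or not ground_truth_claims:
        return 0.0

    # An agent claim counts as correct if it matches any ground-truth claim.
    matched_agent = sum(
        any(claims_match(a, g) for g in ground_truth_claims) for a in agent_claims
    )
    # A ground-truth claim counts as recovered if any agent claim matches it.
    recovered_gt = sum(
        any(claims_match(a, g) for a in agent_claims) for g in ground_truth_claims
    )

    precision = matched_agent / len(agent_claims)
    recall = recovered_gt / len(ground_truth_claims)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under this reading, "<50 F1" means that even when some of an agent's conclusions align with the published findings, it either misses a substantial share of the ground-truth claims or asserts claims the original study does not support.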

Community

Paper author · Paper submitter

FIRE-Bench is a human-grounded benchmark designed to test whether AI can actually do science end-to-end, from ideation and planning to implementation, execution, and drawing conclusions. It converts recent, expert-validated scientific insights from top ML conferences into masked discovery challenges, forcing agents to rediscover human-verified insights rather than merely reproduce known methods.

By anchoring open-ended exploration to human-verified ground truth and evaluating discovery at the claim level, FIRE-Bench reveals a clear “science gap”: today’s best agents achieve limited rediscovery success (<50 F1), are unreliable across runs, and fail mainly at planning and reasoning rather than at coding. The benchmark offers a scalable, structured way to convert a paper into a constrained discovery problem, measuring progress toward reliable, creative, full-cycle scientific discovery and pointing toward live, continuously updated evaluation of research-capable AI.
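As a rough illustration of what "converting a paper into a constrained discovery problem" could look like in practice, the sketch below defines a hypothetical task record: the agent sees only the research question and allowed resources, while the verified claims stay hidden and are used solely for scoring. The field names are illustrative assumptions, not the benchmark's actual schema.

```python
# Hypothetical task record for a masked discovery challenge (illustrative only;
# field names are not taken from the FIRE-Bench release).
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DiscoveryTask:
    paper_id: str                       # source study the insight is drawn from
    research_question: str              # the only prompt the agent receives
    resources: List[str] = field(default_factory=list)            # e.g. datasets, models, compute limits
    ground_truth_claims: List[str] = field(default_factory=list)  # hidden; used only by the scorer

    def agent_view(self) -> Dict[str, object]:
        """Return the masked view shown to the agent (ground-truth claims withheld)."""
        return {
            "paper_id": self.paper_id,
            "research_question": self.research_question,
            "resources": self.resources,
        }
```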

