ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents
Abstract
ResearchRubrics is a benchmark for evaluating deep research agents, using expert rubrics to assess their factual grounding, reasoning, and clarity across diverse, complex tasks.
Deep Research (DR) is an emerging agent application that leverages large language models (LLMs) to address open-ended queries. It requires the integration of several capabilities, including multi-step reasoning, cross-document synthesis, and the generation of evidence-backed, long-form answers. Evaluating DR remains challenging because responses are lengthy and diverse, admit many valid solutions, and often depend on dynamic information sources. We introduce ResearchRubrics, a standardized benchmark for DR built with over 2,800 hours of human labor that pairs realistic, domain-diverse prompts with more than 2,500 expert-written, fine-grained rubrics to assess factual grounding, reasoning soundness, and clarity. We also propose a new complexity framework for categorizing DR tasks along three axes: conceptual breadth, logical nesting, and exploration. In addition, we develop human and model-based evaluation protocols that measure rubric adherence for DR agents. We evaluate several state-of-the-art DR systems and find that even leading agents such as Gemini's DR and OpenAI's DR achieve under 68% average compliance with our rubrics, primarily due to missed implicit context and inadequate reasoning about retrieved information. Our results highlight the need for robust, scalable assessment of deep research capabilities, to which end we release ResearchRubrics (including all prompts, rubrics, and evaluation code) to facilitate progress toward well-justified research assistants.
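To make the headline "average rubric compliance" metric concrete, here is a minimal sketch of how such a score could be computed. It assumes a hypothetical data layout in which each prompt carries a list of expert-written rubric criteria and a judge (human or model) has recorded a binary verdict per criterion; the actual schema and aggregation in the released evaluation code may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RubricCriterion:
    description: str   # expert-written criterion, e.g. "grounds claims in cited evidence"
    satisfied: bool    # judge's binary verdict for one agent response


@dataclass
class TaskResult:
    prompt_id: str
    criteria: List[RubricCriterion]


def task_compliance(result: TaskResult) -> float:
    """Fraction of rubric criteria that one agent response satisfies."""
    if not result.criteria:
        return 0.0
    return sum(c.satisfied for c in result.criteria) / len(result.criteria)


def average_compliance(results: List[TaskResult]) -> float:
    """Macro-average compliance across prompts (a headline-style aggregate)."""
    return sum(task_compliance(r) for r in results) / len(results)


if __name__ == "__main__":
    # Toy illustration only; prompts, criteria, and verdicts are made up.
    demo = [
        TaskResult("q1", [RubricCriterion("cites primary sources", True),
                          RubricCriterion("addresses implicit constraints", False)]),
        TaskResult("q2", [RubricCriterion("reasoning steps are sound", True)]),
    ]
    print(f"Average rubric compliance: {average_compliance(demo):.1%}")
```

Macro-averaging over prompts (rather than pooling all criteria) keeps a prompt with many criteria from dominating the score; whether the benchmark weights criteria or prompts differently is determined by the released evaluation code.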
Community
ResearchRubrics introduces a standardized benchmark pairing prompts and rubrics to evaluate deep research agents, with a complexity framework and evaluation protocols uncovering current limitations in rubric compliance.
arXiv explained breakdown of this paper: https://arxivexplained.com/papers/researchrubrics-a-benchmark-of-prompts-and-rubrics-for-evaluating-deep-research-agents
This is an automated message from the Librarian Bot. I found the following similar papers, recommended by the Semantic Scholar API:
- WebDevJudge: Evaluating (M)LLMs as Critiques for Web Development Quality (2025)
- LPFQA: A Long-Tail Professional Forum-based Benchmark for LLM Evaluation (2025)
- Towards Personalized Deep Research: Benchmarks and Evaluations (2025)
- TRUEBench: Can LLM Response Meet Real-world Constraints as Productivity Assistant? (2025)
- Benchmarking Chinese Commonsense Reasoning with a Multi-hop Reasoning Perspective (2025)
- EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving (2025)
- SurveyBench: Can LLM(-Agents) Write Academic Surveys that Align with Reader Needs? (2025)
You can ask Librarian Bot directly for paper recommendations by tagging it in a comment: @librarian-bot recommend