arxiv:2410.10934

Agent-as-a-Judge: Evaluate Agents with Agents

Published on Oct 14, 2024
· Submitted by Ksgk-fy on Oct 16, 2024
Abstract

The Agent-as-a-Judge framework uses agentic systems to evaluate other agentic systems, offering intermediate feedback and greater reliability than existing methods such as LLM-as-a-Judge on tasks like code generation.

AI-generated summary

Contemporary evaluation techniques are inadequate for agentic systems. These approaches either focus exclusively on final outcomes, ignoring the step-by-step nature of agentic systems, or require excessive manual labour. To address this, we introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems. This is an organic extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback for the entire task-solving process. We apply Agent-as-a-Judge to the task of code generation. To overcome issues with existing benchmarks and provide a proof-of-concept testbed for Agent-as-a-Judge, we present DevAI, a new benchmark of 55 realistic automated AI development tasks. It includes rich manual annotations, including a total of 365 hierarchical user requirements. We benchmark three popular agentic systems using Agent-as-a-Judge and find that it dramatically outperforms LLM-as-a-Judge and is as reliable as our human evaluation baseline. Altogether, we believe that Agent-as-a-Judge marks a concrete step forward for modern agentic systems by providing the rich and reliable reward signals necessary for dynamic and scalable self-improvement.
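
As a rough illustration of the idea the abstract describes, judging an agent requirement-by-requirement rather than only on its final output, here is a minimal sketch. It is not the paper's implementation: `call_llm`, `gather_evidence`, and the `Requirement` dataclass are hypothetical names, the model call is stubbed so the example runs, and the full framework's additional components are omitted.

```python
# Minimal sketch of Agent-as-a-Judge-style evaluation: a judge agent checks each
# hierarchical requirement against evidence gathered from the developer agent's
# workspace, yielding intermediate, per-requirement feedback rather than a single
# final score. All names here are illustrative, not the paper's API.
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Requirement:
    rid: str
    text: str
    depends_on: list = field(default_factory=list)  # prerequisite requirement ids


def call_llm(prompt: str) -> str:
    """Hypothetical judge-model call; stubbed so the sketch runs as-is."""
    return "satisfied"


def gather_evidence(workspace: Path, max_chars: int = 2000) -> str:
    """Tool step: read files the developer agent produced, as judging context."""
    chunks = []
    for path in sorted(workspace.rglob("*.py")):
        chunks.append(f"# {path.name}\n{path.read_text(errors='ignore')[:max_chars]}")
    return "\n\n".join(chunks) or "(empty workspace)"


def judge(requirements: list, workspace: Path) -> dict:
    """Return a verdict per requirement, skipping ones whose prerequisites failed."""
    verdicts = {}
    evidence = gather_evidence(workspace)
    for req in requirements:
        if any(verdicts.get(dep) != "satisfied" for dep in req.depends_on):
            verdicts[req.rid] = "blocked"  # intermediate feedback, not just a final grade
            continue
        prompt = (
            f"Requirement {req.rid}: {req.text}\n\n"
            f"Workspace evidence:\n{evidence}\n\n"
            "Reply 'satisfied' or 'unsatisfied' with a brief justification."
        )
        verdicts[req.rid] = call_llm(prompt).split()[0].lower()
    return verdicts


if __name__ == "__main__":
    reqs = [
        Requirement("R1", "Load the dataset from data/train.csv"),
        Requirement("R2", "Train a classifier and save it to model.pkl", depends_on=["R1"]),
    ]
    print(judge(reqs, Path(".")))
```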

Community

Paper submitter

Equip LLM judges with tools: Agent-as-a-Judge


Hi all,
I’ve been reading the Agent-as-a-Judge paper and found it quite interesting. One thing I’m unclear about is how this framework would generalize to real-world queries that don’t come with predefined requirements or dependencies (unlike in the DevAI benchmark).

The paper doesn’t go into much detail on this. How do you think requirements could be collected or inferred in a more open-ended, practical setting? Would it be an additional agent step, or maybe structured prompting?
Curious to hear how others are thinking about this.
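
One rough way to picture the "structured prompting" option mentioned above is a pre-processing step that asks a model to decompose the open-ended query into checkable requirements before any judging happens. This is only a sketch under assumptions: `extract_requirements`, the prompt wording, and the JSON shape are hypothetical, and the model call is stubbed so the example is self-contained.

```python
# Sketch of the "structured prompting" option: turn an open-ended user query into
# DevAI-style checkable requirements before any Agent-as-a-Judge evaluation runs.
import json


def call_llm(prompt: str) -> str:
    """Hypothetical model call; returns a canned response for illustration."""
    return json.dumps([
        {"id": "R1", "text": "Load the CSV mentioned in the request", "depends_on": []},
        {"id": "R2", "text": "Train and save a baseline model", "depends_on": ["R1"]},
    ])


def extract_requirements(user_query: str) -> list:
    """Ask the model for a verifiable, dependency-ordered requirement list."""
    prompt = (
        "Break the following request into verifiable requirements.\n"
        'Return JSON: [{"id": ..., "text": ..., "depends_on": [...]}].\n\n'
        f"Request: {user_query}"
    )
    return json.loads(call_llm(prompt))


print(extract_requirements("Build a churn model from data/customers.csv"))
```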
