Papers
arxiv:2510.03696

Mind the Goal: Data-Efficient Goal-Oriented Evaluation of Conversational Agents and Chatbots using Teacher Models

Published on Oct 4
Authors:
,
,
,

Abstract

A framework for evaluating goal-oriented success in multi-agent chatbots introduces Goal Success Rate and Root Cause of Failure taxonomy, using teacher LLMs and thinking tokens to improve evaluation quality and system performance.

AI-generated summary

Evaluating the quality of multi-turn chatbot interactions remains challenging, as most existing methods assess interactions at the turn level without addressing whether a user's overarching goal was fulfilled. A ``goal'' here refers to an information need or task, such as asking for policy information or applying for leave. We propose a comprehensive framework for goal-oriented evaluation of multi-agent systems (MAS), introducing the Goal Success Rate (GSR) to measure the percentage of fulfilled goals, and a Root Cause of Failure (RCOF) taxonomy to identify reasons for failure in multi-agent chatbots. Our method segments conversations by user goals and evaluates success using all relevant turns. We present a model-based evaluation system combining teacher LLMs, where domain experts define goals, set quality standards serving as a guidance for the LLMs. The LLMs use ``thinking tokens'' to produce interpretable rationales, enabling explainable, data-efficient evaluations. In an enterprise setting, we apply our framework to evaluate AIDA, a zero-to-one employee conversational agent system built as a ground-up multi-agent conversational agent, and observe GSR improvement from 63\% to 79\% over six months since its inception. Our framework is generic and offers actionable insights through a detailed defect taxonomy based on analysis of failure points in multi-agent chatbots, diagnosing overall success, identifying key failure modes, and informing system improvements.

Community

Evaluation of agents is a complex topic, usually done using HELM, Helpfulness or LLM as a judge approach. There are very few work that discuss evaluation of agent from a goal perspective or task perspective. Our work provides a continous improvement framework for AI agents through

  • goal oriented evaluation
  • comprehensive error taxonomy and root causing
  • data efficient uses test-time scaling to avoid human efforts

would love to hear thoughts on the framework and suggestions !

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2510.03696 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2510.03696 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2510.03696 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.