How to evaluate AI agents

I recently took a course on Evaluating AI Agents. This is a summary of what I learnt.

What is an AI Agent?

AI agents use LLM-driven reasoning to take actions on a user’s behalf.

They do three things:

Reasoning: Powered by an LLM to understand requests and plan actions.
Routing: Interpreting the request to determine the correct tool or skill to use.
Action: Executing code, tools, or APIs to fulfill the request.

When evaluating AI agents, we need to cover the whole process: tool selection, API call correctness, use of context, and the overall correctness of the result.

Agent Architecture

A typical agent consists of three main components:

Router: The central planner that decides which skill or function to call. This could be an LLM, a classifier, or rule-based code.
Skills/Tools: The capabilities the agent possesses. A skill is a chain of logic to complete a task, which can involve LLM calls or API calls, or similar. A common example is a RAG (Retrieval-Augmented Generation) skill.
Memory and State: A shared state accessible by all components, used to store context, configuration, and execution history.
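To make the three components concrete, here is a minimal sketch of how they might fit together. Everything in it is hypothetical: the skill names (`lookup_order`, `answer_faq`), the keyword-based `route` function, and the `AgentState` class are illustrative stand-ins, not code from the course. A real router would more likely be an LLM with function calling.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AgentState:
    """Shared memory: context plus a history of executed steps."""
    context: Dict[str, str] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)


# Hypothetical skills; each receives the query and the shared state.
def lookup_order(query: str, state: AgentState) -> str:
    return "order 123 is in transit"


def answer_faq(query: str, state: AgentState) -> str:
    return "our return window is 30 days"


SKILLS: Dict[str, Callable[[str, AgentState], str]] = {
    "lookup_order": lookup_order,
    "answer_faq": answer_faq,
}


def route(query: str) -> str:
    """Rule-based router; an LLM or classifier could fill the same role."""
    return "lookup_order" if "order" in query.lower() else "answer_faq"


def run_agent(query: str, state: AgentState) -> str:
    skill_name = route(query)
    result = SKILLS[skill_name](query, state)
    state.history.append(skill_name)  # record the step for later evaluation
    return result
```

Because every step is appended to `state.history`, the same structure that drives the agent also produces the record that later evaluation relies on.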

A diagram showing the components of an example agent

Observability: Understanding Agent Behavior

To evaluate an agent, you first need to understand what it’s doing internally. Observability is often achieved using standards like OpenTelemetry (OTel), which records the agent’s execution path as traces and spans.

Traces: Represent one end-to-end run of the agent.
Spans: Represent data captured on individual steps within a trace (e.g., a single tool call or LLM call).

Tools like Arize Phoenix can help automate the instrumentation of these traces, providing a detailed log for debugging and performance evaluation.
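The relationship between traces and spans can be pictured with a small data model. This is a sketch using only the Python standard library. It is not the OpenTelemetry or Phoenix API, and the class and span names are made up for illustration. The point is just that a trace is a collection of spans sharing a trace ID, with parent links expressing nesting.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """One step in a run, e.g. a single tool call or LLM call."""
    name: str
    trace_id: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    start: float = field(default_factory=time.perf_counter)
    end: Optional[float] = None

    def finish(self) -> None:
        self.end = time.perf_counter()


@dataclass
class Trace:
    """One end-to-end run of the agent, made up of spans."""
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: List[Span] = field(default_factory=list)

    def start_span(self, name: str, parent: Optional[Span] = None) -> Span:
        span = Span(
            name=name,
            trace_id=self.trace_id,
            parent_id=parent.span_id if parent else None,
        )
        self.spans.append(span)
        return span


# Illustrative run: a root span with a router step and a tool call under it.
trace = Trace()
root = trace.start_span("agent_run")
router = trace.start_span("router", parent=root)
router.finish()
tool = trace.start_span("tool:lookup_order", parent=root)
tool.finish()
root.finish()
```

In practice an instrumentation library captures this for you, but the shape of the data (one trace, many nested spans) is what the evaluators later read from.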

Evaluation Techniques

There are three primary methods for evaluating agent components:

Code-Based Evaluators: Similar to traditional software testing. Use code (e.g., regex matching, JSON parsing, checking against a known correct output) to validate the agent’s output. Best for deterministic, inflexible outputs.
LLM-as-a-Judge: Use a powerful LLM to judge a specific dimension of your agent’s output (e.g., relevance, correctness). This is flexible but will not be 100% accurate. It’s better to use discrete classification labels (e.g., ‘correct’/‘incorrect’) rather than continuous scores.
Human Annotation: Have humans label traces with feedback (e.g., thumbs up/down). This is high-quality but labor-intensive and hard to scale.
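As a rough illustration of the code-based approach, the checks below cover the three examples mentioned above: JSON parsing, regex matching, and comparison against a known correct output. The function names and the normalisation in `eval_exact_match` (trimming whitespace and ignoring case) are my own assumptions, not something prescribed by the course.

```python
import json
import re
from typing import Set


def eval_valid_json(output: str, required_keys: Set[str]) -> bool:
    """Pass if the output parses as a JSON object containing the required keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


def eval_matches_pattern(output: str, pattern: str) -> bool:
    """Pass if the whole (trimmed) output matches the regex."""
    return re.fullmatch(pattern, output.strip()) is not None


def eval_exact_match(output: str, expected: str) -> bool:
    """Pass if the output equals the known answer, ignoring case and outer whitespace."""
    return output.strip().lower() == expected.strip().lower()
```

Each evaluator returns a plain boolean, which makes results easy to aggregate into a pass rate across a dataset.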

A diagram showing when to use each evaluation technique

What to Evaluate in an Agent

Evaluation should target the different parts of the agent’s process.

1. Router Evaluation

Function Calling Choice: Did the router select the correct skill/tool for the user’s query?
Parameter Extraction: Did it correctly extract the necessary parameters from the user’s input for the chosen function?
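Both router checks can be done with code when you have a labelled expected tool call for each query. The sketch below assumes a hypothetical `ToolCall` record holding the tool name and its arguments, and it only counts parameters as correct when the right tool was chosen (a design choice of mine, since arguments for the wrong function are rarely meaningful).

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ToolCall:
    name: str
    arguments: Dict[str, Any]


def eval_router(actual: ToolCall, expected: ToolCall) -> Dict[str, bool]:
    """Score one routing decision on tool choice and parameter extraction."""
    tool_ok = actual.name == expected.name
    # Parameters only count if the correct tool was selected.
    params_ok = tool_ok and actual.arguments == expected.arguments
    return {"tool_choice": tool_ok, "parameters": params_ok}


def accuracy(results: List[Dict[str, bool]], key: str) -> float:
    """Fraction of results where the given check passed."""
    if not results:
        return 0.0
    return sum(r[key] for r in results) / len(results)
```

For example, scoring a run where the router picked the right tool but extracted the wrong city:

```python
expected = ToolCall("get_weather", {"city": "Paris"})
actual = ToolCall("get_weather", {"city": "London"})
print(eval_router(actual, expected))
```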

2. Skill Evaluation

The evaluation method depends on the type of skill:

Deterministic Skills: Use code-based evaluations (e.g., does the output parse correctly?).
Non-Deterministic Skills (LLM-based): Use an LLM-as-a-judge to evaluate aspects like relevance, hallucination, correctness, or readability.
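An LLM-as-a-judge evaluator for a non-deterministic skill might look roughly like this. The prompt template, the label set, and the `call_llm` parameter are all assumptions for illustration: `call_llm` stands for whatever function sends a prompt to your model and returns its text, and `fake_llm` is a stub so the sketch runs without a model. The judge is asked for a discrete label, and anything outside the allowed labels is flagged rather than guessed at.

```python
from typing import Callable

JUDGE_TEMPLATE = """You are evaluating whether an answer is relevant to a question.
Question: {question}
Answer: {answer}
Respond with exactly one word: "relevant" or "irrelevant"."""

ALLOWED_LABELS = ("relevant", "irrelevant")


def judge_relevance(question: str, answer: str, call_llm: Callable[[str], str]) -> str:
    """Ask the judge model for a discrete label and validate its reply."""
    prompt = JUDGE_TEMPLATE.format(question=question, answer=answer)
    # Normalise case and strip surrounding punctuation before matching.
    raw = call_llm(prompt).strip().lower().strip(".\"'")
    return raw if raw in ALLOWED_LABELS else "unparseable"


def fake_llm(prompt: str) -> str:
    """Stub standing in for a real model call."""
    return "Relevant."


label = judge_relevance("What is the return window?", "30 days.", fake_llm)
```

Tracking how often the judge returns `unparseable` is a cheap sanity check on the judge prompt itself.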

A diagram showing how to evaluate skills in an example agent

3. Agent Trajectory Evaluation

The agent trajectory is the path of steps the agent took to answer a query. Evaluating the trajectory measures how efficiently the agent reached its result.

Convergence measures how closely the agent follows the optimal path for a given query. The goal is to reduce unnecessary steps, which lowers cost, latency, and variability.
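One way to put a number on convergence is to run the same kind of query several times, count the steps in each run, and average the ratio of the optimal step count to the actual step count. The formula below is a sketch under that assumption. If no optimal length is known, it falls back to the shortest observed run, which is itself an approximation.

```python
from typing import List, Optional


def convergence_score(step_counts: List[int], optimal_steps: Optional[int] = None) -> float:
    """Average of optimal_steps / actual_steps over runs; 1.0 means every run was optimal."""
    if not step_counts:
        raise ValueError("need at least one run")
    if any(n <= 0 for n in step_counts):
        raise ValueError("step counts must be positive")
    if optimal_steps is None:
        # Assumption: treat the shortest observed run as the optimal path.
        optimal_steps = min(step_counts)
    return sum(optimal_steps / n for n in step_counts) / len(step_counts)


# Four runs of similar queries taking 3, 3, 4 and 6 steps.
score = convergence_score([3, 3, 4, 6])
print(score)  # 0.8125
```

A score that drifts down after a prompt or tool change suggests the agent has started taking detours, even if its final answers are still correct.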

Evaluation-Driven Development

This is an iterative process where evaluation guides the development and improvement of your agent.

The core loop is:

Curate a Dataset: Collect a comprehensive set of test cases, each with an input and the expected output.
Track Experiments: Run the dataset through different versions of your agent (e.g., with a new prompt, model, or tool) and record the results as an “experiment”.
Evaluate: Run your evaluators on the experiment’s results.
Compare: Compare the evaluation scores across different versions to determine which changes led to improvements.
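The loop above can be sketched in a few lines. This is a toy version under my own assumptions: each test case pairs an input with an expected output, an agent is any function from input text to output text, the evaluator is a simple exact match, and `agent_v1` and `agent_v2` are made-up versions. A tool like Phoenix would persist each experiment rather than holding results in memory.

```python
from typing import Callable, Dict, List

Agent = Callable[[str], str]
Dataset = List[Dict[str, str]]  # each case: {"input": ..., "expected": ...}

DATASET: Dataset = [
    {"input": "I was charged twice", "expected": "billing"},
    {"input": "Reset my password", "expected": "account"},
    {"input": "Refund my last invoice", "expected": "billing"},
    {"input": "Change my email address", "expected": "account"},
]


def run_experiment(agent: Agent, dataset: Dataset) -> List[Dict[str, str]]:
    """Run every test case through one agent version and record its output."""
    return [{**case, "output": agent(case["input"])} for case in dataset]


def evaluate(results: List[Dict[str, str]]) -> float:
    """Code-based evaluator: fraction of outputs that match the expected value."""
    return sum(r["output"] == r["expected"] for r in results) / len(results)


def compare(agents: Dict[str, Agent], dataset: Dataset) -> Dict[str, float]:
    """Score each agent version on the same dataset."""
    return {name: evaluate(run_experiment(agent, dataset)) for name, agent in agents.items()}


def agent_v1(text: str) -> str:
    return "billing"  # naive baseline: always the same answer


def agent_v2(text: str) -> str:
    keywords = ("charged", "invoice", "refund")
    return "billing" if any(k in text.lower() for k in keywords) else "account"


scores = compare({"v1": agent_v1, "v2": agent_v2}, DATASET)
print(scores)  # {'v1': 0.5, 'v2': 1.0}
```

Keeping the dataset fixed while only the agent version changes is what makes the scores comparable.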

A dashboard in Arize Phoenix showing metrics on different versions of an agent

This process creates a cycle of feedback, where insights from testing and even production can be used to refine the agent and the test dataset itself.

A diagram showing the development and production process