How to evaluate AI agents
I recently took a course on Evaluating AI Agents. This is a summary of what I learnt.
What is an AI Agent?
AI agents use generative-AI-driven reasoning to take actions on a user’s behalf.
They do three things:
- Reasoning: Powered by an LLM to understand requests and plan actions.
- Routing: Interpreting the request to determine the correct tool or skill to use.
- Action: Executing code, tools, or APIs to fulfill the request.
When evaluating AI agents, we need to cover the whole process: tool selection, API call correctness, use of context, and the overall correctness of the result.
Agent Architecture
A typical agent consists of three main components:
- Router: The central planner that decides which skill or function to call. This could be an LLM, a classifier, or rule-based code.
- Skills/Tools: The capabilities the agent possesses. A skill is a chain of logic that completes a task and can involve LLM calls, API calls, or similar steps. A common example is a RAG (Retrieval-Augmented Generation) skill.
- Memory and State: A shared state accessible by all components, used to store context, configuration, and execution history.
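To make the three components concrete, here is a minimal sketch, assuming a rule-based router and two stubbed skills. The names `AgentState`, `lookup_order`, `answer_from_docs`, `route`, and `run_agent` are hypothetical and chosen only for illustration; a real agent would likely use an LLM or a classifier as the router.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AgentState:
    """Shared memory: context plus a history of executed steps."""
    context: Dict[str, str] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)


def lookup_order(query: str, state: AgentState) -> str:
    # Hypothetical skill: a real one might call an API or a database.
    return "Order 123 has shipped."


def answer_from_docs(query: str, state: AgentState) -> str:
    # Hypothetical RAG-style skill: retrieval and generation are stubbed out.
    return "According to the docs, returns are accepted within 30 days."


SKILLS: Dict[str, Callable[[str, AgentState], str]] = {
    "lookup_order": lookup_order,
    "answer_from_docs": answer_from_docs,
}


def route(query: str) -> str:
    # Rule-based router; an LLM or a classifier could fill the same role.
    return "lookup_order" if "order" in query.lower() else "answer_from_docs"


def run_agent(query: str, state: AgentState) -> str:
    skill_name = route(query)
    state.history.append(skill_name)  # record the step for later evaluation
    return SKILLS[skill_name](query, state)
```

Keeping the router, the skills, and the state separate is what makes it possible to evaluate each part on its own later.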

Observability: Understanding Agent Behavior
To evaluate an agent, you first need to understand what it’s doing internally. Observability is often achieved using standards like OpenTelemetry (OTel), which records the agent’s execution path as traces and spans.
- Traces: Represent one end-to-end run of the agent.
- Spans: Represent data captured on individual steps within a trace (e.g., a single tool call or LLM call).
Tools like Arize Phoenix can help automate the instrumentation of these traces, providing a detailed log for debugging and performance evaluation.
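The relationship between traces and spans can be illustrated with a small standard-library sketch. This is not the OpenTelemetry or Phoenix API; the `Trace` and `Span` classes below are hypothetical stand-ins that only show the shape of the data being collected.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class Span:
    """Data captured for one step, such as a tool call or an LLM call."""
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    duration_s: float = 0.0


@dataclass
class Trace:
    """One end-to-end run of the agent, made up of spans."""
    spans: List[Span] = field(default_factory=list)

    @contextmanager
    def span(self, name: str, **attributes: str) -> Iterator[Span]:
        current = Span(name=name, attributes=dict(attributes))
        start = time.perf_counter()
        try:
            yield current
        finally:
            # Record timing and store the span once the step finishes.
            current.duration_s = time.perf_counter() - start
            self.spans.append(current)


trace = Trace()
with trace.span("router", query="Where is my order?") as s:
    s.attributes["chosen_tool"] = "lookup_order"
with trace.span("tool_call", tool="lookup_order"):
    pass  # the tool would run here

print([span.name for span in trace.spans])
```

In practice an instrumentation library would create these records for you; the point here is only that each span carries a name, attributes, and timing for a single step.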
Evaluation Techniques
There are three primary methods for evaluating agent components:
- Code-Based Evaluators: Similar to traditional software testing. Use code (e.g., regex matching, JSON parsing, checking against a known correct output) to validate the agent’s output. Best for deterministic, inflexible outputs.
- LLM-as-a-Judge: Use a powerful LLM to judge a specific dimension of your agent’s output (e.g., relevance, correctness). This is flexible but will not be 100% accurate. It’s better to use discrete classification labels (e.g., ‘correct’/‘incorrect’) rather than continuous scores.
- Human Annotation: Have humans label traces with feedback (e.g., thumbs up/down). This is high-quality but labor-intensive and hard to scale.
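As an illustration of the code-based approach, here is a sketch of three small evaluators covering the checks mentioned above: JSON parsing, regex matching, and comparison against a known correct output. The function names and the `ORD-` order ID format are assumptions made up for this example.

```python
import json
import re
from typing import Set


def is_valid_json_with_keys(output: str, required_keys: Set[str]) -> bool:
    """Check that the output parses as a JSON object with the required keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed.keys())


def matches_order_id(output: str) -> bool:
    """Check that the output mentions an ID like ORD-123456 (hypothetical format)."""
    return re.search(r"\bORD-\d{6}\b", output) is not None


def exact_match(output: str, expected: str) -> bool:
    """Compare against a known correct output, ignoring case and outer whitespace."""
    return output.strip().lower() == expected.strip().lower()
```

Checks like these are cheap and repeatable, which is why they suit outputs that have only one acceptable form.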

What to Evaluate in an Agent
Evaluation should target the different parts of the agent’s process.
1. Router Evaluation
- Function Calling Choice: Did the router select the correct skill/tool for the user’s query?
- Parameter Extraction: Did it correctly extract the necessary parameters from the user’s input for the chosen function?
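Both router checks can be written as code-based evaluators when a ground-truth tool call is available. The sketch below assumes each router decision is a dict with `tool` and `args` keys; that shape, and the function names, are assumptions for illustration rather than a fixed format.

```python
from typing import Any, Dict, List


def evaluate_router(predicted: Dict[str, Any], expected: Dict[str, Any]) -> Dict[str, bool]:
    """Score one router decision against its ground truth."""
    tool_correct = predicted["tool"] == expected["tool"]
    # Parameters only count as correct when the right tool was chosen.
    params_correct = tool_correct and predicted.get("args", {}) == expected.get("args", {})
    return {"tool_correct": tool_correct, "params_correct": params_correct}


def accuracy(results: List[Dict[str, bool]], key: str) -> float:
    """Fraction of cases where the given check passed."""
    return sum(r[key] for r in results) / len(results) if results else 0.0


expected_call = {"tool": "lookup_order", "args": {"order_id": "123"}}
predictions = [
    {"tool": "lookup_order", "args": {"order_id": "123"}},  # right tool, right args
    {"tool": "lookup_order", "args": {"order_id": "999"}},  # right tool, wrong args
    {"tool": "answer_from_docs", "args": {}},               # wrong tool
]
results = [evaluate_router(p, expected_call) for p in predictions]
print(accuracy(results, "tool_correct"), accuracy(results, "params_correct"))
```

Reporting the two numbers separately helps tell apart a router that picks the wrong tool from one that picks the right tool but fills it in badly.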

2. Skill Evaluation
The evaluation method depends on the type of skill:
- Deterministic Skills: Use code-based evaluations (e.g., does the output parse correctly?).
- Non-Deterministic Skills (LLM-based): Use an LLM-as-a-judge to evaluate aspects like relevance, hallucination, correctness, or readability.
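An LLM-as-a-judge evaluator for a non-deterministic skill might look like the sketch below. The prompt wording and the `call_llm` parameter are assumptions: `call_llm` stands for whatever function sends a prompt to your model and returns its text, and `fake_llm` is a stub so the example runs without a model.

```python
from typing import Callable

JUDGE_PROMPT = """You are evaluating an AI agent's answer.
Question: {question}
Answer: {answer}
Reply with exactly one word: correct or incorrect."""

ALLOWED_LABELS = {"correct", "incorrect"}


def judge_answer(question: str, answer: str, call_llm: Callable[[str], str]) -> str:
    """Ask the judge model for a discrete label and normalise its reply."""
    prompt = JUDGE_PROMPT.format(question=question, answer=answer)
    raw = call_llm(prompt)
    label = raw.strip().lower().rstrip(".")
    # Anything outside the allowed labels is flagged instead of being guessed at.
    return label if label in ALLOWED_LABELS else "unparseable"


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call.
    return "Correct."


print(judge_answer("What is 2 + 2?", "4", fake_llm))
```

Restricting the judge to a small set of labels makes its output easy to parse and to aggregate across a dataset.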

3. Agent Trajectory Evaluation
The agent trajectory is the sequence of steps the agent took to handle a query. Evaluating it shows how efficiently the agent reached its result.
Convergence measures how closely the agent follows the optimal path for a given query. The goal is to reduce unnecessary steps, which lowers cost, latency, and variability.
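One way to turn convergence into a number is to average the ratio of the optimal path length to the actual path length over several runs of the same query. This is a sketch of one possible formulation, not necessarily the exact definition used by any particular tool. It assumes that, when no optimal length is supplied, the shortest observed run is treated as optimal.

```python
from typing import List, Optional


def convergence_score(path_lengths: List[int], optimal_length: Optional[int] = None) -> float:
    """Average of optimal/actual step counts; 1.0 means every run was optimal."""
    if not path_lengths or any(n <= 0 for n in path_lengths):
        raise ValueError("path_lengths must contain positive step counts")
    # Assumption: fall back to the shortest observed run as the optimal path.
    optimal = optimal_length if optimal_length is not None else min(path_lengths)
    ratios = [min(1.0, optimal / n) for n in path_lengths]
    return sum(ratios) / len(ratios)


# Three runs of the same query taking 3, 3, and 6 steps.
print(convergence_score([3, 3, 6]))
```

A score near 1.0 suggests the agent takes the shortest known path consistently; lower scores point to detours worth inspecting in the traces.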
Evaluation-Driven Development
This is an iterative process where evaluation guides the development and improvement of your agent.
The core loop is:
- Curate a Dataset: Collect a comprehensive set of test cases, each with an input and, where possible, an expected output.
- Track Experiments: Run the dataset through different versions of your agent (e.g., with a new prompt, model, or tool) and record the results as an “experiment”.
- Evaluate: Run your evaluators on the experiment’s results.
- Compare: Compare the evaluation scores across different versions to determine which changes led to improvements.
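The loop above can be sketched in a few lines. Everything here is hypothetical: the toy dataset, the two agent versions, and the exact-match evaluator are placeholders for your own test cases, agent builds, and evaluators.

```python
from typing import Callable, Dict, List

# Curate a dataset: inputs paired with expected outputs (toy examples).
dataset: List[Dict[str, str]] = [
    {"input": "2+2", "expected": "4"},
    {"input": "3+3", "expected": "6"},
]


def agent_v1(query: str) -> str:
    # Hypothetical first version: always gives the same answer.
    return "4"


def agent_v2(query: str) -> str:
    # Hypothetical improved version: actually adds the numbers.
    return str(sum(int(part) for part in query.split("+")))


def exact(output: str, expected: str) -> bool:
    return output.strip() == expected


def run_experiment(
    agent: Callable[[str], str],
    cases: List[Dict[str, str]],
    evaluator: Callable[[str, str], bool],
) -> float:
    """Run every test case through the agent and return the pass rate."""
    scores = [evaluator(agent(case["input"]), case["expected"]) for case in cases]
    return sum(scores) / len(scores)


# Track experiments, evaluate, and compare versions.
versions = {"v1": agent_v1, "v2": agent_v2}
scores = {name: run_experiment(agent, dataset, exact) for name, agent in versions.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Keeping the dataset and evaluator fixed while only the agent changes is what makes the scores comparable across versions.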

This process creates a cycle of feedback, where insights from testing and even production can be used to refine the agent and the test dataset itself.
