Building a multi-agent AI Objectives and Key Results coach
I recently took a course on agentic AI, and built an AI agent to test out some of the new things that I learnt.

I wanted to create a simple agent with non-deterministic routing and use of sub-agents:
-
Non-deterministic routing means that instead of a traditional piece of software, where the flow of steps taken is based on logical rules set in advance, the flow of execution would be based on decisions made by the agent itself.
-
Use of sub-agents means that there would be one overarching agent, addressing the overall task, but it would be able to call on other, specialist, agents to carry out specific tasks. Think of the overarching agent as a manager and the sub-agents as specialists working to carry out tasks to pursue the manager’s goal. Combined with non-deterministic routing, this meant that I’d call the overarching agent with a task to carry out, and it would work out which agents to use. This would allow me to focus those agents, and the prompts that shape their action, on being good at their narrow, specific tasks.
Generative AI is now at the stage where we have access to pretty good basic cognition. In a huge number of scenarios, this is more than enough to do something very useful. I picked a simple task for my agent: I wanted it to review draft Objectives and Key Results for digital product teams. I’m following the approach outlined by Christina Wodtke in her book Radical Focus. Objectives and Key Results are a technique for setting goals and working out how you will measure whether you’re achieving them.
There are some simple best practices for Objectives and Key Results outlined by Wodtke:
- Objectives should be inspiring, qualitative and timebound.
- Key Results should be specific, measurable and ambitious. These simple rules, and the fact that Objectives and Key Results are short pieces of text, makes them an easy candidate for this type of application. I wanted the agent to assess draft Objectives and Key results against these rules, to highlight problems and suggest improvements.
I worked with Gemini CLI (using Gemini 3 Pro Preview) to build the agent, and wanted a very simple setup. I didn’t use any frameworks like LangChain or the aistudio library - just some python invoking Gemini CLI instances for each agent. (Usually you’d just make an API call to a model provider, but invoking Gemini CLI means you can try this out without having to pay. The simplicity of this agent presumably makes it a good candidate for running locally, but my computer isn’t powerful enough for this to be a viable option)
My plan for the architecture was that an overarching ‘coordinator’ agent would manage the end-to-end review process, deciding which sub-agents to call on as necessary. I ended up with the following sub-agents:
- Objective reviewer
- Objective improver
- Key result reviewer
- Key result improver

An alternative coordinating approach would have been to have the sub-agents powered by LLMs, but the overall coordination driven by deterministic logic in python - e.g.
if Objective reviewer agent suggests need for improvement:
call the Objective improver agent
else:
tell the user that the Objective is good
As simple code running locally, this would have been much faster than a coordinator agent running through an LLM call, with the associated latency - but my goal here was for the agent to be flexible, and to test this new coordination model, so I was happy with this tradeoff. This flexibility allows me to pass the coordinating agent a wider variety of inputs: an Objective, a Key Result, or an Objective and associated Key Results, without having to rigidly work out how to parse this input in advance from a string of user input.
I was using Gemini CLI with the Gemini 3 Pro Preview model to do the programming, which mostly worked well. I had make some changes to get the Objective reviewer set up properly (Gemini initially specified that the Objectives need to be SMART, but that’s not actually how they work in the Objectives and Key Results framework) but this was easy.




I’ve learnt a lot about evaluating AI agents in the last 6 months (see these notes on how to evaluate generative AI output, and these notes on how to evaluate AI agents so I wanted to implement some evaluation here too.
I started with the simple stuff - deterministic tests like you’d have in traditional software. We set up Unit Tests, mocking the query_gemini function. These were useful for testing the Python logic (e.g., “Does the parser handle None values correctly?”), but they didn’t test the intelligence.
To test the intelligence, I wanted to use the ‘LLM as a judge’ approach, to allow for non-deterministic evaluation of non-deterministic responses.
There was an interesting nuance whereby the LLM as judge was initially setup incorrectly, so it was judging good output to be bad (e.g. considering an Objective without a number to be vague, even though the Objectives and Key Results framework specifies qualitative Objective and quantitative Key Results). So we refined the judge’s criteria prompt to address this.
Another failure occurred when the Judge simply politely acknowledged the prompt (“Okay, I will judge.”) instead of giving a verdict. We had to make the prompt imperative: “JUDGE IMMEDIATELY. Do not converse.”
This shows that if you’re using an LLM-a-judge evaluation approach, you’ll probably need to iterate your evaluation prompts just as much as your actual agent’s prompts.
If you’d like to try out the okr coach, check out the GitHub repository. You’ll need Gemini CLI and Python 3 setup locally.
