Agentic AI - DeepLearning.AI course summary
I recently took a course on agentic AI, taught by Andrew Ng at DeepLearning.AI. This blog post is a summary of what I learnt.
An agentic AI workflow is a process whereby an LLM-based application executes multiple steps to complete a task.
Examples:
- Providing customer support
- Doing research
- Processing legal documents
- Writing an essay

Some agents are more autonomous than others
At one end of the spectrum you have less autonomous agents, which:
- follow a pre-determined sequence of steps
- use tools in a hard-coded way
- only have autonomy in text generation.
Example: simple invoice processing:
- Identify required fields in invoice
- Record data in database
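The invoice example can be sketched as a minimal, low-autonomy pipeline. All names are illustrative, and `extract_fields` is a hypothetical stand-in for a single LLM extraction call:

```python
# A low-autonomy workflow sketch: the sequence of steps is hard-coded,
# and the model (stubbed here) only has autonomy in text generation.

def extract_fields(invoice_text: str) -> dict:
    # In a real system this would be one LLM call, e.g. "Extract vendor,
    # date and total from this invoice as JSON". Stubbed with naive
    # parsing so the sketch is runnable.
    fields = {}
    for line in invoice_text.splitlines():
        key, _, value = line.partition(":")
        if value:
            fields[key.strip().lower()] = value.strip()
    return fields

def record_in_database(db: list, fields: dict) -> None:
    db.append(fields)  # stand-in for a real database INSERT

db = []
invoice = "Vendor: Acme Ltd\nDate: 2024-05-01\nTotal: 120.00"
record_in_database(db, extract_fields(invoice))
```

The step order and tool use are fixed in code; only the extraction step involves model output.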
At the other end of the spectrum, you have more autonomous agents, which:
- work out their own sequence of steps in response to the input, and make many decisions autonomously
- can create new tools on the fly
Example: customer service agent:
- Given a wide range of possible user input, plans out the steps to respond (as the required steps are not known in advance), e.g. for:
- Stock enquiries (including writing code to answer the enquiry)
- Order return queries
Benefits of agentic AI
- Much better performance than simply making a single LLM call. In DeepLearning.AI's benchmarking, implementing an agentic approach to a coding task led to a bigger quality improvement than the generational shift from GPT-3.5 to GPT-4.

- Parallelism (e.g. the ability to carry out multiple threads of web research and fetching at the same time, rather than having a single thread of thinking/execution)
- Modularity - allowing you to combine different components, and switch out different parts to get the best results
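The parallelism benefit can be sketched with a thread pool: independent research queries run concurrently instead of in a single thread of thinking/execution. `fetch` is a hypothetical stand-in for a real web-search or page-fetch tool:

```python
# Parallelism sketch: several independent research queries in flight at once.
from concurrent.futures import ThreadPoolExecutor

def fetch(query: str) -> str:
    return f"results for {query!r}"  # placeholder for a network call

queries = ["agentic workflows", "LLM evals", "tool use"]
with ThreadPoolExecutor() as pool:
    # pool.map preserves input order while running the calls concurrently
    results = list(pool.map(fetch, queries))
```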
Some tasks are easier or harder to implement using AI agents:
| Easier | Harder |
|---|---|
| Clear, step-by-step process | Steps not known ahead of time |
| Standard procedures to follow | Plan / solve as you go |
| Text assets only | Multimodal (e.g. sound/vision) |
Task decomposition
Task decomposition is breaking a workflow down into steps and working out which building block (see below) to use for each step.
This can also help improve the quality of the results. For example, instead of just saying 'write an essay' with some input, you could say 'write a first draft', then 'consider which parts need revision', then 'revise your draft'. (And as you carry out evaluation, you can identify improvements to your workflow, which can include further task decomposition.)
Building blocks for creating workflows:
| Building block | Examples | Use cases |
|---|---|---|
| Models (shown as grey boxes in the diagrams) | LLMs | Text generation, tool use, information extraction |
| | Other AI models | PDF-to-text, text-to-speech, image analysis |
| Tools (shown as green boxes in the diagrams) | APIs | Web search, get real-time data, send email, check calendar |
| | Information retrieval | Databases, retrieval augmented generation (RAG) |
| | Code execution | Basic calculator, data analysis |
| Human input (shown as red boxes in the diagrams) | User query or input document | Human input or review |
Evaluating agentic AI (evals):
A key determinant of success is how well you use evals to drive improvements. The basic loop is:
- Build the workflow (Andrew suggests getting a basic workflow up fast)
- Examine the output to see patterns of where it goes wrong
- Work out how to address recurring errors
- Add an eval to track how often they occur. (You can improve the evals over time as you spot more issues - and/or enhance the data set of tests)
There are end-to-end evals, component-level evals (another benefit of these systems being modular), and trace analysis.
Objective evals
- Create a set of prompts and ground-truth answers
- Compare the quality before and after making changes
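An objective eval can be sketched as a small set of prompts with ground-truth answers, scored by exact match. `run_workflow` is a hypothetical stub for your agentic workflow:

```python
# Objective-eval sketch: score the workflow against ground truth, and
# re-run the same eval after each change to compare quality.
eval_set = [
    {"prompt": "2 + 2", "answer": "4"},
    {"prompt": "capital of France", "answer": "Paris"},
]

def run_workflow(prompt: str) -> str:
    # Stub: a real implementation would run the full agentic workflow.
    return {"2 + 2": "4", "capital of France": "Lyon"}[prompt]

def accuracy(run, cases) -> float:
    hits = sum(run(c["prompt"]).strip() == c["answer"] for c in cases)
    return hits / len(cases)

score = accuracy(run_workflow, eval_set)
```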

Subjective evals
Use an LLM as judge. Grading with a specific rubric helps here, e.g. asking it to return a score of 0 or 1 on a range of factors, such as 'has a clear title', 'axis labels present', 'appropriate chart type', 'axes use appropriate numerical range'.
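The rubric approach can be sketched as follows. `judge_llm` is a hypothetical stub; in practice it would be an LLM call asked to return JSON:

```python
# LLM-as-judge sketch: the judge scores 0 or 1 per rubric factor,
# and the factor scores are summed into an overall grade.
import json

RUBRIC = ["has a clear title", "axis labels present",
          "appropriate chart type", "axes use appropriate numerical range"]

def judge_llm(prompt: str) -> str:
    # Stub: a real judge would be an LLM returning JSON scores.
    return json.dumps({factor: 1 for factor in RUBRIC})

def grade(output: str) -> int:
    prompt = (f"Score this chart 0 or 1 on each factor, as JSON: {RUBRIC}\n\n"
              f"{output}")
    scores = json.loads(judge_llm(prompt))
    return sum(scores[factor] for factor in RUBRIC)

total = grade("<chart description>")
```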
I took a couple of courses on this topic: evaluating generative AI output and evaluating AI agents
How to work out where to focus your efforts improving an agentic system
Take a disciplined error analysis approach. The evals above are end-to-end, so you may need to analyse specific modular parts of your system to understand what is causing a disappointing end-to-end result (e.g. a problem with the research agent's output could be caused by things going wrong in a number of places).

So analyse the spans - the outputs of each step - to understand in more detail what is going on at each stage. (A trace is the end-to-end record; a span is one step within it.)
Analyse the traces where the output of your system is unsatisfactory. This will help you spot how to improve it.
Record what is going wrong across all your subpar examples, which will allow you to count up which failure types are most common.

This can help you work out where to spend most of your effort in making improvements.
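Counting failure categories can be as simple as the sketch below; the labels are illustrative:

```python
# Error-analysis sketch: label each subpar trace with the component that
# failed, then count the categories to see where to focus effort.
from collections import Counter

failure_labels = ["web search", "summariser", "web search",
                  "planner", "web search"]
counts = Counter(failure_labels)
most_common = counts.most_common(1)[0]  # the category to tackle first
```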
In addition to end-to-end evals, you can set up component-level evals too. These are cheaper, quicker, and can more specifically track the performance of a module in your bigger workflow.
How to improve LLM components:
- Improve your prompts:
  - Add more explicit instructions
  - Add one or more concrete examples to the prompt (few-shot prompting)
- Try a new model:
  - Try multiple LLMs and use evals to pick the best
- Split up the step into smaller steps
- Fine-tune a model, if the other methods don’t work (more complex and expensive, though)
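The few-shot prompting idea mentioned above can be sketched like this; the task, examples, and format are all illustrative:

```python
# Few-shot prompting sketch: concrete worked examples in the prompt
# show the model the expected input -> output format.
FEW_SHOT = """Extract the total as a number.

Invoice: "Total due: $120.00" -> 120.00
Invoice: "Amount payable: EUR 45" -> 45

Invoice: "{invoice}" ->"""

prompt = FEW_SHOT.format(invoice="Balance: $9.99")
# `prompt` would then be sent to the LLM component being improved
```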
How to improve non-LLM components:
- Tune hyperparameters of that component - e.g. number of results, date range, similarity threshold, chunk size
- Replace the component
Latency and cost optimisation
Get the output quality high first, then worry about these problems.
Latency:
- Time the steps
- Spot the biggest areas of slowness
- Spot opportunities for parallelism
- Try smaller/less intelligent models OR a faster LLM provider
Cost:
- Measure the tokens and call costs associated with each step
- Then reduce tokens and/or calls
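Both latency and cost measurement can be sketched with simple per-step instrumentation. The token count here is a crude word count; real systems read usage data from the provider's API response:

```python
# Instrumentation sketch: time each step and tally an approximate token
# count, so the slowest and most expensive steps stand out.
import time

metrics = {}

def instrument(name, fn, *args):
    start = time.perf_counter()
    out = fn(*args)
    metrics[name] = {
        "seconds": time.perf_counter() - start,  # latency of this step
        "tokens": len(str(out).split()),         # crude cost proxy
    }
    return out

# `fetch_notes` stands in for any step of the workflow
result = instrument("research", lambda q: f"notes about {q}", "agents")
```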
Development process for an agentic product
- Loop between build and analysis
- Basic end-to-end build -> trace analysis -> improve individual components manually -> build evals and a small dataset to track end-to-end performance -> make improvements based on these insights -> make analysis more disciplined with component-level analysis -> drive work on specific individual components -> do analysis to make components more efficient
Agentic design patterns:
- Reflection: getting the agent (or a separate critic agent) to reflect on its own output:
  - Get the LLM to come up with a way to approach a task
  - Then ask it to review the approach for correctness, style and efficiency, and give constructive criticism on how to improve it. (You can also use a different model from the one that did the initial creation - reasoning models are particularly good at reviewing)
  - Get the agent to carry out the improvement
Advice for reflection prompts:
- Clearly indicate the reflection action
- Specify the criteria to check

You can compare performance of your workflow before and after implementing reflection.
- Feeding in information from tools (e.g. output from unit tests or code errors) can also improve reflection
- Reflection is a consistent way to improve the quality of what is produced
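The reflection steps above can be sketched as a draft-critique-improve loop. `call_llm` is a hypothetical stub; the critic call could go to a different (e.g. reasoning) model than the drafter:

```python
# Reflection sketch: draft, critique against named criteria, then improve.
def call_llm(prompt: str) -> str:
    # Stub: echoes the instruction so the sketch is runnable.
    return f"<response to: {prompt.splitlines()[0]}>"

task = "Write a function that deduplicates a list."
draft = call_llm(task)
critique = call_llm("Review this for correctness, style and efficiency, "
                    f"and give constructive criticism:\n{draft}")
final = call_llm("Improve the draft using the critique.\n"
                 f"Draft: {draft}\nCritique: {critique}")
```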

- Tool use
Tools are functions that the LLM can request to be executed - e.g. returning the current datetime as a string, making an API call, or running a database query.

The course uses the aisuite library to abstract the LLM provider choice away from your code. You provide the function(s) to the LLM, and as long as each function has a docstring explaining what it does, aisuite passes that description through so that the LLM knows when to call it.
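A sketch of this pattern is below. The model name is illustrative, the aisuite call is shown commented out because a real run needs provider API keys, and aisuite's tool-calling parameters may differ between versions:

```python
# Tool-use sketch: the docstring is what the LLM sees when deciding
# whether to call this function.
from datetime import datetime

def get_current_datetime() -> str:
    """Return the current date and time as an ISO-8601 string."""
    return datetime.now().isoformat()

# Hedged usage sketch (requires provider API keys; parameter names are
# based on aisuite's documented tool-calling pattern):
# import aisuite as ai
# client = ai.Client()
# response = client.chat.completions.create(
#     model="openai:gpt-4o",
#     messages=[{"role": "user", "content": "What time is it?"}],
#     tools=[get_current_datetime],  # docstring describes the tool
#     max_turns=2,  # let aisuite execute the tool and return a final answer
# )
```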

Example tools:
- Analysis
  - Code execution (use a sandbox environment, e.g. Docker or E2B (lightweight), to help protect against catastrophic errors)
  - Wolfram Alpha
  - Bearly Code Interpreter
- Information gathering
  - Web search
  - Wikipedia
  - Database access
- Productivity
  - Calendar
  - Messaging
- Images
  - Image generation
  - Image captioning
  - OCR
- Planning - working out the steps and sequence needed to achieve a goal. Harder and more experimental, but can give impressive performance.

Telling the agent to write the downstream plan as JSON is useful, as it structures the plan in a clear way.
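For example, a JSON plan an agent might emit could look like the sketch below; the schema and tool names are illustrative, not the course's exact format:

```python
# JSON-plan sketch: each step names a tool and its arguments, so
# downstream code can execute the steps in order.
import json

plan_text = """
[
  {"step": 1, "tool": "web_search", "args": {"query": "flight prices NYC"}},
  {"step": 2, "tool": "summarise", "args": {"source": "step_1_output"}}
]
"""
plan = json.loads(plan_text)
for step in plan:
    # A real executor would dispatch step["tool"] with step["args"] here
    print(step["step"], step["tool"])
```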

An alternative to executing the plan's steps through successive LLM calls is to get the LLM to produce code that, when executed, carries out the plan. This can be more effective than just coming up with a JSON plan.
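A sketch of the code-as-plan idea is below. `llm_generated` is a hypothetical model output; as noted earlier, real systems should run such code in a sandbox (e.g. Docker or E2B), not with a bare `exec`:

```python
# Code-as-plan sketch: the LLM returns Python that calls the available
# tools directly, instead of a JSON plan interpreted step by step.
def web_search(query):
    return f"results for {query}"  # stand-in tool

# Hypothetical model output; in production, sandbox its execution.
llm_generated = "summary = web_search('flight prices NYC').upper()"

namespace = {"web_search": web_search}  # only expose approved tools
exec(llm_generated, namespace)
summary = namespace["summary"]
```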

This area is quite cutting-edge.
- Multi-agent collaboration (can be harder to control, but can result in better outcomes for complex tasks)

Common inter-agent communication approaches are to have a sequence of agents, or to have a managing agent calling the others as needed. You can also have deeper hierarchies. There are frameworks for setting up multi-agent systems, e.g. ChatDev.
- Another example: a researcher, a marketer and an editor collaborating on a wider marketing workflow.
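The sequence-of-agents approach can be sketched as role-prompted LLM calls chained together. `call_llm` is a hypothetical stub, and the roles follow the researcher/marketer/editor example above:

```python
# Multi-agent sketch: each "agent" is a role-prompted call, and each
# agent's output becomes the next agent's input.
def call_llm(role: str, prompt: str) -> str:
    # Stub: a real call would use a role-specific system prompt.
    return f"[{role}] {prompt[:20]}"

brief = "Launch copy for a new running shoe"
research = call_llm("researcher", brief)     # gather background
copy = call_llm("marketer", research)        # draft the marketing copy
final = call_llm("editor", copy)             # polish the result
```

A managing-agent variant would instead have one agent decide which of these to call, and in what order.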