Agentic AI - DeepLearning.AI course summary

I recently took a course on agentic AI, taught by Andrew Ng at DeepLearning.AI. This blog post is a summary of what I learnt.

An agentic AI workflow is a process whereby an LLM-based application executes multiple steps to complete a task.

Examples:

Providing customer support
Doing research
Processing legal documents
Writing an essay

Some agents are more autonomous than others

At one end of the spectrum you have less autonomous agents, which:

follow a pre-determined sequence of steps
use tools in a hard-coded way
only have autonomy in text generation.

Example: simple invoice processing:

Identify required fields in invoice
Record data in database
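A minimal sketch of this kind of fixed pipeline, in Python. The function names, the field list and the in-memory "database" are all hypothetical, and extract_fields uses a regex as a stand-in for the LLM call so that the example runs on its own.

```python
import re

REQUIRED_FIELDS = ["invoice_number", "total"]  # hypothetical field list


def extract_fields(invoice_text: str) -> dict:
    """Stand-in for the LLM step: pull the required fields out of the text."""
    patterns = {
        "invoice_number": r"Invoice\s*#\s*(\w+)",
        "total": r"Total:\s*([\d.]+)",
    }
    fields = {}
    for name in REQUIRED_FIELDS:
        match = re.search(patterns[name], invoice_text)
        fields[name] = match.group(1) if match else None
    return fields


def record_in_database(database: list, fields: dict) -> None:
    """Hard-coded tool use: always called, always in the same way."""
    database.append(fields)


def process_invoice(invoice_text: str, database: list) -> dict:
    # The sequence of steps is fixed in code; nothing is decided at run time.
    fields = extract_fields(invoice_text)
    record_in_database(database, fields)
    return fields


database = []
result = process_invoice("Invoice #A17\nTotal: 120.50", database)
print(result)
```

The only place a model has any freedom here is inside the extraction step, which matches the "autonomy in text generation only" description above.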

At the other end of the spectrum, you have more autonomous agents, which:

work out their own sequence of steps in response to the input, and make many decisions autonomously
can create new tools on the fly

Example: customer service agent:

Given a wide range of possible user input, plans out the steps to respond (as the required steps are not known in advance), e.g. for:
- Stock enquiries (including writing code to answer the enquiry)
- Order return queries

Benefits of agentic AI

Much better performance than simply making a single LLM call. In DeepLearning.AI’s benchmarking, they found that implementing an agentic approach to a coding task led to a larger improvement in quality than the generational shift between GPT-3.5 and GPT-4.
Parallelism (e.g. the ability to carry out multiple threads of web research and fetching at the same time, rather than having a single thread of thinking/execution)
Modularity - allowing you to combine different components, and switch out different parts to get the best results
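To illustrate the parallelism point, here is a small sketch using Python's standard concurrent.futures module. The fetch_page function and the URLs are made up for illustration; it sleeps instead of touching the network.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_page(url: str) -> str:
    """Stand-in for a slow web fetch; sleeps instead of using the network."""
    time.sleep(0.2)
    return f"contents of {url}"


urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(urls)) as pool:
    # map() preserves input order, so pages[i] corresponds to urls[i].
    pages = list(pool.map(fetch_page, urls))
elapsed = time.perf_counter() - start

print(pages)
print(f"fetched {len(pages)} pages in {elapsed:.2f}s")
```

Because the three fetches overlap, the total time should be close to one fetch rather than the sum of all three.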

Some tasks are easier or harder to implement using AI agents:

Easier	Harder
Clear, step-by-step process	Steps not known ahead of time
Standard procedures to follow	Plan / solve as you go
Text assets only	Multimodal (e.g. sound/vision)

Task decomposition

Task decomposition is breaking down a workflow into steps and working out which building block to use for each step (see below).

This can also help improve the quality of the results. For example, instead of just saying ‘write an essay’ with some input, you could say ‘write a first draft’, then ‘consider which parts need revision’, then ‘revise your draft’. (As you carry out evaluation, you can identify improvements to your workflow, which can include further task decomposition.)
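The essay example could be sketched as a chain of three calls. Here call_llm is a hypothetical stand-in that just echoes the first line of the prompt, so the code runs without a model; in practice you would swap in a real LLM client.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; echoes the step it was given."""
    return f"[model output for: {prompt.splitlines()[0]}]"


def write_essay(topic: str) -> str:
    # Step 1: first draft.
    draft = call_llm(f"Write a first draft of an essay on {topic}.")
    # Step 2: critique of the draft.
    critique = call_llm(
        "Consider which parts of this draft need revision.\n" + draft
    )
    # Step 3: revision, given both the draft and the critique.
    final = call_llm(
        "Revise the draft using the critique.\n"
        f"Draft:\n{draft}\nCritique:\n{critique}"
    )
    return final


essay = write_essay("agentic AI")
print(essay)
```

Each step is a separate call, so each one can be inspected, evaluated and improved on its own.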

Building blocks for creating workflows:

Building block	Examples	Use cases
Models (shown as grey boxes in the diagrams)	LLMs	Text generation, tool use, information extraction
	Other AI models	PDF-to-text, text-to-speech, image analysis
Tools (shown as green boxes in the diagrams)	API	Web search, get real-time data, send email, check calendar
	Information retrieval	Databases, retrieval augmented generation (RAG)
	Code execution	Basic calculator, data analysis
Human input (shown as red boxes in the diagrams)	User query, or input document	Human input or review

Evaluating agentic AI (evals):

A key determinant of success is how well you use evals to drive improvements. The basic loop is:

Build the workflow (Andrew suggests getting a basic workflow up fast)
Examine the output to see patterns of where it goes wrong
Work out how to address recurring errors
Add an eval to track how often they occur. (You can improve the evals over time as you spot more issues - and/or enhance the data set of tests)

There are end-to-end evals, component-level evals (another benefit of these systems being modular), and trace analysis.

Objective evals

Create a set of prompts and ground truth answers
Compare the quality before and after making changes
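A sketch of what an objective eval might look like. The two workflow functions and the eval set are invented for illustration; the point is only the shape of the comparison.

```python
def workflow_v1(question: str) -> str:
    """Hypothetical 'before' version of the workflow."""
    return {"2+2": "4", "capital of France": "Lyon"}.get(question, "")


def workflow_v2(question: str) -> str:
    """Hypothetical 'after' version, with one error fixed."""
    return {"2+2": "4", "capital of France": "Paris"}.get(question, "")


# Prompts paired with ground-truth answers (made-up examples).
EVAL_SET = [
    ("2+2", "4"),
    ("capital of France", "Paris"),
    ("largest planet", "Jupiter"),
]


def accuracy(workflow, eval_set) -> float:
    """Fraction of prompts whose answer matches the ground truth exactly."""
    correct = sum(
        workflow(prompt).strip().lower() == truth.lower()
        for prompt, truth in eval_set
    )
    return correct / len(eval_set)


before = accuracy(workflow_v1, EVAL_SET)
after = accuracy(workflow_v2, EVAL_SET)
print(f"before: {before:.2f}, after: {after:.2f}")
```

Exact string matching is the simplest possible scorer; it only works when the expected answer is short and unambiguous.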

Subjective evals

Use an LLM as judge.
Grading with a specific rubric helps here, e.g. asking the judge to return a score of 0 or 1 on each of a range of factors, such as ‘has a clear title’, ‘axis labels present’, ‘appropriate chart type’, ‘axes use appropriate numerical range’.
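A rough sketch of rubric-based grading, using the chart criteria above. The judge function is a hypothetical stand-in that returns a canned JSON reply; a real version would send the prompt to an LLM and would need to handle malformed replies.

```python
import json

RUBRIC = [
    "has a clear title",
    "axis labels present",
    "appropriate chart type",
    "axes use appropriate numerical range",
]


def build_judge_prompt(chart_description: str) -> str:
    criteria = "\n".join(f"- {item}" for item in RUBRIC)
    return (
        "Score the chart below on each criterion with 0 or 1.\n"
        "Reply with a JSON object mapping each criterion to its score.\n"
        f"Criteria:\n{criteria}\n\nChart:\n{chart_description}"
    )


def judge(prompt: str) -> str:
    """Hypothetical stand-in for the judge LLM; returns a canned JSON reply."""
    scores = {criterion: 1 for criterion in RUBRIC}
    scores["axes use appropriate numerical range"] = 0
    return json.dumps(scores)


def score_chart(chart_description: str) -> int:
    reply = json.loads(judge(build_judge_prompt(chart_description)))
    # Count only criteria from the rubric that were scored exactly 1.
    return sum(1 for item in RUBRIC if reply.get(item) == 1)


total = score_chart("Bar chart of monthly sales, titled 'Sales 2024'")
print(f"{total}/{len(RUBRIC)}")
```

Binary scores per criterion tend to be easier to aggregate and compare across runs than a single free-form rating.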

I took a couple of courses on this topic: evaluating generative AI output and evaluating AI agents

How to work out where to focus your efforts improving an agentic system

Take a disciplined error analysis approach. The evals above are end-to-end, so you might need to analyse specific modular parts of your system to understand what is causing a disappointing end-to-end result (e.g. a problem with the research agent output could be caused by things going wrong in a number of places).

So analyse the spans - the outputs of each step - to understand in more detail what is going on at each stage. (A trace, by contrast, covers the whole run end-to-end.)
Analyse the traces where the output of your system is unsatisfactory. This will help you spot how to improve it.

Record what went wrong in each of your subpar examples, so that you can count which errors are most common.

This can help you work out where to spend most of your effort in making improvements.
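One way to keep such a tally is sketched below. The error log and the step names are invented for illustration; in practice each row would come from reading the trace of a subpar output.

```python
from collections import Counter

# Hypothetical notes from reviewing the traces of subpar outputs:
# each entry lists the steps judged to have gone wrong in one example.
error_log = [
    {"example": 1, "faulty_steps": ["search results"]},
    {"example": 2, "faulty_steps": ["search terms", "search results"]},
    {"example": 3, "faulty_steps": ["final essay"]},
    {"example": 4, "faulty_steps": ["search results"]},
]

# Count how many examples each step was blamed in.
counts = Counter(step for row in error_log for step in row["faulty_steps"])

for step, n in counts.most_common():
    print(f"{step}: {n}/{len(error_log)} examples ({n / len(error_log):.0%})")
```

In this made-up log, the search results step is implicated most often, so that is where improvement effort would go first.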

In addition to end-to-end evals, you can set up component-level evals too. These are cheaper, quicker, and can more specifically track the performance of a module in your bigger workflow.

How to improve LLM components:

Improve your prompts:
- Add more explicit instructions
- Add one or more concrete examples to the prompt (few-shot prompting)
Try a new model
- Try multiple LLMs and use evals to pick the best
Split up the step into smaller steps
Fine-tune a model, if the other methods don’t work (more complex and expensive, though)

How to improve non-LLM components:

Tune hyperparameters of that component - e.g. number of results, date range, similarity threshold, chunk size
Replace the component
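As an illustration of tuning one such hyperparameter, the sketch below sweeps a similarity threshold for a retrieval component and picks the value with the best F1 score. The scores, relevance labels and candidate thresholds are all made up, and F1 is just one possible metric for a component-level eval.

```python
# Hypothetical retrieval results: (similarity score, is the document relevant?)
scored_docs = [(0.92, True), (0.81, True), (0.74, False), (0.66, True), (0.40, False)]


def f1_at(threshold: float) -> float:
    """F1 score when only documents scoring at or above the threshold are kept."""
    retrieved = [relevant for score, relevant in scored_docs if score >= threshold]
    true_positives = sum(retrieved)
    total_relevant = sum(relevant for _, relevant in scored_docs)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(retrieved)
    recall = true_positives / total_relevant
    return 2 * precision * recall / (precision + recall)


candidates = [0.5, 0.7, 0.8, 0.9]
best = max(candidates, key=f1_at)

for threshold in candidates:
    print(f"threshold {threshold}: F1 = {f1_at(threshold):.2f}")
print(f"best threshold: {best}")
```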

Latency and cost optimisation

Get the output quality high first, then worry about these problems.

Latency:

Time the steps
Spot the biggest areas of slowness
Spot opportunities for parallelism
Try smaller/less intelligent models OR a faster LLM provider
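Timing the steps can be as simple as the sketch below. The step names are hypothetical and the sleeps stand in for real LLM and tool calls.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(step_name: str):
    """Record the wall-clock duration of a workflow step in `timings`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[step_name] = time.perf_counter() - start


# Hypothetical steps; the sleeps stand in for real LLM and tool calls.
with timed("search"):
    time.sleep(0.05)
with timed("draft"):
    time.sleep(0.15)

slowest = max(timings, key=timings.get)
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {seconds:.2f}s")
print(f"slowest step: {slowest}")
```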

Cost:

Measure the tokens and call costs associated with each step
Then reduce tokens and/or calls
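A sketch of per-step cost accounting. The prices and token counts below are invented; real prices vary by provider and model, and token counts would come from the usage data returned with each call.

```python
# Hypothetical prices in USD per million tokens; check your provider's price list.
PRICE_PER_MILLION = {"input": 1.00, "output": 4.00}

# Hypothetical token counts logged for each step of one run.
usage = {
    "plan": {"input": 500, "output": 200},
    "search summary": {"input": 12_000, "output": 800},
    "final answer": {"input": 3_000, "output": 1_500},
}


def step_cost(tokens: dict) -> float:
    """Cost in USD of one step, given its input and output token counts."""
    return sum(tokens[kind] * PRICE_PER_MILLION[kind] / 1_000_000 for kind in tokens)


costs = {step: step_cost(tokens) for step, tokens in usage.items()}
most_expensive = max(costs, key=costs.get)

for step, cost in costs.items():
    print(f"{step}: ${cost:.4f}")
print(f"most expensive step: {most_expensive}")
```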

Development process for an agentic product

Loop between build and analysis
Basic end-to-end build -> trend analysis -> improve individual components manually -> build evals and a small dataset to track end-to-end performance -> make improvements based on these insights -> make analysis more disciplined with component-level analysis -> drives work on specific individual components -> do analysis to make components more efficient

Agentic design patterns:

Reflection

1. Getting the agent (or a separate critic agent) to reflect on its own output:
  1. Get the LLM to come up with a way to approach a task
  2. Then ask it to review the approach for correctness, style and efficiency, and give constructive criticism on how to improve it. (You may also use a different model from the one that did the initial creation - reasoning agents are particularly good at reviewing)
  3. Get the agent to carry out the improvement
  Advice for reflection prompts:
  1. Clearly indicate the reflection action
  2. Specify the criteria to check
  You can compare performance of your workflow before and after implementing reflection.
2. Feeding in information from tools (e.g. output from unit tests or code errors)

Reflection is a consistent way to improve the quality of what is produced.

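A toy sketch of a reflection loop that uses tool feedback. All three functions are hypothetical stand-ins: generate returns code with a deliberate bug, run_tests plays the role of the unit tests, and reflect_and_revise stands in for the LLM call that would receive the code, the criteria to check and the test output.

```python
def generate(task: str) -> str:
    """Hypothetical first-pass LLM call; returns code with a deliberate bug."""
    return "def add(a, b):\n    return a - b"


def run_tests(code: str) -> str:
    """Tool feedback: execute the code and report any failing check."""
    namespace = {}
    exec(code, namespace)  # demo only; use a sandbox for untrusted code
    result = namespace["add"](2, 3)
    return "" if result == 5 else f"add(2, 3) returned {result}, expected 5"


def reflect_and_revise(code: str, feedback: str) -> str:
    """Hypothetical reflection call: a real one would prompt the LLM with
    the code, the criteria to check and the test feedback."""
    return code.replace("a - b", "a + b")


code = generate("Write add(a, b)")
for _ in range(3):  # cap the number of reflection rounds
    feedback = run_tests(code)
    if not feedback:
        break
    code = reflect_and_revise(code, feedback)

print(code)
```

Capping the number of rounds keeps latency and cost bounded if the model never manages to satisfy the checks.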
Tool use

Tools are functions that the LLM can request to be executed, e.g. returning the current datetime as a string, making an API call, or running a database query.

The course uses the aisuite library, which makes it easy to abstract the choice of LLM provider away from your code. You provide the function(s) to the LLM, and as long as each has a docstring explaining what the function does, aisuite passes that description through so that the LLM knows when to call it.
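To show the general idea of describing a tool via its docstring, here is a library-free sketch. It does not use aisuite and is not how aisuite is implemented; describe_tool, TOOLS and the request format are assumptions made up for this example.

```python
import inspect
from datetime import datetime


def get_current_time() -> str:
    """Return the current date and time as an ISO 8601 string."""
    return datetime.now().isoformat(timespec="seconds")


def describe_tool(func) -> dict:
    """Build a simple tool description from a function's name, signature and docstring."""
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func),
        "parameters": list(inspect.signature(func).parameters),
    }


# Registry of callable tools, keyed by name.
TOOLS = {func.__name__: func for func in [get_current_time]}


def execute_tool_request(request: dict) -> str:
    """Run the tool the model asked for. `request` mimics a model's tool call."""
    return TOOLS[request["name"]](**request.get("arguments", {}))


print(describe_tool(get_current_time))
print(execute_tool_request({"name": "get_current_time", "arguments": {}}))
```

The model only ever sees the description; your code stays in charge of actually running the function.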

Example tools:
1. Analysis
  1. Code execution (use a sandbox environment, e.g. Docker or the more lightweight E2B, to help protect against catastrophic errors)
  2. Wolfram Alpha
  3. Bearly Code Interpreter
2. Information gathering
  1. Web search
  2. Wikipedia
  3. Database access
3. Productivity
  1. Email
  2. Calendar
  3. Messaging
4. Images
  1. Image generation
  2. Image captioning
  3. OCR

Planning

Planning is working out the steps and sequence needed to achieve a goal. It is harder and more experimental, but can give impressive performance.

Telling the agent to write the plan as JSON is useful, as it structures the plan in a clear way for the downstream steps.
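A sketch of what a JSON plan and a simple executor might look like. The plan, the tool names and the catalogue are all made up for illustration; the course's own examples may use a different schema.

```python
import json

# A plan as the LLM might return it; the tool names and arguments are made up.
plan_json = """
[
  {"step": 1, "tool": "get_item_price", "args": {"item": "round sunglasses"}},
  {"step": 2, "tool": "check_stock", "args": {"item": "round sunglasses"}}
]
"""

# Hypothetical tools backed by an in-memory catalogue.
CATALOGUE = {"round sunglasses": {"price": 40, "stock": 12}}
TOOLS = {
    "get_item_price": lambda item: CATALOGUE[item]["price"],
    "check_stock": lambda item: CATALOGUE[item]["stock"],
}


def execute_plan(plan: list) -> list:
    """Run each step in order, rejecting any tool the plan should not use."""
    results = []
    for step in sorted(plan, key=lambda s: s["step"]):
        if step["tool"] not in TOOLS:
            raise ValueError(f"plan uses unknown tool: {step['tool']}")
        results.append(TOOLS[step["tool"]](**step["args"]))
    return results


results = execute_plan(json.loads(plan_json))
print(results)
```

Because the plan is structured data, it can be validated before anything is executed.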

An alternative, instead of getting agents to execute different steps in the plan through successive LLM calls, is to get the LLM to produce code that, when executed, will carry out the plan.

This can be more effective than just coming up with a JSON plan.
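A toy sketch of the plan-as-code idea. The generated code and the inventory are invented for this example, and the restricted namespace used here is not a security boundary; untrusted code should run in a proper sandbox.

```python
# Code as the LLM might return it (made up for this sketch).
llm_generated_code = """
in_stock = [item for item in inventory if item["stock"] > 0]
answer = sorted(item["name"] for item in in_stock if item["price"] < 50)
"""

inventory = [
    {"name": "aviator", "price": 80, "stock": 3},
    {"name": "round", "price": 40, "stock": 12},
    {"name": "classic", "price": 30, "stock": 0},
]

# Expose only the data the plan needs. This is NOT a security boundary:
# run untrusted code in a proper sandbox such as a container.
namespace = {"inventory": inventory}
exec(llm_generated_code, namespace)
answer = namespace["answer"]
print(answer)
```

The whole plan (filter, then filter again, then sort) runs in one execution, instead of one LLM call per step.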

This area is quite cutting-edge.

Multi-agent collaboration

This can be harder to control, but can result in better outcomes for complex tasks.

Common inter-agent communication approaches are to have a sequence of agents, or to have a managing agent calling the others as needed. You can also have deeper hierarchies. There are frameworks for setting up multi-agent systems.
1. E.g. ChatDev
2. Or using researcher, marketer and editor agents for a wider marketing workflow.
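A minimal sketch of the sequential pattern, using the researcher, marketer and editor roles. The call_llm function is a hypothetical stand-in that just labels its output with the role, and this does not use any multi-agent framework.

```python
def call_llm(role: str, task: str) -> str:
    """Hypothetical stand-in for an LLM call made with a role-specific system prompt."""
    return f"<{role} output for: {task}>"


def make_agent(role: str):
    """An 'agent' here is just a function bound to one role."""
    return lambda task: call_llm(role, task)


ROLES = ("researcher", "marketer", "editor")
AGENTS = {role: make_agent(role) for role in ROLES}


def manager(goal: str) -> str:
    """Linear hand-off: each agent works on the previous agent's output.
    A real managing agent would typically let an LLM pick the next agent."""
    work = goal
    for role in ROLES:
        work = AGENTS[role](work)
    return work


final = manager("summer sunglasses campaign")
print(final)
```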