How to Build an AI Agent
Building a functional AI agent from scratch in 48 hours - tool calling loops, context management, and the agentic workflow pattern.
- 1 Agents are fundamentally a loop: prompt → tool decision → execution → repeat
- 2 The model makes decisions, not just generates text - that's what makes it agentic
- 3 Start simple: file reading + a while loop teaches you more than any framework
- 4 Tool descriptions matter enormously for model decision quality
- 1 Manual testing doesn't scale - evals let you test dozens of scenarios with one command
- 2 Use an LLM as a judge to score AI output against expected criteria (0-10)
- 3 Evals make model comparison easy - benchmark accuracy, latency, and cost at a glance
- 4 Write 10-15 realistic test cases including edge cases that are likely to trip up your AI
- 1 Dynamic system prompts and tool selection let you use much cheaper models without losing accuracy
- 2 Tool calls count as separate requests - factor that into cost calculations from the start
- 3 Two to three cheap models working together can be more efficient than one expensive model
- 4 Use Claude Code as a research partner to architect solutions before implementing
What Makes an Agent?
At its core, an AI agent is deceptively simple: it’s a loop. You give the model a task, it decides what tool to use, you execute that tool, feed the result back, and repeat until the task is complete. That’s it. No magic, no complex orchestration frameworks - just a while loop and some JSON parsing.
The key insight is that the model is making decisions. It’s not just generating text - it’s choosing actions based on context. That decision-making capability is what separates an agent from a simple chatbot. And that loop? That’s where all the interesting engineering problems live.
The Simplest Possible Agent
Let’s build one from scratch. We’ll create an agent that can read files and answer questions about a codebase. No LangChain, no AutoGPT, no frameworks - just the raw API and a while loop.
First, you need a system prompt that tells the model what tools it has access to. Something like: “You have access to a read_file tool that takes a file path and returns its contents.” Then you parse the model’s response to see if it wants to call a tool. If so, execute it and continue. If not, you’re done.
Why Tool Descriptions Matter
Here’s something that surprised me: the quality of your tool descriptions has an enormous impact on agent performance. A vague description like “reads a file” performs significantly worse than “Reads the contents of a file at the given path. Use this to examine source code, configuration files, or any text file in the repository.”
The model uses these descriptions to decide when to use each tool. Better descriptions lead to better decisions, which leads to faster task completion and fewer errors.
Why Manual Testing Breaks Down
When your agent handles three actions, manually testing after each change is manageable. But agents grow fast. Add calendar integration, task updates, label management, recurring tasks, and time boxing - suddenly you’re looking at dozens of scenarios that all interact with each other. Fix one and you break another. The number of things that can go wrong grows exponentially with complexity, and manual testing simply can’t keep up.
This is the exact problem evals solve. Instead of testing two or three scenarios by hand and hoping nothing else broke, you define a comprehensive set of test cases and run them all with a single command. If something regresses, you catch it before it reaches production - not after users start reporting bugs.
How Evals Work
The eval system works in three phases: setup, execution, and judgment. First, each test case spins up a dummy account populated with specific test data - tasks, labels, calendars, whatever the scenario requires. Then the system runs a user request through the AI agent exactly like a real user would, capturing every detail: which tools were called, what the account looked like before and after, latency, and token usage.
The clever part is the judgment phase. An LLM acts as a judge, reviewing the captured output against evaluation criteria you define in plain English. Instead of checking for an exact string match (which fails with non-deterministic AI output), the judge scores how closely the agent met your expectations on a scale of 0 to 10. All of this gets compiled into an HTML report you can scan at a glance - scores, reasoning, and full debug logs for anything that needs a closer look.
Comparing Models with Evals
One of the most powerful things evals enable is quick model comparison. Swap out the model, run the same test suite, and you get an apples-to-apples benchmark of accuracy, latency, and cost. When GPT-5 Mini launched, running it through the eval suite immediately showed it had higher accuracy than Gemini Flash on the same tasks - but six times the latency. For a chat application where users expect near-instant responses, that latency difference matters as much as accuracy.
Without evals, this kind of comparison would take days of manual testing and gut-feel judgments. With them, you get concrete data in a single run.
Writing Good Test Cases
Aim for 10 to 15 realistic test cases that cover your agent’s core functionality and likely failure modes. Each test case needs an input (what the user asks), a setup (what data exists in the test account), and evaluation criteria (what the judge should look for in the output).
The criteria are where most of the value lives. Don’t just check “did it label the tasks” - specify which labels should go on which tasks. The more precise your criteria, the more useful your scores become. Include edge cases that are likely to trip up the AI: synonyms (asking about a “shopping list” when the label is “grocery”), ambiguous time references, and multi-step operations where order matters.
Use Claude Code to speed up test creation - it can look at your existing test cases, understand the structure, and propose new ones. But always review what it generates. Bad test cases are worse than no test cases, because they lead you to make wrong decisions about your agent’s performance.
The Hidden Cost of Tool Calls
The most common pricing mistake with AI agents is forgetting that every tool call counts as a separate API request. A simple user query like “move my meeting to tomorrow” doesn’t cost one request - it costs four or five. There’s the initial request to understand the query, a tool call to fetch the task data, another to update the event, sometimes a confirmation step, and then the final response. Multiply that across every user interaction and the costs scale far beyond what you calculated.
This is how a projected $3 bill becomes $40 in two weeks with only 20 light users. The math only works once you account for the full tool-calling chain, not just the initial prompt.
Dynamic System Prompts and Tool Selection
The fix starts with a simple insight: smaller models don’t fail because they’re dumb - they fail because you’re asking them to do too much at once. Sending a 25,000-token system prompt with 17 tools to a budget model is like handing someone a 100-item instruction list and expecting perfect execution. Reduce that list to three items and the reliability goes way up.
The architecture uses an intent classification layer - a cheap model like Gemini Flash that takes the user’s message and figures out what kind of request it is. Is it a simple search? A calendar operation? A complex multi-step scheduling task? Based on that classification, the system dynamically assembles a system prompt from modular pieces (only the scheduling module for scheduling requests, only the search module for searches) and selects only the relevant tools. The full 25,000-token prompt shrinks to 2,000-5,000 tokens, and the 17-tool list drops to three or four.
Choosing Models by Complexity
With the intent classified, the system routes to different models based on complexity. Ultra-budget models like Gemini Flash handle simple searches and basic operations. Mid-tier models handle moderate tasks. Premium models like GPT-4o only get called for genuinely complex requests that require sophisticated reasoning.
The key insight is that this isn’t a compromise. Smaller models actually perform better with focused context. When a model only has three tools and a concise prompt, it makes better decisions than a more powerful model drowning in irrelevant instructions. You’re not trading accuracy for cost - you’re getting better accuracy at lower cost.
Results and Key Takeaways
The dynamic prompt and tool selection system reduced per-request cost by over 80% - from roughly 20 cents per request (after accounting for tool calls) down to less than half a cent. And reliability didn’t decrease at all, confirmed by running the full eval suite against the new architecture.
Three lessons stand out. First, dynamic system prompts and tool selection let you use much cheaper models without losing accuracy - in fact, accuracy often improves because the model has less noise to filter through. Second, two or three cheap models working together (classifier plus executor) can be more efficient and more reliable than one expensive model handling everything. Third, always factor tool calls into your cost projections from day one - they’re the hidden multiplier that turns sustainable costs into unsustainable ones.
Intro / Preview of the agent
so I've been experimenting building an AI agent for my daily planning app Ellie and I wasn't going to make a video but check this out so I have this agent I just built and I can ask it things like "What time should Siccilia and I go to the gym on Monday?" And it's able to look through both of our calendars and find the best slot and I can even ask it to create the task and time box it for me and the craziest part of this is it was built with a lot less code than you might think i did this in about 2 days and I'm going to break down exactly how I did it in this video if you're a developer and you're interested in building AI agents for your own app this is the video for you and if you're not a developer and you're just interested to see what this stuff looks like under the hood I hope that this video demystifies it for you if you're new to the channel welcome to the video my name is Chris i build productivity apps i usually focus on one productivity app per video and today we're focusing on Ellie quick context ellie is a daily planning app it's basically your to-do list and calendar combined so that's the app that we're focusing on today so back to the agent it's a chat interface where you can ask it simple things like "What's my schedule?" But it didn't stop there so as a bonus I actually found something pretty game-changing turns out I could hook this thing up to a thing called Zapier MCP so now I can even ask it things like Brian sent me a Slack message asking me to take a look at our server bills can you find it and make a task for today with details and now the agent's going to search through Slack find the message from Brian and then it creates the task in Ellie with the details in the notes so even though I never built this integration directly with Slack by giving my AI agent Zapier access it already has access to this and thousands of other tools like Gmail Notion To-Doist and a ton of different tools the agent was already good but then hooking it up to Zapier made this thing 100 times more powerful than I anticipated and I'm very excited to show you guys what that looks like so in addition to how we built the agent we're going to be covering the Zapier stuff too so definitely stick around for the rest of the video okay so let's jump
What are agents / how they work
into it what are agents and how do they work the simplest way to describe agents are they're basically AIS like chatbt but it's given a set of tools that it's allowed to use and you can give it any tools but since this is an agent specifically designed for Ellie we're going to give it tools specific to Ellie like the ability to create tasks search through users calendar update tasks all the stuff you can do in Ellie so at the core the agent is an LLM just like Claude or GPT and we give it a set of tools it's going to run through a loop and keep calling tools until it feels confident it has exactly what it needs to answer the user's request let me show you a practical example of what this loop might look like so if I ask the LLM when can I go to the gym this week the agent's first going to look at this and say I'm going to need to call a tool for this there is a get calendar events tool available let me call this first once it calls this tool it's going to look at the results and say cool I have everything I need to answer this user's question it's going to go ahead find an empty slot through the day and then answer the question for the user and then the loop basically ends there so that took one loop to do that's basically it this is the foundation of how all the agents works they go through this looping and tool calling process so now let's go see how I actually implemented this in Ellie so the first
Creating a simple chat (with an LLM)
thing I did was I actually created a very simple chat UI it's just a chat there's nothing magical here cursor did this in like three prompts so I had this chat i can ask questions but it doesn't really do anything so the first thing I did was define a simple endpoint that just hits up an LLM and then just spits a response back from the LLM and in this case I'm hitting up a service called Open Router which basically hosts a bunch of different LLMs and I can switch between them with one line of code so the reason I do this is so I can test out a bunch of different ones really quickly and we'll see more in the video why I had to do that but just know I'm calling this service instead of hitting something like OpenAI GPT directly so I have the power to switch between Claude OpenAI Gemini and all these things really easily but the process works whether you're hitting Claude or OpenAI directly or going through Open Router it's all the same it's really simple it takes in any messages that I'm inputting so could be one message like what is the largest ocean in the world and it takes in the model I want to use which in this case it's GPT40 Mini from OpenAI and it sends the request directly to Open Router to be processed by this LLM model that I chose the GPT40 mini model is going to receive this process and try to figure out okay what's the largest ocean it's going to send this response back and I'm going to send it directly to that chat interface so when I ask it what's the largest ocean in the world the response then shows up in the chat it's not really an agent cuz it doesn't have tools or anything but this is the starting point let's actually make this
Creating our first tool for the agent
into an agent we're going to add our first simple tool to make this thing functional the way it works is we call open router exactly the same way as we did without the tools and we just pass in this new parameter called tools and we have to define what tools we're going to give it here we're going to define the tool here and this is what it looks like to define a tool might seem a little complicated but I promise it's not so first you give the tool a name and in this case we're going to call this create task because it's a tool to create tasks we give it a description which is to create a new task and the description is actually really important because the LLM really looks at this description to figure out when should I use this tool how should I use this tool very important do not overlook it and then we give it the parameters so these are the things that the tool requires to run and in terms of parameters these are all just things that you can put in an LE task but the only thing that's really required so you can see it's noted here is the description so at minimum this is saying to the LLM to use this tool you need to pass in all of these parameters but the only one required is this description or title so this is the first step in defining a tool this is almost like the blueprint of the tool this tells the LLM what is this tool for and what do I need to pass in to use it okay so that makes sense but we haven't actually defined the tool like how does it actually run and here's where we actually define the tool i have a separate function called execute tool and it takes in two parameters it takes in the name of the tool which in this case is create tool and then the arguments or those parameters that we defined for the tool it's going to go through the code look for a match in the name so in this case it's going to say cool create task matches here okay this is the code that we need to actually run for this tool and then here we're able to extract those arguments or those parameters that we defined like the task description and then we're actually able to run code and here's the actual code to create the task which in this case is really simple it's hitting up a pre-made function that I already have to create a task in my back end this part is going to look really different depending on what the tool is but just note this is the actual code to create a task in my case again it's going to look completely different this is just how I defined it but this is where you'd actually put the logic of the tool and then really important you return the results of the call so that the LLM knows cool here were the results so in this case the result could be I created the task you're good to go and now the LLM can see that and then tell the user I've successfully created the task for you or maybe if it fails and something went wrong that could be in the result and it can then properly tell the user hey I tried but it didn't work do you want me to try again or something like that so
Getting our agent to use the tool we made
we defined the tools and we fed the tools into our call to the LLM to open router now we have to do one more thing which is actually handle that loop so I'm going to try my best to explain this but the way it works first it's going to call the LLM with the tools with the user request once open router is actually going to respond and in the response they're going to tell us does this require a tool call or does it not like are we good do have we already answered the user's question if it does require the tool call now we're actually going to continue and start this loop and what's important here is that I've actually made it so that we're counting how many loops there are and I've defined a maximum of three loop calls that it can do the reason I did this was for safety because in theory this thing can actually just loop infinitely which would then rack up a huge bill i would go bankrupt and then no more videos so I decide to set a hard limit of three loops just for safety purposes it's probably a little bit easier to visualize this with an example let's say that we're asking it to create a task to go to the gym today so first what's going to happen is we're going to send the request to open router to create a task to go to the gym today we're going to send it all the tools that it has available which is the ability to create a task so the LLM is going to send a response back which is actually going to contain that it cannot answer the question it needs to call the tools and specifically it wants to use the create task tool so what's going to happen is we're going to go through here and it's going to trigger this it's basically going to enter this loop because it's going to tell us that it needs to call the create task tool and it can actually call multiple tools in one go but in this case it's just going to call one it's going to loop through all the tools that it wants to call which in this case is just create task it's going to execute that tool call which again is that function that we defined previously that takes in the tool name the arguments and then actually goes and does the tool call it's going to call the tool we're going to get the results of that tool call so in this case we're sending back to the LLM that hey I successfully ran the tool call and the new go to the gym task was created today and then that follow-up response is actually also going to contain a does this follow-up require tools and what tools do we need the follow-up response is going to be something like "We created the task i have everything I need to tell the user no more follow-up so I don't need you to call any more tools so I'm going to actually just send back an empty tools array." And when it loops around and gets back to the top it's going to see that empty tool array and know "Oh great okay we don't have to call any more tools let's go exit the loop we have everything we need to answer the user now." And it's going to exit the loop and then it's going to return the final response message back to the user or back to the chat UI so hopefully you guys followed all that but it does work so now when we ask it can you create a task to buy groceries today the agent is going to recognize it needs to create a task it needs to use a tool which is the create task tool so it enters the loop calls the create task tool with the right parameters and then that tool sends a response back that says yes cool we created the task and then when the agent sees that that's successful it says great I have everything I need to answer the user let's exit the loop and then it sends this final response back to the user that I've gone ahead and created the task in Ellie and now we've created our first real agent this is an AI that has access to a real tool that can actually impact and create tasks in Ellie and now the fun actually begins we can then start creating even more tools to make this more powerful and so that's exactly
Giving our agent the ability to update tasks
what I did next the next thing I wanted to do was see if it can handle more complex scenarios so a great next test was the ability for it to update tasks which actually required two steps i needed it to first search for the specific task and then call a tool to update it so two tools would need to be called for this an example of this would be if I asked it can you move my dentist appointment up 1 hour it should be able to in theory search for the dentist appointment and then update the task by moving it 1 hour and in terms of the tool really simple so I defined a search task tool and here's the description it's able to search for a task based on description label or date range and then here are the parameters we feed in the main one that's required is this query parameter and in terms of the tool definition super easy it's basically calling this pre-existing search endpoint I already had in my back end that powers the existing Ellie search so it was very few lines of code to implement this update task was the exact same thing i created the update task honestly very similar to the create task the definition also leveraged the existing update task endpoint that I already have in my back end so again just a few lines of code here too but when you put it all together and you add those two tools to the tools array alongside the create task tool it's very powerful and just immediately work so when I ask it move my dentist appointment up 1 hour it first calls this search task tool and it feeds in dentist appointment once it gets the results of that which is the matching task it then has enough information to call the update task tool and then it changes the time of that task to move it up 1 hour then when both tools are run and we got the results from both of those it now has enough information to answer the user and it sends a message confirming I have updated your task for you since we already have the base layer
Giving our agent the ability to interact with calendars
super simple adding those next two tools so let's dial up the complexity again the next thing I wanted to do was see can it handle calendar stuff because Ellie actually has access to your Google Outlook and Apple calendars these endpoints already existed in my backend so creating tools was super easy i just repeated the same process i defined them in my tools array and then I created the implementation which was very easy leverage the existing backend code so I gave it two more tools which are get the calendars the user has access to and then a tool to fetch the calendar events from those calendars so now if I ask it when am I free tomorrow afternoon what the agent does in this case is it checks okay I need to get the list of available calendars i need to figure out which one is the primary calendar and then return that and then I need to then call the get calendar events tool that we just defined and then it fetches the events for that primary calendar it looks through the events and then it finds the next availability in the calendar and then returns it to the user in the chat and it worked surprisingly well so when I asked this question it's able to correctly return when I'm free so now here's where things got a little bit more tricky because then the next thing I wanted to do was ask it can it coordinate between two users so if I ask it when should Siccilia and I go to the gym on Monday in theory if I asked at this it should be able to correctly pull my calendar pull Cecilia's calendar and then pull the events for both those calendars with the second tool and then check where are the free gaps however for some reason it was kind of struggling with this i think it just kind of got confused because it didn't really know okay which ones are Cecilia's calendars which ones are my calendars i was hoping it would pick it up because technically the ID of the calendar has her name and my name in those IDs i was hoping it would pick it up but it didn't so I actually had to modify the system prompt a little bit and I had to give it a little bit more guidance on how to handle that specific request and so here's the prompt when I said when a user asks you to find a good time or to check mutual availability and here's some specific examples i want you to follow this workflow and I defined the specific workflow on what order to call the tools how to distinguish between two calendars if possible and it did take a lot of trial and error but with this prompt it was able to consistently start fulfilling this request which is cool because this is actually an action that I do every single day instead of having to toggle both our calendars on and then look for a slot myself I can just ask it that genuinely does save a little bit of time each day and then obviously I can just follow up and tell it yeah can you go ahead and create that task for me and since we already have the create task tool just by doing that follow-up it's able to create the task and correctly
Giving our agent the ability to timebox our day
put it at the right slot in the calendar and the last thing I had it handled was the ability to time box your day i actually thought this one would be easy because in theory it has all the tools that it needs access to but it was actually really challenging for it to understand i think the problem was I was using GPT40 Mini to do all this stuff and it's a pretty small model still very powerful but I guess it's still not that good at handling things like creating an agenda based off multiple calendar events so I decided to create a specific tool that calls a more powerful LLM that is able to actually handle that and so now that's one of the tools so that's another cool thing you can do which is define tools that are actually just other LLMs that you can call and you can chain them and it just becomes really powerful to be able to orchestrate these complex workflows so as I promised
Giving our agent access to 7000+ tools with Zapier
before in the intro this is where things actually got pretty interesting so I could have stopped there but I'd actually heard about something called Zapier MCP if you're not familiar with Zapier they are an automation platform that lets you hook up basically any app into any other app to create automations so Zapier is really powerful they're already hooked into thousands of apps and the premise of their new service Zapier MCP is that it'll allow AI agents to hook into the thousands of apps that Zapier supports which sounded really interesting to me as I was working on this agent so I wanted to see how easy that would be and first a huge thank you to Zapier for actually sponsoring this video i'm a big power user of Zapier i use it to automate a ton of workflows for my businesses i've already created an Ellie integration for Zapier but if you're interested in automating workflows or as you'll see integrating Zapier with your AI agents definitely go check them out below i'll leave a link in the description but thank you to them for sponsoring this video so going back to Zapier MCP the premise is it allows your agent to connect with thousands of apps with very minimal integration code let me show you what the end result is and what this enables so I've hooked up Zapier to my agent and I've authenticated with Zapier and it now has access to any tools that I give it through Zapier so in this case I've given it two tools the ability to search in Slack and the ability to send messages so now what I can do is say "Brian sent me a Slack message asking me to look over our server bills can you find it and make a task today with the details so what it's going to do is build on the existing agent we made with those existing tools but it's going to add these two Slack tools from Zapier automatically to that list of tools we have available with literally no work on my end so now when I send this message the agent runs through the loop and says "Okay I can't answer this question what tools do I have?" Oh great i have a search Slack tool it's going to call this it's then going to search for the Slack message from Brian it's going to find this and now it has everything it needs to create that task for me and then it creates the task and you can see it appears in Ellie which is amazing because it was such minimal code to do this here's another example of something cool you summarize all the tasks that I worked on today and send a summary in the Slack channel i have a tool to get all the tasks in but I also have this tool from Zapier to then send messages to a Slack channel let me use both of these tools and now when I run this it's able to get the tasks and it calls Zapier and sends the message to the Slack channel you get the picture here the implications are now without having to build direct integrations into these tools this agent can basically access anything now it has access to GitHub to Linear to to-doist to Slack all of these things so if I tell it something like get all my linear tasks for today and create them as tasks in Ellie it'll automatically go ahead and do that so I've heard about MCPs in the past i'd tried them in the past but this is probably one of the first genuinely useful cases I have seen which is to supercharge existing agents by allowing them to connect to other tools and again with very minimal code that is an incredible use case and it just made this agent 10 times more powerful already but let me show you how I set this up in case you wanted to do this
Making my agent into an MCP client
yourself so what I had to do to connect to the Zapper MCP server was to make my chat and agent into an MCP client just like Claude and Cursor are there's really good documentation online from Anthropic so I basically followed the JavaScript version of this installed everything on my server and now the chat is an MCP client and it can call these MCP servers like Zapier directly and here's the code where we're initializing this MCP client and we're loading up the tools from the Zapier MCP server like the Slack tools that we gave it we inject it into the existing array of Ellie tools that we defined and so now it just lives alongside those tools and can be called anytime and when a tool call is made from Zapier it sends the request to Zapier they handle it and they send the response back just like we do with those other tools we defined in Ellie it might seem overwhelming and daunting but overall I think this took about 30 minutes probably under 100 lines of code to be honest and I was able to just feed the documentation into cursor and it was able to get it in a couple prompts so don't feel intimidated by this if you're building an AI agent definitely go check it out try to implement it see if it makes your agent more powerful but it's something I'm going to keep exploring here and I'm really curious to see what services users are using when I ship this to
Next steps
production to be honest I think this feature is kind of done and ready to be tested i'm going to open up a closed beta for a couple Ellie users so they can test it and I can see how useful this is find some edge cases i would love to port this thing to the iPhone version of Ellie so I can just open it up probably add some dictation capability so I can just ask it can you tell me when I should go to the gym with Cecilia tomorrow and then just have it go do it for me i think that's the dream so I will be releasing this on the iPhone version as well which I'm really excited about but I am extremely impressed by how useful this thing is just after working on it for 2 days i've
Why I failed in the past (at building agents)
tried to build an LA agent a bunch of times in the past but every single time the LLM was not good enough to execute the tasks consistently or it was just way too expensive to do this at scale but now I think they're smart enough and way more affordable to the point where I can finally release this to users if you're a developer building an app and you've been considering building an AI agent I hope that this video helps and
Conclusion and thanks for watching
pushes you to do it i think this is a really cool interface that I hope comes to more applications and if you're not a developer I hope this video demystifies agents for you if you like this content check out my Instagram and Tik Tok i post almost every other day about building productivity apps and obviously if you like this content don't forget to subscribe but thank you guys so much for watching and I'll see you guys in the next video
Intro / What we are covering
So, I've been building this AI agent that can control my calendar and my tasks. And I hinted in my last video that I built out an entire test evaluation suite to make sure that the AI performance does not go down. This is a common question I get. How are you testing your AI agent before you deploy it to production? And honestly, before I used to just wing it. I'd make a change to the prompt. I'd test it two or three times and if everything looked good, I'd just ship it to production. What can go wrong by doing that? Turns out a lot can go wrong and it gets worse as your app gets more complex. So, I made a system to help me test. And this is actually what a lot of larger AI companies are doing too. The system tests a bunch of scenarios and then grades how well the AI did. So then if I make a change to the prompt or the model, we can see if any of those scenarios fail when I make that change. Then we can dive deeper, debug and figure out why. Then we make sure before we deploy nothing breaks in production. If you're new, welcome to the channel. My name is Chris and I build productivity apps. And today we're going to go over my AI testing suite. So here's what I used to do. And maybe you're in the same boat, too. You make a change to your prompt, make a change to your model, and then you test maybe two or three things like creating a task, creating a label, and then if everything looks good, you just ship it to production. But then you start getting reports from users that's things are starting to break. So when you fix the ability to create a task, maybe you accidentally broke the ability to add labels to tasks. This is exactly what was happening to me. And it got worse as my agent got more complex. I started handling time box, calendars, recurring tasks. The numbers of scenarios grew exponentially. And then the number of things that started breaking as I made changes also grew exponentially. I was spending way more time manually testing this thing than actually shipping features. I knew that there was a better way and then I discovered it and it's
AI evals and how they work
something called AI evals. Evals are automated tests for AI, but they're way different from normal software tests. With regular code, you usually use some sort of unit test. And it works like this. If you expect 2 + 2 equals 4, it's pretty easy to check if it passed or failed. If the output does not match four, it's a fail. And if it does, then the test passed. Very simple. But with AI, it's different. When you ask it, can you go create a task? it might respond with, "I have created the task for you." Or, "Your new task has been created. Your workout has been scheduled." Technically, these are all correct, but with traditional tests, the test might fail because the text does not exactly match. So, here's what I built at a high level. And I'll go over a little bit more detail, and we'll go through some code towards the end of the video. So, stick around if you're interested in seeing that. But first, I write test cases in plain English, and they kind of look like this. When I say create a task to go to the gym, I fully expect it to create a task that has the word gym or workout or routine somewhere in it. and it should be scheduled for today. Then when I run my test, the system is going to spin up a dummy test account with test data, run my command through the AI agent just like a user would in real life, capture every single detail like the tools that were called, the state of the account before and after the test, and then we have a separate AI agent act as a judge and look at all of the output and see how closely did we match what was expected. And it provides a score usually from 0 to 10 depending on how close it got. And what's really cool is that I built the system to actually output an HTML report. So I can visualize and see what score did it got? Why did it get this score? And if I want I even have all of the things that were logged so I can debug a little bit further. And what I've done is I've created a bunch of these tests and I can run all of them at the same time. I can just run a single command and then I get this beautiful report and then I can tell did something go wrong? Do we have to debug it or is this good to deploy? Okay, so let me show you guys what this
Example of a test eval
test looks like. So I can actually just run this command and what it's going to do is it's going to run one of my tests for me. So this is a test called label task unlabeled. This is a test just to see can the AI agent label my tasks. When I run it, what it's going to do is first it's going to set up this test account with dummy data and I see it here happening in real time. Then it's going to start logging a bunch of stuff like what is the account look like at the time before we run the test. Then it's actually going to run the actual test. So it's going to try to ask the AI agent, can you go label the tasks? And we should see on the left, there we go. It just labeled all the tasks correctly. Then when the test is complete, it's going to first capture everything that it just did. Then it's going to reset that test account for me. What happens here? We can see that this test actually passed. It correctly labeled all the tasks, which we saw on the left. We can see the average latency. This is how long it took, the average cost. And then we actually have this report. So it'll show me we ran one test. It passed. I can see in this case that Google Gemini 2.5 Flash was actually selected here. It used one tool. And then there's a bunch of information here for debugging purposes just in case I want to see what exactly happened, what tools were called, what was the exact prompt, what was the exact response. I can see all of this information here. So that's actually how this works at a high level. And then there's a command where I can run all of the tests at once, which is going to take a long time, but it'll go through all of this and put it in this really nice report for me. And here's an example of what the report looks like when all of the tests are run. So we can see here what models were chosen. We can see if any of the tests actually failed or passed. So here's where it gets
Using evals to compare models
interesting. Let's say I wanted to test the new GPT5 mini model that just came out yesterday and I want to see how it performs in my application because I've heard really good things. Now, what I can do is actually just swap the model out, run the test suite, and now we can benchmark it against another model. So, I ran my test against Gemini 2.5 Flash and GPT5 Mini. And I had Cloud Code just interpret the results from the HTML report. Really interesting to see that GPT5 Mini actually had way more accuracy than Gemini Flash. But you can also see here that the latency was way higher. Gemini Flash was completing tasks in about 5 seconds where 5 Minute on average took about 30 seconds. This might not seem like a big deal, but for a chat application like Ellie Assistant, the latency actually does matter. Imagine if the user asked move my task and then they have to wait 30 seconds. They're not going to like that. So, the Eval system helps me conduct these tests really quickly and I can compare things like accuracy, latency, and cost at a glance. So, something that I can do is I
Giving Claude Code access to my evals
can actually ask Claude code because my evaluation suite just runs in the terminal. It can actually control this, look at the logs and do a debugging for me. I can actually ask it, can you get the latest report and tell me about any tests that failed? And so now Cloud Code is going to go run that search for me. By the way, a lot of people are asking what tool I'm using for dictation. And this is actually a tool called Whisper Flow. They're actually a sponsor of the channel for this month. So a huge shout out to them for sponsoring the channel. But even if they weren't sponsoring, this is genuinely the best dictation tool I have found. And the reason I use Whisper Flow is because it lets me write much more detailed prompts than I would if I was just typing out a few sentences. It feels a lot more natural to explain things just directly by dictating it to Claude Code. And Whisper Flow is specifically really accurate with developer terminology compared to other dictation tools. So I could say something like MongoDB Superbase and it actually gets these things right. So I'll leave a link to them in the description if you want to check them out. But that's the tool I'm using. So back to Cloud Code, I asked it, can you go check on the test that failed? And it actually found that this test did fail. It got a 5 out of 10. Here's the input. And then it identified, okay, these were the issues. This is why the LLM judge gave it a five out of 10. It said that these ones all passed. So now what I can actually ask it is, great. Can you debug why this happened based on all of the logs that you have? Now I'm going to
Sponsor: Wispr Flow
just let Cloud Code do its thing. It has all of the context that it needs. It has all the tool calls. It has the account state before and after. Hopefully, it should be able to figure out why this went wrong. Now it's done. and it actually gave me a bunch of suggestions. So, it thinks that there's something wrong with the intent classification system that I set up, which is probably true. I think shopping list might be too vague because I think it's actually a grocery list I set up. So, I intentionally made this one challenging by making the task called grocery, but I told the AI, can you add it to my shopping list? So, I might have to tweak the prompt a little bit to give it more guidance on these kind of synonyms and stuff. But really cool that Claude Code actually can make these suggestions and then if I want I can have it implement the solution and then just rerun the test and see if it actually works. So that's the test suite at a high level. That's how it works and you could probably honestly just ask Cloud Code to
How this works (code walkthrough)
implement something similar for you. So let's go a little bit deeper and I'll show you guys how this works and how I set this up. The way I set up the test cases is I have them defined as JSON objects and I just have this folder with all of the different test cases that I have. Each of them has an ID for the test, the name of the test, a description, which is actually really important if you're going to use cloud code because then it understands what you're trying to achieve in the test, what the input is. This is actually what the user is going to input into the AI, which in this case it's label my tasks. And then this is the setup, which is the initial data. This is what tasks, labels, and list should exist in that dummy account when we're going to run the test. And then we have the evaluation criteria. So this is actually what the LLM judge is going to be looking and scoring. So again, for label my tasks, this is the input. This is what we want to test. This is the account setup. We're going to be giving it a bunch of tasks that are completely unlabeled and then a bunch of labels that we want to apply to the tasks. And then this is the criteria that the LLM judge is going to be looking at. In this case, it must label exactly six tasks scheduled for today with the exact labels. Paint the walls in the kids' bedroom as a personal label. So all of these different things. And you can get as detailed as you want here. So, for example, if you just tell it as long as six tasks are labeled, this is a pass. I could have done it that way. But in my case, I want to make sure that it's actually applying the right labels. Like the personal one needs to go to this paint the walls in the kids' bedroom one. So, I have specified this in the criteria. So, that's how I set up the test. And then I've set up a bunch of different logging. It's just a bunch of utility functions to actually log all the data. And so when we're in test mode, when I hit the assistant API endpoint, I'm going to capture all of the tools that were called, a snapshot of the account state before and after, and then even things like latency and how many tokens were used so we can calculate the cost. You'll probably do the same thing, just make a bunch of these utility functions, and then insert them throughout the code of your AI application. And then here's the important part, which is this LLM judge. This is the actual prompt that I'm using. All I'm doing is saying, here's the user's request. Here's the evaluation criteria, which again, all the stuff is being pulled from that JSON configuration that I showed you guys. And then all of the stuff that I logged, like the execution trace, which is all the tools that are called, what messages were sent, and then the before and after state of that dummy account before and after the test is run. And then I just gave it some instructions on go ahead, see how closely we followed the criteria, grade it from 0 to 10, and then make sure to return the score, the reasoning, all the actions. All of this stuff is returned. This is the stuff that is going to be put in the HTML report. In my case, my prompt is actually very basic and pretty simple. Some companies choose to do more complex logging, like they could actually specify what is the expected account state at the end of this or what is the expected tool calls, but in my case, it was working fine, so I didn't really do that, but I might beef that up in the future. The key takeaway here is that you want to define what the AI is expected to do. Capture what it actually does and then use another LLM as a judge to compare what you're expecting versus what actually happened and giving it a score. And you can make this as complex or as simple as you want. I'll show you one more cool thing that you can do with Cloud Code and a system like this. So
Using Claude Code to write tests (fast)
instead of having Cloud Code just run the evaluations, what I can do is actually have it create tests for me. So I can ask it, hey, can you create a test for me that will test if the user is able to create a new label? What it's actually going to do is look at the existing test cases. So if you see here, that's actually what it's doing. It's looking through the test cases. It's trying to understand how those are structured. And then it's going to try to come up with a new test to be able to test what I'm asking. It's actually looking at this and trying to figure out how our labels created in the system. And now it's actually going to propose a new test case here. The important thing I want to stress though is to doublech checkck the test that it's creating and don't just blindly accept it because if you're working with bad test cases, then there really is no point and even worse, you're probably going to be making the wrong decisions. But this is a really good way to speed this up. My recommendation is to spin up at least 10 to 15 very realistic test cases and if possible something that tests edge cases, things that are very likely to trip up your AI application. So why should you care about this? Well,
Why you need evals (and why it matters)
if you're building anything with AI, you need evals. You need to be testing this. primary reason is that manual testing does not scale. And as your app gets more complex, it really doesn't scale. And more than likely, as new models come out, you're going to want to be able to test them really quickly. And again, the manual testing is not going to cut it. There are 100 different ways to set up
Conclusion and discord :)
eval. So, I hope that this gives you a good idea of how you're going to set it up in your application. And if you're building with AI, I have actually just set up an AI builder Discord community that I'm going to leave a link to in the description. I'll be in there sharing and building stuff. So, check that out if you're interested. And if you like this content, check out my Instagram and Tik Tok. I post almost every other day about building productivity apps. And obviously, if you like this content, don't forget to subscribe. But thank you guys so much for watching and I'll see you guys in the next video.
Intro / What we are covering
So, I wasn't planning on making this video for a few months, but my AI agent costs have gotten so out of control that I had no choice but to drop everything and solve this immediately. I got hit with a $40 bill in the last 2 weeks when I calculated, it would cost $3. And that's only with 20 people using it pretty lightly. So, we're talking about a cost that was 10 times higher than anticipated and if it continues and I get more users, I would probably go bankrupt. If you're new here, welcome to the channel. My name is Chris and I build productivity apps. This is actually the third video in my series about building a custom AI agent from scratch. This video is about how I reduce the cost by about 80%. We're going to go a little bit deeper in this video and instead of just showing you the solution, I'll actually show you how I architected the solution using Claude Code as a research assistant, and I'll walk you guys through the code so you can see what this looks like in practice. If you haven't seen the other two videos, definitely go check those out. I did a little bit of cost optimization in the last one, but to be honest, I wasn't too concerned with it, and I was more focused on making sure that it was actually useful. So, I kind of pushed cost optimization off because I knew I could deal with it later. But after seeing this bill and realizing how badly I miscalculated things, I think that time is now. Okay, so let me tell
Why my costs were so high
you what went wrong. During development, I had calculated that the cost would be about 2 to 4 cents per user, which was technically correct, but there was a massive oversight. I did not factor in that tool calls actually counted as a request, too. So when a user asks, "Move my meeting to tomorrow." Here's what actually happens. There's an initial request to understand the query. A tool is called to fetch the task data. A tool is called to update the event. In some cases, another tool is called to confirm with the user. And then there's a final response back to the user. So what I thought would be one request actually ended up being four to five. And this was happening with every single user request. So it made total sense why my cost would end up being 10 times higher. So I spent some time trying to figure out what was going on. And it looked like the number one reason was I'm using GPT40 as the model for almost every single call. And this is a very expensive model. Here's a chart comparing some of the popular models. And as you can see, GPT40 is one of the most expensive. During testing, I did try to use cheaper models like GPT4 Mini and Gemini Flash, but they just kept failing. They'd mess up time zones, they'd call the wrong tools, and sometimes they just completely misunderstood the user request. From my testing, most of the models were failing about 20% of the time, while GPT4 was failing about 2% of the time. So, I thought I had no choice but to use this expensive model. But then I realized something. Maybe these smaller, cheaper models aren't bad. Maybe I'm just asking them to do too much. Imagine if you're asking someone to housesit for you. You give them this massive list like, "Here's Luna's medication. Take out the trash. Here's how the thermostat works. Here's how the dishwasher works. Imagine there's a hundred items on this list. Even if they have the list right in front of them, there is a chance that when it's time to go walk Luna and they're like, "What time was I supposed to walk Luna?" They still have to scan through this list and there's a chance that they might accidentally miss it. Now, imagine if that list was smaller and there's only three tasks on that list. Now, there's a way higher chance that they would execute those things more reliably. And from my experience, I think AI models operate the same way. The more tools and instructions you give it, the harder it is for it to reliably execute them, especially the smaller models. And to be honest, the solution
The solution (dynamic system prompt and tools)
is pretty simple at a high level. It's to dynamically generate a system prompt and a tool list rather than sending in a giant system prompt and all 17 tools. If you only send it exactly what it needs to do its job, the smaller models have a way higher chance of actually executing the request. With that solution in mind, I think it's more interesting if I actually just show you guys how I came up with the technical architecture using Claude Code. Before we jump into it, I am very excited to say a huge thank you to Anthropic for actually sponsoring this video. If you've been following my channel, you know that I'm a big advocate of Claude Code, and I'd be using them even if they weren't sponsoring. So, if you want to check out Claude Code, which is the tool I use to do the research and to actually implement the code, I will leave a link in the description. This is something that I haven't seen a lot of people talk about and that's using cloud code as a research partner and to help you architect out a technical solution. So
Using Claude Code to architect a solution
let me show you guys how I did that and then we'll go into the implementation. What I'm going to do is I actually use cloud code inside of cursor. So I open up my terminal on the right side and then I just type in claude and this is actually how you run cloud code. So what I did in my case was I asked it, can you analyze this codebase and come up with a technical solution for me that will reduce the system prompt, the tool calling and allow us to use way cheaper models. Please ultraink. And the reason I do the ultra think is if you didn't know, it's a special keyword that actually gets Claude code to think a little bit harder. So I'm going to ask it to do this and it's going to think for a pretty long time. And this is something I actually don't see a lot of people talking about is using Cloud Code as a research assistant to bounce ideas off of and just as a sanity check. Sometimes you can ask it, hey, what do you think of this strategy? Is there something I'm not thinking about? I do this all the time when I'm thinking about architecting solutions. I do it to double check security. I do it to double check efficiency. There's a lot of ways you can do this, but in this case, I'm using it to figure out what is the best strategy to get the context down so I can use these cheaper models. It actually shows its thinking in real time. And I actually really like that it's catching some of this stuff. It identified that the system prompt is really massive. I am using very expensive models like 40 and claude sonnet that there's 17 tools. It just finished and it came up with this. I'm going to ask it. Can you put this in a markdown file so I can review? I just don't want to read it in here. Okay. So, it came up with these two documents. I'm going to show you guys what it did. It did a pricing comparison for some of the different models which is really great. And it actually recommend using 40 Gemini 2.0 Flash DeepSeek. I'm going to double check some of these cost and models. We might use different ones, but really cool that it did the research and got the pricing. It recommend to do some smart model selection and classify it according to complexity. And then we'll go to the other file here. And here's the architecture that it's proposing. So it says to use an intent classification layer using Gemini Flash, which will be very, very cheap. It gave us some intent types like whether this is a search operation if you're going to do an analysis. Here's the architecture for building a dynamic system prompt. So it's to break things down into these modules. So instead of having one giant system prompt, you have a bunch of modules and depending on what the intent is, we're going to take different models, piece them together, and build a dynamic system prompt. It's going to reduce it from 25,000 tokens to about 2 to 5,000, which is a huge reduction. Same thing for the dynamic tool calling, we're going to break these up into different groups. And then depending on what the intent or what the user is asking, we will pull different tools. So we don't send all 17 anymore. We're just going to send a couple. And this will reduce it by about 50 to 70%. So it wants us to actually choose the specific model depending on how complex the request is. So we have ultra budget models like Gemini Flash and then premium models like 40 which are only to be used when absolutely needed when we think this is a very complex request. We came up with this analysis. It shows the implementation details. So we're going to have one orchestrate request function and it looks like we're going to classify the intent. We're going to build that dynamic system prompt and tooling and then select the model and then return all of it. It has a couple more thoughts here but this is a really good game plan. This is actually how I usually start a lot of complex architecture. I have cloud code come up with it and a minimum this is a really good starting point even if I don't fully use it. So that's a tip for you guys. It's to use cloud code as a research partner when you're coming up with this architecture. So once I had
Going over the solution/implementation
this technical architecture the next thing I did was I just asked Cloud Code to go ahead and implement it. And I know it sounds crazy but it actually did implement this in one shot. And I think it's because the plan was so well laid out. It knew exactly what to execute. So, let me show you guys what this dynamic system prompt and tool system looks like in case you want to implement it in your own apps. Let me walk you guys through the implementation of this. I promise it's actually not that bad. We now have this new function that we passed every single request into. So, the first thing we do is we classify what kind of request the user's message is. So, is it a complicated one? Is it a time boxing one? Once we have that, then we can go build this dynamic system prompt. Then we can select what tools we need to call. And then we select what model are we going to use for the agent. So all three of these things are going to happen right after we classify what kind of message is the user sending here. So let's go jump into it. Let's first look at this intent classification. So this is that smaller model I'm talking about that takes the user's request and figures out what kind of request is this? What model should we be using? What tool should we be calling? Like this is the main thing that determines that. We're taking in the user's request and then we're basically trying to figure out what kind of message is this? Is this a complex scheduling thing? Is this an analysis question? And then we're just figuring out what complex it is, what kind of tools we need, what kind of model do we think we're going to need here. After we've analyzed the user's request, we're going to send in this object, which those other three functions are then going to use. And here's the instructions that I gave it to help it classify the user intent. So, we're going to send this off to the smaller, cheaper model, which in this case is Gemini 2.0 Flash. And then it's going to respond with this object. So, now we have all of this metadata of what kind of request this is, what models we're going to be needing, what tools we're going to be needing. And now we're actually going to use it to build the system prompt, to select the tools, and to select the main model we're going to use. For building the system prompt, it's very simple. First, what we're going to do is we're going to send in things that are kind of non-negotiable depending. So, this is like date information. This is whether we need to confirm things with the user. These are just non-negotiables. And then here's how we're actually building the system prompt. So what I did was I took my massive system prompt that I had before and I broke it up into modules. And depending on what kind of modules are needed for the request, we're only going to send those in. So for example, I have a module for deletion stuff, for scheduling stuff, for time zone stuff. We're basically picking from this list of modules and we're combining them all to make our really nice smaller system prompt. So that's what's going on here. Based on the intent, we're just going to dynamically build the system prompt from these modules. Selecting the tools is kind of the same way. Based on the intent and the type of request, I also have a list of all the different tools. If it's a basic search operation, I'm just going to give it all the search tools. If it's a scheduling operation, I'm going to give it all of these tools. And if it's a combo of both, so if it's we require searching and scheduling, it's going to be feeding in both. But that's basically what this select tools is calling. It's just pulling from this list of almost tool categories. and it's just building out what tools are we going to need to meet this user's request. So, these are the two big ones. It's building the system prompt and selecting the tools. And then the last one is actually selecting which model we're going to be using. And this one's pretty simple. We're just mapping. If it's a very simple request, we're going to be using Gemini 2.0 Flash. If it's a very premium request that's going to need a very expensive model, we're going to be using 40. Just ignore the fact that this is an array, by the way. I'm actually just using the first value in the array. I just set this up right now because I do plan on allowing for fallbacks in case, let's say GPT40 is down. Then what's going to happen is it's going to move on to the next model as a fallback. I have not implemented that yet, but I'm prepping for that. So that's why it's in an array right now. And so that's how we're selecting the model. And once all of this is combined and we have the tools, we have the models, we have the dynamic prompt, we're going to send all this to the LLM. But the result is that that context is now substantially smaller. We're talking like 80% smaller. And because of that, the smaller, cheaper models are going to have a way easier time understanding and actually following the instruction. So, I know this looks complicated, but I promise this is a really simple system. And yes, I actually had Claude Code implement all of this one shot. Obviously, I reviewed it, but I really like the way that it did it here. So, what happened after implementing these
Results of the changes
changes? Cost actually dropped from about 2 to 4 cents per request, which after tool calling ended up being about like 20 cents per request, down to less than half a cent per request, and most of the times even lower than that. on average over an 80% cost reduction. Now, the big question is, did switching to these cheaper models hurt accuracy at all? And in this case, the answer is no. And the way I tested this was by building out an evaluation system, basically automated tests that run a bunch of scenarios that make sure that the agent is running properly. The reliability did not decrease because all of those tests ended up passing. And again, I think it's because now that each model has exactly what it needs and nothing more, there's just a higher chance that these smaller models can actually follow the instructions properly. I won't go in depth on this evaluation suite, but honestly could be a really good video. So, if you want to see that, please comment below and I can make another video. The key takeaway
Key takeaways (3 lessons)
here is that this dynamic system prompt and tool calling method is a very good alternative if you have a massive system prompt that you think that a model would not be able to follow properly. If you do this, you can get away using much cheaper models. Second thing is I was a little bit worried about using two to three smaller cheaper models versus using one more expensive model because of potential speed and cost concerns. But in reality, actually using two to three cheaper smaller models could be more efficient and cheaper than using one large one. And that's what happened in this case. And the third is if you're dealing with tool calling with agents, make sure to factor that into the cost.
Conclusion & thanks for watching :)
The funny thing is, I'm really glad that this happened because it pushed me to learn some of these new techniques, which seem very obvious in hindsight. I'm sure I'm going to be learning more optimization techniques as I go, so definitely expect more videos. But if you like this content, check out my Instagram and Tik Tok. I post almost every other day about building productivity apps. And obviously if you like this content, don't forget to subscribe. But thank you guys so much for watching and I'll see you guys in the next video.