AI Agents Fundamentals, v2

Running Evaluations

Scott Moss
Netflix

Lesson Description

The "Running Evaluations" Lesson is part of the full, AI Agents Fundamentals, v2 course featured in this preview video. Here's what you'd learn in this lesson:

Scott walks through writing an evaluation, covering scores, mocked data, and executors. He demonstrates creating an evaluation file for file tools, setting up an executor with single-turn mocks, and using evaluators to convert outputs into quantitative scores.


Transcript from the "Running Evaluations" Lesson

[00:00:00]
>> Scott Moss: Let's move on to actually writing the eval. So we have the scores, we have the mock data, we have the executors, which are like dynamic runs, right? So now we can put those all together and run an evaluation. Let's start with the file tools evaluation. In the evals folder we want to make a new file, we'll call it filetools.eval.ts. Let's get busy. So we'll import evaluate from Laminar.

[00:00:43]
And then we want to import our stuff from the evaluators we just made. We're only gonna have one of those because I only made one, the tool selection score. If you pasted the other ones in, go ahead and import them too; I have those there as examples of what is possible, but we don't need them right now. We're also gonna import this type from types.

[00:01:14]
It's gonna be the eval data and the eval target. So we got that. And then we need to import the data set we have for this specific eval, which is the file tool eval. So we can say import dataset, you can call it whatever you want, it's a JSON file, from data/file agent.json, and then you can add with, type JSON, to hint to TypeScript that this is JSON.

[00:01:56]
And then this last one is gonna be from executors. Wait, did we make this one? Yeah, we did, the single turn executor, I just gave it a different name here. This one's got mocks, so just to be clear I'll say with mocks, because maybe you might make one that doesn't have mocks.

[00:02:34]
You shouldn't, at least not in this case, because we're touching the file system and stuff, but maybe your tools don't do that and you just don't want to use mocks. I don't know. I'm not doing that. OK, from here let's go ahead and make our executor, which is going to take in some data, typed as our eval data, and for every one of those pieces of data it's going to make a call to single turn with mocks with that data.
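Here's a rough sketch of what the file looks like so far, assuming the Laminar SDK package is @lmnr-ai/lmnr and using the names from the transcript; the exact paths and identifiers in the course repo may differ:

```ts
// evals/filetools.eval.ts -- sketch of the imports and executor described above
import { evaluate } from '@lmnr-ai/lmnr';

import { toolSelectionScore } from './evaluators';
import type { EvalData, EvalTarget } from '../types';
import dataset from '../data/file-agent.json' with { type: 'json' };
import { singleTurnWithMocks } from './executors';

// The executor: given one piece of eval data, run a single mocked turn
// of the agent and return its output.
const executor = async (data: EvalData) => {
  return singleTurnWithMocks(data);
};
```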

[00:03:18]
That's our executor. And then from there, we can just call evaluate like this. You can pass in a name here so it doesn't make a random name on the dashboard, but you don't have to. First we're gonna pass the data set, and I'm just gonna say as any so it can leave me alone. Then you pass in the executor that we just made. And then we need the evaluators. For the evaluators, like I said, I only made the score one, and you can call it whatever you want.

[00:03:56]
I have this one, I'm gonna call it selection score. And like I said, it's gonna take the output from the executor and the expected output from the data in the JSON file, and it's just going to return the result of us calling, in this case, the tool selection score. But first I'm gonna say, hey, if the target's category, and this is why I like to put the category on some of this data, if the category is secondary,

[00:04:34]
then I'm just gonna return one, and I'll tell you why in a second. Let me just cast that as any. OK, so I only wanna check this because I don't wanna penalize a secondary data set for getting the tool selection wrong. It's a secondary prompt, it's not something we're trying to build for, so I don't wanna give this a zero if the data this thing is testing is for secondary prompts.

[00:05:05]
You might also argue I could just filter those out with a filter on the data set and only do the ones that aren't secondary. I could do that too, that's totally fine. But I guess the reason I wanna keep them is that if the agent gets a secondary one right, that's great, and if it gets it wrong, we don't penalize it, because that's a secondary prompt anyway; we didn't expect the agent to solve for that. But for primary ones, or golden use cases, where somebody typed in a very specific thing and we should have figured that out,

[00:05:33]
I wanna penalize you if you don't get this tool selection right. Does that make sense? There's literally no right or wrong way to do this. I can't tell you if I've seen anyone do this; it's just something I made up, right? You just kind of have to get in there and see what matters. For me, what matters is that for a primary data set, I'm not giving you any leeway on tool selection.

[00:05:58]
For a secondary prompt I'll reward you if you got it right, and if you got it wrong, that's OK, I don't really care. I don't want that to mess up my eval scores because something we aren't solving for wasn't perfect, but I do want my eval score to be diminished if something we are solving for didn't do well. That makes sense, right? And then we just return the tool selection score, which takes in the output and the target.
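Put together, the evaluate call looks roughly like this. It's still a sketch following the transcript: the evaluators map keyed by name and the (output, target) signature follow Laminar's evaluate API as used in the lesson, but check your SDK version for the exact shapes:

```ts
// One evaluate() call with a single named evaluator. Secondary prompts get
// an automatic 1 so they never drag the score down; primary / golden prompts
// are scored strictly on tool selection.
evaluate({
  data: dataset as any, // the JSON data set imported above
  executor,
  evaluators: {
    selectionScore: (output: any, target: EvalTarget) => {
      // Don't penalize data points we aren't actively solving for.
      if (target.category === 'secondary') {
        return 1;
      }
      return toolSelectionScore(output, target);
    },
  },
});
```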

[00:06:36]
OK, I'm just gonna do that one. And if you added the other evaluators, you can put them here too. You can put whatever name you want here; this is the name that's gonna show up as the scorer in the Laminar dashboard. Cool. I'm gonna stop here. Any questions? So, a few things. A lot of eval stuff is tightly tied to the framework that you're using, but they're all pretty much the same thing.

[00:07:08]
There's always some executor, right? They have different names, but it's literally: give us an implementation of the thing that you're evaling. It could be one turn of your agent, it could be the whole agent, it could be one sub-agent, whatever the thing is that you're trying to eval, give us the implementation of it. The thing that's going to take an input and generate an output, that's the executor.

[00:07:28]
You probably want to make that dynamic so you can pass things in, right? Every eval framework has that. Every framework has data; they usually call them data sets. Collected, synthetic, wherever you got your data from, that's what these are. They're always input, output, and expected. Always. Every eval framework, every eval methodology is that, OK? And then there's always scorers. In this case, they're called evaluators.

[00:08:00]
These are functions that take outputs and convert them to quantitative scores that we can chart and plot and feel good about, you know: oh, this number went up. That's every single framework, it's just the names that are different. So don't get confused if you go use Braintrust or some other tool and think, oh, it's different. It's all the same, they just have different names. It's literally the same thing no matter what you use.

[00:08:24]
If you use Python, it's all the same, right? And then the last thing is they're all wrapped in what's called an experiment. This one evaluation that we just wrote, this one whole thing, is an experiment, and you'll have many experiments. That's why it's nice to add something like a group name here. I'll call this file tools selection, so that if I make another experiment that's another variation of this, say the data is different, or the model I gave it is different, or the scores I gave it are different,

[00:09:08]
I can group them all together to see averages and changes over time. I wanna see how much better the experiment that uses the V2 version of the executor we built is than the previous version we had last year. I want to compare them and see which one's better. So I put them in the same group name, and when I go look in the dashboard, it'll put them in one place so I can compare them.
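For example, something like this; the lesson calls the option group name, so check your Laminar SDK version for the exact key, and selectionScore stands in for the evaluator from the earlier sketch pulled out into a variable:

```ts
// Tag the experiment with a group name so every variation of it
// (different model, data, or scores) lands in one place on the dashboard.
evaluate({
  data: dataset as any,
  executor,
  evaluators: { selectionScore },
  groupName: 'file-tools-selection',
});
```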

[00:09:31]
That's what group name is for. Every framework that I've seen has something like this too. It's not always called group name, it might be called experiment ID or experiment name or something like that, but it's all the same. Any questions? Trust me when I tell you: if you aren't spending at least 40% of your agent development time on evals, you're not doing it right. This will take up more.

[00:10:07]
This will take up so much time, and it should. This is the process of making an agent. The other 30% of the time is improving the agent so these scores get better. And then, what's my math? That's 70% right there. The remaining time is, you know, fixing bugs, making tools. Making tools is easy, it's just functions, it's just code, it's stuff you've always done. It's a tool, it's a function, it's not hard.

[00:10:29]
None of those tools are hard to make. You can npm install a tool, it's done. But it's these things that are hard, these things that are new. We've never had to do this as traditional software engineers. This is the part where you need to spend your time, and if you're just going off vibes, that's great, but no one's going to think it's good, I promise you, because it's going to be bad. I don't care what Google told you about how good Gemini is.

[00:10:54]
That shit is going to be bad. You have to eval. And even if you eval it, it's still gonna be bad until you look at the evals and improve on them. So if you think about what you need in your toolbox in the future, in my opinion it's being creative enough to come up with and define the metrics and the evals that you need for the system you're building. That's assuming you know what system you're building; maybe someone else is the architect behind that, but assuming you know the system, what are you evaling?

[00:11:22]
What are the metrics that matter, and how do we get those metrics, right? And then, how do we change our architecture so it's flexible enough that maybe you don't have to do what we did here, where I had to make an executor, which is essentially right here. This is essentially our run function, but what if our run function was flexible enough to just be used in the eval?

[00:11:43]
How do we do that? How do we have our run function take in arguments so that it can run in production, but it could also run in the eval and just turn the tools off and mock them, right? How do we change our architecture so we're not duplicating our code? Right now this run function is super easy, but if it was super complicated, imagine having to replicate that in all your evals.
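Here's a purely hypothetical sketch of that idea; none of these names come from the course code, they only illustrate a run function whose tools are injectable so the same code serves production and evals:

```ts
// Hypothetical: one run() for both production and evals. Real tools get
// passed in production; mocked tools (nothing touches the file system) in evals.
type Tool = (args: Record<string, unknown>) => Promise<unknown>;

interface RunOptions {
  tools: Record<string, Tool>; // injectable: real in prod, mocks in evals
  model?: string;
  maxTurns?: number;
}

async function run(input: string, { tools, model, maxTurns = 10 }: RunOptions): Promise<string> {
  // ...agent loop goes here: call the LLM with `model`, execute whichever
  // tool it picks from `tools`, feed the result back, repeat up to maxTurns.
  return `placeholder output for: ${input}`;
}

// Production call:  run(userMessage, { tools: realFileTools })
// Eval executor:    (data) => run(data.input, { tools: mockedFileTools(data), maxTurns: 1 })
```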

[00:12:08]
And if you wanna change a system prompt, you gotta go change it everywhere now; it's insane, right? So that's one thing, and then the last thing you need in your toolbox is: what are the techniques I can lean on to improve the agent system and see these evals increase? Prompt engineering, that's the one everybody knows, but where? Is it the system prompt? Is it tool descriptions?

[00:12:35]
Is it hints from tool calls? Is it, you know, personalization from users? Is it RAG? What is it? That's just prompt engineering. Then there's fine-tuning. Do we need to look into fine-tuning now? Is that something we need? Or is it, hey, we're actually really good, we just need a better model, but the trade-off is it might be slower and more expensive, and the token window is smaller. What does that do?

[00:13:00]
You just need to know all the different things in your toolbox that you can pull out to run experiments and see the number go up. You put on a lab coat and you're gonna be doing science all day. That's all you're gonna be doing. That is the most valuable skill set I think anyone needs to have for building production-ready agents. If you can do that, you'll probably get a job anywhere.

[00:13:20]
That is the hardest skill set. It is not what people think it is, like, I made a call to an LLM and I added tools. Anybody can do that. This is stuff nobody can do, right? This is what gets you the job, in my opinion, because this is what pays. Companies that do this and spend a lot of time on it are actually selling to enterprises and getting contracts and deals, because they've evaluated the hell out of their system and they feel good at night about it.

[00:13:47]
So if you're one of the people out there who can do that, you'll probably get a job. If I was going to focus on something, that's what I'd focus on, yes.
>> Student: Do you have any strategies for testing different models with the evals in this app? Do you set up a separate eval for different models, or do you just swap out the model and run the same evals?
>> Scott Moss: It depends on whether you want to compare them against each other over an extended amount of time.

[00:14:15]
So, for instance, if you want to compare the frontier Gemini versus the frontier Anthropic versus the frontier ChatGPT model, then yeah, you probably want to compare them directly. What you would do is make one eval that just compares the models. And then in that data set, as you can see, I have, where did I put that? I'm not using it in here, but if you go look at my executor, it has the ability to pass in different things here, right?

[00:14:56]
Like the model, like data.config.model. So if you put a config property on a data point, you could put a model there, whatever model you want. So I would have data that is all the same, except only the models are different, and I'll put that in one evaluation. Then when I go look at it in a chart, I can plot the same inputs, outputs, and expectations where the only thing that changed was the model. I'll do that for every frontier model, and when a new model comes out for a provider, I'll update it to that frontier model and see what happens, right?
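A sketch of what that might look like, reusing the names from the earlier sketches; the config/model shape is the one the transcript points at (data.config.model), and the model names are placeholders:

```ts
// Duplicate the same data points per model so one eval charts identical
// inputs and expectations, with only the model varying per row.
const models = ['frontier-model-a', 'frontier-model-b', 'frontier-model-c'];

const modelSweep = (dataset as any[]).flatMap((point) =>
  models.map((model) => ({
    ...point,
    config: { ...point.config, model }, // read by the executor as data.config.model
  })),
);

evaluate({
  data: modelSweep as any,
  executor, // same executor; it picks the model off data.config
  evaluators: { selectionScore },
  groupName: 'model-comparison',
});
```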

[00:15:24]
And that way I can track it over time. You can also make your own benchmark; I think that's probably a better use case. Tracking the models is great for something like this, but depending on your system, you probably want to do a full benchmark: create a fake environment in which your agent will run. What does your agent have access to? Fake that. In our case, our agent has access to essentially a computer, so throw this agent in a sandbox, give it an input, and then inspect the environment.

[00:15:53]
And based on that inspection, you can give it a score, right? It's really like giving an agent a take-home test. You put it in the environment and then you say, here's a bunch of tasks, or one big task. It should have made these files, these files should have had this in them, it should have deleted this, we should have seen this in the bash history, it should have done all these things.

[00:16:20]
That's a real-world example of everything the agent should have done. You can do that as well, and I think that's the future of evaluations. I call those simulations. They're the next level of evaluations: simulating what an agent can do and then figuring out how to quantify that. That's a lot harder to do, because not every environment is under your control.
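A tiny, purely illustrative sketch of that take-home-test idea; the file names and checks are made up, and the course doesn't build this:

```ts
// Score a simulation by inspecting what the agent left behind in its sandbox.
import { promises as fs } from 'node:fs';
import path from 'node:path';

async function inspectSandbox(sandboxDir: string): Promise<number> {
  const checks = await Promise.all([
    // It should have created this file with the expected section in it...
    fs.readFile(path.join(sandboxDir, 'report.md'), 'utf8')
      .then((text) => text.includes('## Summary'))
      .catch(() => false),
    // ...and it should have deleted this one.
    fs.access(path.join(sandboxDir, 'scratch.tmp'))
      .then(() => false)
      .catch(() => true),
  ]);
  // Fraction of checks passed becomes the score for this simulation.
  return checks.filter(Boolean).length / checks.length;
}
```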

[00:16:47]
You might be talking to third-party APIs that rate limit and break and cost money and stuff like that. So it's kind of weird, but in our case we could do it. We could just spin up a VM somewhere and do it. Yeah. Any other questions?
>> Student: When you're running these tests, you're running a certain number of them. How do you think about sample size for your testing? The more the better, right?

[00:17:08]
>> Scott Moss: That's a great question. I think the answer is there's no limit, there's no cap; you want to do as much as you can. It's more about the quality than the quantity, I think. I do think there's a threshold for how much you need to be useful, but that really depends on the scope of your agent. If your agent can only check the weather, then you probably don't need that many pieces of data.

[00:17:34]
But if your agent is, oh, we can do everything, then OK, well, you will never have enough, because every time your agent does something new, you've got to keep adding new evals, right? Imagine you had an agent that allows a user to add MCP servers. You literally don't know what your agent is gonna do, because what it can do isn't defined by you, it's defined by the user when they add tools to it.

[00:18:01]
So you couldn't even write evals for something that doesn't exist yet. You now have to write evals against the basic ability to select any generic tool, things like that, and in that case there really isn't a limit on how much data you can have. But you probably want enough to feel comfortable that we've thrown enough stuff at it from all over the place, so that if somebody comes in here and attaches an MCP server, we feel confident it'll work, because we've tested so many different angles: really good tools, really bad tool descriptions, thousands of tools added, only one tool added, twenty tools where three of them have the same description. We just tried everything, and this is where we are; we've handled a lot of this stuff. So at that point it's just...

[00:18:54]
What makes you feel good, I guess. And then from there it's just collecting the data in real time, so it always grows, and at some point you'll start sampling that data because it's too much; you can't train on all of it, right?
>> Student: So is there a point where you would think about using statistics to guide what you're doing, just to manage costs and time?
>> Scott Moss: I think you can do an overall analysis on the data that you have to help you evaluate your evaluation, right?

[00:19:22]
You can say, let's do an analysis on all this data and the scores we're scoring for, and see if we can optimize this to score for something more important, or remove some scores, or find a better way to get more confidence. That would be a statistical analysis aimed at discovering what's more important, and that's why I think the job of figuring out what the scores are is the hardest and most difficult job, because you really need to be like a data scientist.

You've just gotta know how to use data very well; it's gotta be like your language, because coming up with that type of stuff is beyond me. I just couldn't do it. But the answer is yes. I just wouldn't even know where to start with that.
