Lesson Description
The "Analyze Eval Results" Lesson is part of the full, AI Agents Fundamentals, v2 course featured in this preview video. Here's what you'd learn in this lesson:
Scott covers running and interpreting evaluations, including average scores and analyzing individual runs to understand agent behavior. He discusses naming strategies for experiments, examining successes and failures, forming hypotheses for improvement, and emphasizes the essential role of human expertise in the iterative evaluation process.
Transcript from the "Analyze Eval Results" Lesson
[00:00:00]
>> Scott Moss: All right, I do have one for shell stuff, but it's mostly all the same, we don't need to write that. Let's just run the eval. So, to run the eval, you have everything set up, go to the package.json, you'll see a couple eval commands in here. You'll see eval, which just runs all evals. And we have like specific evals for specific ones. I don't know if these are going to work. I actually just wrote these assuming that's how that works.
[00:00:33]
Now I'm thinking about it, I've never tested that. So just run this one, just run that one. And if that doesn't work, let me know. So we can just do npm run, npm run eval. Cool, and if your stuff is set up right, it's saying out of 9 because that's how many pieces of data I had in that array. There were 9 samples, so each sample is being run through this evaluation, right? So for every sample you have in this data set, this will be run once.
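To make the "one run per sample" mechanics concrete, here is a rough sketch of what an eval file like this can look like with Laminar's TypeScript SDK. The option and field names (`evaluate`, `data`, `target`, `evaluators`) are how I recall Laminar's API and may differ in your SDK version, and `runAgentOnce` plus the scorer logic are simplified stand-ins for the course code, so treat this as a shape sketch rather than the actual course eval:

```ts
// Sketch only: the evaluate signature and field names are assumptions, check the Laminar SDK docs.
import { evaluate } from '@lmnr-ai/lmnr';
import { runAgentOnce } from './agent'; // hypothetical executor wrapper; returns { toolCalls: string[] }

// 9 samples means the executor runs 9 times, once per sample.
const data = [
  {
    data: { prompt: 'Can you read the contents of the package.json?' },
    target: { expectedTools: ['read_file'], category: 'golden' },
  },
  // ...8 more samples
];

evaluate({
  data,
  // The executor is the thing being measured: run the agent once for one sample.
  executor: async (d: { prompt: string }) => runAgentOnce(d.prompt),
  evaluators: {
    // Strict on golden samples: 1 only if every expected tool was actually called.
    selectionScore: async (output: any, target: any) =>
      target.expectedTools.every((t: string) => output.toolCalls.includes(t)) ? 1 : 0,
  },
});
// The dashboard then averages the per-sample scores: eight 1s and one 0 is roughly 0.89.
```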
[00:01:15]
So if I have 100,000 samples, this will be run 100,000 times. Does that make sense? OK. So, there it is, boom, I have an average selection score and it averages it out. I have an average selection score of 88%. That's pretty damn good, I would say. So we can click this link and we can go, why did it open up over here? Oh, let me try to bring it over here. Hold on. Also looks like my Laminar thing just died.
[00:01:56]
Uh, let me just. I was talking so well about you, like, why are you going to do me like that? Um. Let's go to evaluations. See if it actually ran it. Uh, yeah, yeah, it did, it did run it. I don't know why the dashboard crashed, but there it is. Here's my evaluation. Let's dig into this a little more. There we go. So, the reason you see these other ones is because I was running it earlier with those scorers. If you didn't add those other scorers, you won't see these.
[00:02:26]
You'd only see the scorer that you added. In the example that you just saw me live code, I only have selection score; I don't have tools avoided and tools selected. I already ran this for, like, the last week, so that's why you see those here. You will not see those here unless you added them, because I didn't just live code them, so let me clarify that. And these names, like I said, if you don't give it a name, it will generate a name for you. I guess you could think of this as, like, the experiment version, right?
[00:03:05]
Like, I like to say, this is, you know, V1, V2, V3. I think what you should do, in my opinion, let me see, do they have a description? I don't think they have one. Let me check before I say this... no, I don't think they have a description. Yeah, so I think what you want to do here is what I typically do with the name: I put the things that I changed since the last version, right?
[00:03:31]
So like what did I change on this experiment versus the last time I ran this, right? Because typically you make changes, you run it, you make changes, you run it, right? Or you might just run it, run it, run it, run it, run it, and then average all those out. Those are called samples I guess. I don't know how to change that in Laminar, but most other eval tools allow you to like turn up how many runs you want to do.
[00:03:56]
So even though I have 9 data points, run each data point 10 times, so that'd be 90 runs. So you could typically do that. But anyway, name, you can come up with a name, but typically this is where I would just describe the changes that I made. So, like, if the only thing I changed from this version to this version was 'updated system prompt to include instructions on picking files better.'
[00:04:20]
That's what I would put in the name. You can also just give it a name, and then, I don't know, map those names to somewhere in your code that has a description so you can go see, like, oh, what was this, what was the change? You could think of it like a commit message, I guess, right? It's kind of like that: what is the thing that you changed, right? So that's kind of how I think about it.
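Laminar may or may not expose a repetitions setting, but you can approximate the "run each data point 10 times" idea yourself by duplicating samples before handing them to the runner. A minimal sketch in plain TypeScript; `data` stands in for the sample array from the eval file, and naming the run after what changed is just a convention:

```ts
// The 9 samples loaded from the eval data file (shape elided here).
declare const data: unknown[];

// Turn 9 samples into 90 runs by repeating each one 10 times.
const REPEATS = 10;
const repeated = data.flatMap((sample) =>
  Array.from({ length: REPEATS }, () => sample),
);

// Name the run after what changed since the last experiment, like a commit message,
// so the evaluation history in the dashboard reads as a changelog.
const experimentName =
  'updated system prompt to include instructions on picking files better';
```

Then hand `repeated` to the runner wherever `data` went, and use `experimentName` if your eval tool accepts a name for the run.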
[00:04:44]
Uh, here are all the scores, you can see. At least in this view, this is the average. OK, that's an ugly-looking graph. Let's go to P90, that's the one that really matters. So you can see right here, my P90 for selection score is 90, sorry, 1. And let's go look at it right here. If I click on it, I can see all the individual runs, 0 through 8, so all 9 runs.
[00:05:20]
I can see the data that was passed into it, right? The trace all the way down, so I can see, OK, here is how the agent responded. It responded with a list of tool calls, which I expected. It responded with, hey, I want to read the file, I want to read the package.json, OK? And I expected it to call read file. This is a golden category, so I'm very strict about it, and the prompt was, can you read the contents of the package.json?
[00:05:52]
And I'm going to give you all these tools: read, write, list, and delete. I expect you to call read file, and guess what? It did. That's exactly what it did. So that's why this one scored, let's see, yeah, this one scored a 1. Perfect. It did exactly what I wanted it to do. Let's go look at one that scored a 0. So if we go to this one, because I think this is the only one that didn't get a 1, since we're still at 88.89%, so if I go look at this one that scored a 0, aha.
[00:06:22]
Let's see what happened here. Did I mess up the eval, or is our agent really bad? So I can go look at this and I can be like, oh, this one did no tool calls. I expected it to call delete file. It is a golden path, so I'm very strict about this. Whatever prompt I put in here, so 'remove the old backup.txt file', our users are saying that a lot, because I said this is golden, this is a direct path, people use this a lot. And it didn't call the thing. Huh. Even though, and I'm looking, I'm like, well, I did everything right.
[00:07:01]
I expected it to call delete file. The category is golden. I gave it these tools. It didn't call it though, right? So this is where you would go and be like, huh, there's something wrong with the delete file description, we need to test this, right? So what I would go do is make a hypothesis. I'm like, I think we can probably improve the delete file description, because that's the only input that it has in this evaluation.
[00:07:25]
I only gave it a description of the delete file, and I gave it the arguments for the delete file and the descriptions for those arguments, that's all I gave it, right? And I guess I gave it a system prompt too. So I would probably want to change one, two, or three of those things. I would take a guess based off my experience, what I've seen, and what I've researched.
[00:07:45]
I want to make a small, incremental change, and I want to change the prompt to see if I can get this tool to be included, right? And this is assuming I actually set this test up right, because I might look here and find out that I forgot to add delete file to the list of available tools, and that's why it didn't pick it. That would have been my mistake; I'd need to go fix that evaluation, right? But from what I'm seeing, I don't think that's the case.
[00:08:14]
Let me actually verify, because there is this other thing here called tool names, and I'm like, wait, tool calls, tool names, write file. Oh no, this is just the ones that I selected, yeah. So yeah, I did all the right things. It just didn't do it, right? And this is where you can get really, really meta, right? If you're like, I'm just trying to figure out why that might be the case, you can take all of this.
[00:08:40]
Right, you can point Claude at your executor, you can give it this, and you can be like, why didn't it pick the right tool? And it can give you ideas on why it didn't pick the right tool and how you might better adjust that. A good starting point is to use a bigger model than the one you're evaluating to give you some ideas on how you might improve it. It might not be right, but it definitely gives you some ideas, like, hey, maybe change this, or, oh I see, maybe you forgot to do this, right?
[00:09:19]
And it's a great starting point. Boom, make that change, run the evaluation again, see if it goes up. Once you're comfortable with it, cool, commit those changes to GitHub, merge. You just improved the agent. Congrats. Right? Like, when I was working on my startup, you could not submit anything to GitHub without an eval or a set of experiments that you ran that showed there was no drop in quality.
[00:09:40]
If you did, it would not get merged; it just wasn't going to happen. It either had to stay consistent or it had to go up. You could not have a drop in quality. Or, well, there were some things that we didn't mind if they dropped because those things didn't matter as much, but there was still a threshold. It was like, oh, it can't drop more than 2%; if it drops more than 2%, it doesn't go in. Like for instance, we knew that if we switched out this better, slower, more expensive model for a slightly less good but faster and cheaper model, we were going to see a decrease in this score, because the model quality,
[00:10:15]
the model strength, just isn't as high. But that's a trade-off: we're trading some quality for speed here, so we're OK with that. So run that a couple of times, it's going to be lower, set that as the new baseline, keep it moving, right? It's the same thing as snapshot testing on UIs, where you compare two pictures on top of each other, but your change removes something. So, yeah, you expected that, and now the one with the removed element is the new baseline.
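That "no drop bigger than 2% gets merged" rule can be encoded as a small CI gate. This is a hedged sketch, not the actual setup from Scott's startup: the file names, the score shape, and the environment variable are all placeholders:

```ts
// quality-gate.ts: fail CI if the new eval score drops too far below the stored baseline.
import { readFileSync, writeFileSync } from 'node:fs';

const MAX_DROP = 0.02; // allow at most a 2% drop, per the rule described above

const baseline: number = JSON.parse(readFileSync('eval-baseline.json', 'utf8')).selectionScore;
const candidate: number = JSON.parse(readFileSync('eval-latest.json', 'utf8')).selectionScore;

if (candidate < baseline - MAX_DROP) {
  console.error(`Selection score dropped from ${baseline} to ${candidate}, blocking merge.`);
  process.exit(1);
}

// Like updating a UI snapshot: when a drop is expected (say, a cheaper, faster model),
// deliberately promote the new score to be the baseline going forward.
if (process.env.UPDATE_BASELINE === 'true') {
  writeFileSync('eval-baseline.json', JSON.stringify({ selectionScore: candidate }, null, 2));
}
```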
[00:10:48]
Yeah, yeah, I think it's interesting. I took the code sample from the website, which had, I think, GPT-4o mini. I ran this evaluation 4 times, and it came back true for delete all 4 times. So it's kind of interesting, yeah, so, like, switch the model, look at that, you know. Exactly, yeah, and this is where it's like, the more sample size, the better. So it's like, how much more do you need to feel confident, you know?
[00:11:15]
I don't know, you could be mathematical about it. You'd be like, well, you know, on average we get X amount of user inputs a day, so if we could have a data set that is at least half of that, and the output of that was on average, you know, over 75, 80%, given that we're not some life-critical agent, I guess I'd feel good about that, right?
[00:11:36]
And then as usage goes up, we want more data. Like, I don't know, you could be mathematical about it; there really isn't a right answer, or maybe there is and I'm just not a statistician, I don't know, but that's typically how I think about it. Mine failed on list files. The prompt was, what's in this project, show me around. I looked at the description for list files and it said, list all the files in a directory.
[00:12:03]
So I changed the description to 'list all the files in a directory or project', and then I ran it again and it passed. Is that the kind of fix that you would make? OK, yeah, exactly, 100%. It seems weird to do that, but yeah, something like that could be a small win. That's prompt engineering, and it's usually the first thing you reach for out of your tool belt when you're like, huh, how do I improve this?
[00:12:28]
Let's do some prompt engineering. That's typically the first thing you do. If that's all you ever have to do, congrats, you have really good AI engineers and they made it really easy for you. Chances are that's not going to be the case. You're going to have to change some architectural stuff, like models, like, you know, different reasoning methods. Prompt engineering won't be enough; you'll have reached a maximum of how much prompt engineering can do, especially if you're using, um, in this case I'm using a reasoning model.
[00:13:08]
This is why I'm thinking my thing failed: with a reasoning model, according to the folks from OpenAI, you don't want to tell it what to do. If you actually tell it what to do, it's actually worse. You want it to figure it out by reasoning. So in this case, what I would do is probably pull back on some of the super descriptive prompts, and then, let me see, I would go to this executor, set provider options, and say OpenAI.
[00:13:45]
And I would say reasoning effort, and I would say, like, high. I'm like, OK, cool. Now really think about this, right? Because I think by default it's low. I think the reasoning effort is low for this. So, low, medium, high, I'm like, OK, high, really think about this. And then, you know, I would play around in here. I'm like, let me see, the delete file, what about this: delete a file at the given path, the path to the file that you want to delete.
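For reference, here is roughly what those two knobs look like, assuming the course's executor is built on the Vercel AI SDK (the generateText and provider options calls mentioned in the lesson): a reworded delete_file description and a higher reasoning effort passed through provider options. The model ID is a placeholder, the schema field is `inputSchema` in newer SDK versions (older ones call it `parameters`), and the description wording is just the variant being tested:

```ts
import { generateText, tool } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

const deleteFile = tool({
  // The tweak under test: a slightly reworded description.
  description: 'Delete a file at the given path.',
  inputSchema: z.object({
    path: z.string().describe('The path to the file that you want to delete.'),
  }),
  // Real deletion lives in the course code; stubbed here.
  execute: async ({ path }) => `deleted ${path}`,
});

const result = await generateText({
  model: openai('gpt-5-mini'), // placeholder: whichever reasoning model your executor uses
  tools: { delete_file: deleteFile },
  prompt: 'Remove the old backup.txt file',
  providerOptions: {
    // Turn reasoning effort up from the default; typical values are 'low' | 'medium' | 'high'.
    openai: { reasoningEffort: 'high' },
  },
});

console.log(result.toolCalls);
```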
[00:14:10]
I don't know that I feel pretty confident about that, so I would just change that, and then I'm going to run it again and see what happens, right? And I'm pretty sure I'm going to get another one that's going to fail or something. It's going to be some hot potato stuff, like whack-a-mole, you know, you're plugging a leak over here and then this other thing starts leaking.
[00:14:33]
Now you ran out of feet and hands, you don't know what to do anymore. It's kind of like that. That's literally what it feels like making an agent and it never doesn't feel like that. So like, this is why people who can do this stuff will always be employed because the moment those people take a vacation, it's over. It's over. Everything's going to leak, everything's going to explode. These people have to be working around the clock, around the clock, especially the more agency you give an agent, like the more autonomy that agent has, the more you need those people for sure, so.
[00:15:09]
Sticky pigs, oh man, how does it come up with this stuff? Let's see, looks like the same one, just delete file, yeah. This one's still broken. At least it's consistent. I like consistency, because now I'm like, OK, I believe it's us now. I truly believe it's something that I did, because, I mean, it's only 2 samples, but that's still pretty consistent. So I might run this, I don't know, 100 times just to see, and if it,
[00:15:48]
you know, the majority of the time comes back as a failure, then I'm like, it's definitely us. Let's fix it, right? Yes. Does the duration tell you anything? It seems like the failed test is always the long one. Uh, that's a great question. Let's look at the span. Tool calls, selection, tool names, span. I was hoping that if it errored out, it might say something. Yeah, I was looking for errors; I think if it errored out, it would show it here. But duration.
[00:16:29]
That is a good point. It shouldn't matter, because all these things run in parallel, they don't run one after another. But this one, wait, let me see. Oh, this one took 31 seconds, yeah. It did take a long time though. It should have figured it out, honestly, but at least now I know, I feel good that this was us. Remove the old backup file. So let's say this, what if I say...
[00:16:54]
Let's do something stupid, right? What if I go ahead and say, let's go to the data. Let's look for 'remove the old backup.txt file', right? Because in our tool description we use the word delete, we don't use the word remove. So I'm going to change one of these. I'm going to change the prompt to say delete, seeing how it's a golden data set; let's be very specific, whereas 'remove' might be too finicky and should be considered a secondary one.
[00:17:22]
Who knows, given the model that we're using. I don't know, so let's run that again. You will be doing what I'm doing here all day. This is what you'll be doing, in a more sophisticated way for sure, and you can automate a lot of this stuff, but yeah. This is AI engineering 101. You know, there's the infrastructure side too, which I love and I think is really cool, but none of that means anything without this.
[00:17:57]
Like, I don't care what your infrastructure is doing and how cool your microservices are. If this shit ain't good, it ain't good. Look, it's still bad. At least it's consistent though, I like that. So we can go back to our evaluations, press it, screw, where do they come up with this stuff? And wait, is this a different one? No, it's still the same one. So yeah, something about this one it doesn't like, so we would have to, oh wait, what's this?
[00:18:23]
It said list files. Oh, so it did do something different this time. It looks like it tried to see if the file was there first before it deleted it, which kind of makes sense. I don't want to like punish it for that, but like, that's not what I told you to do. I told you to delete the file. I'm guessing these models, seeing how they're trained on like coding, are very scared of doing something like deleting a file, so it's like trying its best not to do it.
[00:18:49]
So my hypothesis is, if I put something in the system prompt like: ignore everything about deleting; if someone says delete, you delete it. Don't worry about deleting something. There's another system that asks for approvals first, you don't need to worry about it. They will be prompted before it deletes anything. Just feel free to suggest delete. Basically, tell the model that it's OK. I would imagine OpenAI trained this model to be very cautious about destructive things like this.
[00:19:24]
This is, like, approaching asking the model to make me a bomb. It's approaching that. That's my hypothesis. Yeah, Christopher in the chat says they added, 'if the user asks you to perform a file operation (read, write, list, delete), you should use the appropriate tool to do so,' and that helped make it work for them. So you would never have known that if you didn't eval; you've got to eval this stuff, you know what I mean?
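If you want to try both fixes discussed here, the "deleting is OK, approvals happen elsewhere" idea and Christopher's line from the chat, they are just additions to the system prompt. The wording below is hypothetical, not the course's actual prompt:

```ts
// Hypothetical system prompt additions; adjust the wording and re-run the eval to measure the effect.
const SYSTEM_PROMPT = `
You are a file management agent.

If the user asks you to perform a file operation (read, write, list, delete),
use the appropriate tool to do so.

Do not hesitate to call the delete tool when asked to delete or remove a file.
A separate approval system prompts the user before any delete actually runs,
so suggesting a delete is safe.
`;
```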
[00:19:50]
So. Cool. Uh, any questions? Yes: where are those questions coming from, Laminar? They're in the evals data folder. Yeah, there are 3 .json files in there; the one that we're using is file tools .json. Nice, yep, and you can add more in here. I mean, you could feed this to Claude and make some more, really confuse this thing, and just go crazy. This is a playground for you all to experiment with experiments on an agent.
[00:20:15]
Give it some tools, go crazy, make some evals. And honestly, evaling LLMs, you learn so much about how they behave and what they do. I could look at the output of a Gemini or a Claude or a GPT and tell you exactly which model it came from most of the time, because I've just evaled them so many times. I'm like, that just sounds like something Claude would say; Claude is such a kiss-ass, that's Claude.
[00:20:48]
I can tell. I know it, I've seen that before. Oh, that's ChatGPT, they're using way too many hyphens, way, way too many hyphens. That's definitely ChatGPT: bullet points and hyphens, that's definitely ChatGPT. So you just know. Evals, this is the skill set. Please listen, you've got to learn this stuff. This is going to be the future, like, if you want a, you know, 300K, 400K job without becoming an AI researcher or scientist.
[00:21:10]
This stuff, this is the stuff, right? You go work at any company building agents, which is going to be every company. This stuff, this is the stuff, so trust me on that. Other than that, yeah, I had more examples in the code, like the shell tool evals. Feel free to add those, implement those if you want. It's just more of the same; there's nothing new in there. That's why I didn't feel obligated to do it.
[00:21:38]
Don't do the multi-turn stuff, don't touch that. That's going to be different; I've got to walk you through that. It's actually quite complicated; this is nothing compared to that, OK? So we're going to walk through that and then we'll get to the agent loop, because right now our agent is not very agentic, you know? It doesn't do anything. It doesn't even respond back here; it just says done, at least on mine.
[00:22:00]
So we will do an agent loop and make this conversational. It's the same thing you would do in generateText, where you say step and give it a count and it does the loop internally; we're going to do that manually. We're going to make that loop manually, right? We're going to handle tool calls ourselves manually, we'll do approvals, so you have to approve a tool before it does a thing. And we'll do streaming.
[00:22:23]
Right now we've just been generating text, where we wait for the whole output to show up and it just jump-scares you. We're going to stream the text token by token instead, which does complicate stuff, but it's kind of a necessary evil in my opinion. You really need to stream stuff; it's just a better experience. You've mentioned evals as being one of these, like, core strengths for humans.
[00:22:44]
It seems like an obvious case to try and automate. Do you see not enough investment to solve that problem, or are you seeing that the investments that are happening just aren't productive, that it's too hard to automate the evals? I think, OK, it's like this. If anything, the only people that can automate stuff are people that can do it very well, right?
[00:23:07]
Or at least they have to help the people who can automate stuff. You need a subject matter expert. There just aren't a lot of SMEs that know this stuff very well, so it's very hard to automate. Like I can automate my bespoke eval process for the agent that I built because it's my process and I'm the expert of that process, but I probably couldn't automate something generally speaking for everybody who wanted to do something.
[00:23:34]
I just don't know enough, and even my automation is probably not optimized, because it's limited by what I know about evals at the time, and there's probably room to grow there. I'm not a statistician. Yes, it can be automated to some degree: when we get into the multi-turn evals, you'll see we'll use LLM-as-a-judge as a scorer, where the LLM will look at the output and give us a number between, like, 1 and 10, and then we can get a score from that.
[00:23:57]
So I guess it's kind of automated, but not really. I mean, yeah, because it's like, do you eval that thing? Do you eval the judges? And then we do eval the judges, you know, so it's like, at some point, where does it end, right? At the end of the day it all goes to humans. At the end of the day there's a subject matter expert that's like, all right, let me help you out with these evals.
[00:24:17]
Let me help you automate that. There's just no way around it; it's just never going to happen. So that's what I've seen in a lot of the companies that I invested in, that's where they struggled. They were like, we need help with evals, and everything came back to, all right, let's go hire like 30 experts, have them sit down and go through all the data that you have and label it, pay them whatever they need for a few weeks to do that. That's all they did, and the evals were great.
[00:24:41]
They were amazing. The product was so much better, they knew exactly what to do, it was so good. But anything before that, where they were trying to do the evals themselves, they were not the subject matter experts. They knew how to make the agent, but they didn't know the domain. It's like if you're an engineer and you had to interview an attorney; your company was hiring an attorney, so they sent you to go interview them.
[00:25:09]
You wouldn't even know how to qualify whether this person is a good attorney or not. Even though you've done interviews, you know how to do interviews, and you know how this company works, you don't know anything about this role, so you wouldn't be a good interviewer. So even though you can make agents and stuff, you're really good at that, and you understand how to make evals, you've done them, you don't know a damn thing about what this agent is supposed to do at an expert level. Because of that, humans have to be involved, and those humans have to be experts.
[00:25:48]
No. Any other questions? Yes, my company has, I don't know if it's an enterprise Copilot license, and one of the things you can do there, and everybody on our team does, is build an agent with a couple of clicks. It makes a new agent, you give it a couple of documents, and then you call it a whatever agent. But I can also just click and create it without adding anything, and it still seems to work.
[00:26:10]
So what's the difference between that and what we're building here? Is there a difference? Uh, I could talk about this for so long. The difference between those two things? Not a lot, and not because what they've built is so sophisticated, but because what we built here is so basic, right? We built something very basic here, but everything starts off very basic. We built bones, and these are very strong bones, but without proper investment in infrastructure, which allows different UX,
[00:26:39]
you know, different strategies around gathering context. Like, for instance, you said you can upload a file and do stuff. OK, they're doing RAG, right? They allow you to upload a file, and they do some type of semantic query on that content, maybe a hybrid search where they also do a BM25-style text search or something like that, and then they do re-ranking. That stuff's not hard; that stuff's been solved already without code.
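As a rough picture of what "a semantic query plus a BM25-style text search, then re-ranking" means, here is a small sketch that fuses the two ranked lists with reciprocal rank fusion, one common way to combine them before a heavier re-ranker. The document IDs are made up; in a real pipeline the vector ranking comes from an embedding index and the keyword ranking from a BM25 engine:

```ts
// Hybrid retrieval sketch: combine a vector ranking and a keyword ranking
// with reciprocal rank fusion (RRF), then hand the top results to a re-ranker.
type Ranked = string[]; // document IDs, best first

function reciprocalRankFusion(rankings: Ranked[], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Example: IDs are placeholders for chunks of the uploaded document.
const vectorHits = ['doc-12', 'doc-7', 'doc-3'];   // from semantic / embedding search
const keywordHits = ['doc-7', 'doc-21', 'doc-12']; // from BM25-style text search
const fused = reciprocalRankFusion([vectorHits, keywordHits]);
// fused.slice(0, 10) would then typically go through a cross-encoder re-ranker.
```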
[00:27:06]
Like all the startups that did that, I can't believe, yeah, they're probably dead now. It's not that difficult, but that's because you would never use those agents for mission-critical things, right? Those are chatbots that you type something into, and you wait, and you see it respond. You would never just have that thing go off and do something fully autonomous, right?
[00:27:30]
Take the agency slider and turn it all the way up? Not that thing. That thing is not going to be good. It is not. It is too generic to be really good at the one thing you want it to be really good at. Somebody has to sit down and eval the hell out of that thing and optimize everything, from the types of files that you upload, to how they're retrieved, to the different examples in the system prompt, few-shot, multi-shot, different stuff like that, for it to be somewhat reliable.
[00:27:58]
So the trade-off is that it's easy to make because you're going to use it on stuff that isn't going to get you in trouble, that isn't going to lose the company money. But if you want to do something like, say, we want to make a full-time SRE agent that sits in the background, watches incident logs and errors and stuff like that, and helps the on-call people do stuff, you're not going to make it with that.
[00:28:23]
Like, that shit is not going to work. That is a whole product; that's someone's startup, just making that, because the amount of this work that's going to go into it, and the talent you need on your team that understands, you know, site reliability engineering and that type of thing, it's just massive. You just couldn't do it that way. You couldn't just be like, here's all the documentation in the world on being an SRE, you know, kung fu, go; it's not going to do it, right?
[00:28:59]
It might do it one time, maybe, and it's never going to do it again. So the reason they allow you to do that stuff is because it's OK if that thing doesn't work: it's cheap, it's easy, it's not put in a place of high criticality, and if it messes up, you can just prompt it again and you're OK with it. So that's why those things are slightly different, but you can build that on top of this for sure.
[00:29:29]
I was wondering about those tools. You know, we give them those little text descriptions, and that, I imagine, gets vectorized, or the LLM is using that. But at that level, on the RAG side, can we get more sophisticated about what we dump in there? I mean, can we give it a vectorized representation, like, oh, I ran this and this works for ChatGPT, so let's give it that, to try to put more data into the matching process than just a text description for tool selection?
[00:30:03]
Uh, there are so many strategies around this, I don't even know where to start. So, one of the things I did at one point, I forget where I did this, but we had a problem with too many tools, so I did something very similar to what you described, which was, OK, let's make tool selection dynamic so the agent can install tools, right? So I gave it a toolbox. I gave it two tools, search for tool and install tool; those are the only tools that it had. And then what I had to do was create some system internally in that agent that was really good at taking a user input
[00:30:35]
and figuring out what tools it would need, and then using this search tool back and forth to kind of poke and prod: oh OK, I see what tools we have available, it looks like we're going to need this Gmail read tool, this other tool, and this other tool. So it would return the tool IDs for those, and then the agent would be like, oh, this thing returned all the tool IDs, install these tool IDs. And all that would do is take those tool IDs and save them in the database for that run, right?
[00:00:00]
And then every time it looped, it would look in the database for those IDs, go to the toolbox, grab the tool definitions, and add them to the tools object for that run. So they were always there dynamically, the agent could use just those tools, and if it didn't need one anymore, it would just remove it from the database. So that was one approach. And that search was a vector search, so it would be a semantic search across the descriptions, but I would also add in examples with the tools: not just the description, the inputs, and the descriptions of the inputs, but here are some examples of things people have said that would make you want to pick this tool, right?
[00:00:00]
I would give it examples so the agent can search across all of that and be like, oh, here are the top 5 tools I think you need. So at any given time there are only, like, 5 tools loaded versus 50 or something like that, right? But there are many ways to do that, for sure, yeah.
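Here is a compressed sketch of the dynamic toolbox pattern Scott describes: the agent starts with only search and install meta-tools, search matches over descriptions plus example user phrasings, and "installing" just records tool IDs for the current run so the loop can load only those definitions. Every name and shape here is hypothetical, and the word-overlap search is a stand-in for the real semantic (embedding) search he used:

```ts
// Dynamic toolbox sketch; all names, shapes, and the naive "semantic" match are hypothetical.
type ToolEntry = {
  id: string;
  description: string;
  examples: string[]; // things users say that should make the agent pick this tool
};

const toolbox: ToolEntry[] = [
  { id: 'gmail_read', description: 'Read emails from Gmail', examples: ['check my inbox'] },
  { id: 'calendar_create', description: 'Create a calendar event', examples: ['book a meeting tomorrow'] },
  // ...dozens more in a real system
];

// Stand-in for a real embedding search over description + examples:
// here it just counts overlapping words so the sketch stays self-contained.
function searchForTool(query: string, topK = 5): ToolEntry[] {
  const words = new Set(query.toLowerCase().split(/\s+/));
  const score = (t: ToolEntry) =>
    [t.description, ...t.examples]
      .join(' ')
      .toLowerCase()
      .split(/\s+/)
      .filter((w) => words.has(w)).length;
  return [...toolbox].sort((a, b) => score(b) - score(a)).slice(0, topK);
}

// "Install" = remember tool IDs for this run; a real system would use a database.
const installedByRun = new Map<string, Set<string>>();

function installTools(runId: string, toolIds: string[]) {
  const set = installedByRun.get(runId) ?? new Set<string>();
  toolIds.forEach((id) => set.add(id));
  installedByRun.set(runId, set);
}

// On each loop iteration, expose only the installed tool definitions
// (plus the two meta-tools, search_for_tool and install_tool, not wired up here).
function toolsForRun(runId: string): ToolEntry[] {
  const ids = installedByRun.get(runId) ?? new Set<string>();
  return toolbox.filter((t) => ids.has(t.id));
}
```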