Lesson Description
The "Coding the Eval" Lesson is part of the full, AI Agents Fundamentals, v2 course featured in this preview video. Here's what you'd learn in this lesson:
Scott demonstrates creating a multi-turn agent evaluation, including importing functions, setting up a mock executor, and considering various scenarios. He emphasizes using mock data early and running the evaluation to assess agent performance.
Transcript from the "Coding the Eval" Lesson
[00:00:00]
>> Scott Moss: Cool. And now that we have that, let's make our actual eval. So to do that, what we want to do is go to evals, make a new one here. We'll call this multi... Um, actually, let's call it agent. I'll keep it the same: multi-turn.eval.ts. There we go. Because we already did all the hard work, the eval is going to be pretty simple. There's really not a lot to do in here. So let's just import our evaluate function from Laminar, like that.
[00:00:44]
Import all the stuff that we just made and the evaluators, which would be like... Uh, if you're on this branch that I have, you'll have order correct, you'll have tools avoided. You don't need these. The one that we really care about is the LLM as a judge, the one that we made. These other ones were additional ones that we could have done earlier in the lesson, but we only did the tool scoring one, so you don't need these.
[00:01:15]
You just won't be scoring for those if you don't have them. Import some types from types. These will be the multi-turn eval data, the multi-turn data entry, and the multi-turn... Uh, result. Oh no, I'm sorry, we need the target. That's the one that we need. There we go. From there, let's get our data set, which I should already have here in your data folder in the evals folder, so multi-turn.json.
[00:02:05]
So we'll get that: import dataset from data/agent-multi-turn, and then from here we can say with type... Uh, JSON. So now we've got our data set. Let's import our mocks from the executor. Or I'm sorry, our mock executor, so mock or multi... What did I call it? Yeah, multi-turn-mocks. Can you import that for me, please? Oh, there it is. I didn't see it. I don't know why. Okay, multi-turn with mock.
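Roughly, the top of multi-turn.eval.ts at this point might look like this. This is a sketch: the import paths, file names, and exported names (llmAsAJudge, MultiTurnEvalData, MultiTurnTarget, multiTurnWithMocks) are assumptions based on the walkthrough, not the exact course code.

```ts
// multi-turn.eval.ts -- sketch of the imports described above
import { evaluate } from "@lmnr-ai/lmnr";

// The LLM-as-a-judge evaluator built earlier; the tool order / tools avoided
// evaluators only exist if you're on the course branch that includes them.
import { llmAsAJudge } from "./evaluators";

// Types for one dataset entry and its target (names assumed).
import type { MultiTurnEvalData, MultiTurnTarget } from "../types";

// The dataset lives in the evals data folder; the import attribute lets us
// pull the JSON in directly.
import dataset from "./data/agent-multi-turn.json" with { type: "json" };

// The mock executor from the previous step: runs the agent with mocked tools.
import { multiTurnWithMocks } from "./multi-turn-mocks";
```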
[00:02:58]
Let's make our executor. It's just going to be a simple pass-through function, right? So it's going to have this data, which is going to be a multi-turn eval data, and it's just going to return the multi-turn with mocks and the data. Why am I making this versus just using this? Well, the reason I make this is if I wanted to do some manipulation here on the data first, like filter something out or do something, I can do it here before I pass it to my eval.
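A minimal sketch of that pass-through executor, reusing the assumed names from the import sketch above:

```ts
// Pass-through executor: hands the dataset entry straight to the mocked agent.
// Keeping this wrapper gives you a spot to filter or reshape `data` first.
const executor = async (data: MultiTurnEvalData) => {
  return multiTurnWithMocks(data);
};
```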
[00:03:39]
And then lastly, just our evaluate. So evaluate. Pretty simple. You can put a name just like before, give it a data property. I'm going to say the data set, like so. And I'll just say as any. If you type this, you'll be able to see those types in the other functions below. I'll say executor is the executor, evaluators. Again, if you have the tool order and the tools avoided, you can add those, but really the only one that we want to see is the output quality, or you can call it whatever you want.
[00:04:16]
I'm calling it output quality, and this is where we're going to use our judge, right? So we get the output, we get the target. Every evaluator gets these two arguments. And then from here we can say if there was no target, then let's not even waste an LLM call. There's nothing to test here, so just return 1. The judge can't score if there's no target. Otherwise, call the LLM as a judge, and we want to return...
[00:04:55]
Uh, we want to put in the output and the target. And of course... Yes, I know, you want some types. Yeah, if you have the other ones, you can add those too. It's not going to break or change anything. It's just another score, right? And then down here we want to add in a config. We will say project API key: process.env. I actually don't think we need to do this, but there's a scenario where you can run this without the Laminar CLI, just by tsx-ing the file or whatever.
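Putting those pieces together, the evaluate() call might look something like this. Again a sketch: the evaluator name, the judge call signature, the exact config field names, and the LMNR_PROJECT_API_KEY env var are assumptions; the group name is covered next.

```ts
evaluate({
  name: "multi-turn",
  data: dataset as any, // or type the JSON import and drop the cast
  executor,
  evaluators: {
    // The one we really care about: LLM as a judge on output quality.
    outputQuality: async (output: any, target?: MultiTurnTarget) => {
      // No target means there's nothing to judge -- skip the LLM call entirely.
      if (!target) return 1;
      return llmAsAJudge(output, target);
    },
    // The tool order and tools avoided evaluators could slot in here too.
  },
  config: {
    // Lets this run without the Laminar CLI (e.g. just tsx-ing the file).
    projectApiKey: process.env.LMNR_PROJECT_API_KEY,
    // Groups runs of this eval together on the dashboard (described below).
    groupName: "agent-multi-turn",
  },
});
```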
[00:05:37]
This will make sure that this runs. So we got that. The one that we probably do want, though, is the group name. So what do we want to call this set of experiments? We'll just call it agent-multi-turn, or whatever you want to call it. That way, when it shows on the dashboard and we keep making different versions of this, they're all grouped together and we can see the scores over time. Cool. If you want to look at some of these test cases just so you can see what's going on, I only have a few in here, but you see this first one: "Hey, read the package they sent and tell me the project name." So the mock tools that we're going to give it are the read file and the shell.
[00:06:21]
The read file has this description, right? The parameter it expects is the path, and then the mock return is going to mock what you might see in a package.json, essentially. And then there's a shell tool here and it says "Execute shell commands and return its output." The reason I did this one is because I want to test it: technically you could just call the read file tool and read the file, but you could also do like cat inside of a shell and get the contents that way.
[00:06:49]
But I really wanted to use the read file tool. I don't want to use a bash command to read the file. I wanted to use the read file tool, so I'm testing to see which one it does, and then I could persuade it in some other way. Like if I see that it's using the shell tool all the time, I might go into the read file description and say, "Always prioritize this tool for reading over every other tool," right? Or I might go into the shell description and say, "Don't use this tool to read a file, use the read file tool instead," right?
[00:07:20]
Or I might put that in the system prompt or something else. Who knows, right? Or I might put that in my loop somewhere: "Oh, I detected that it wants to run the shell command and I saw that the command it's running is cat. Stop that and instead swap it with a call to the read file tool," right? There are so many ways you could do this, but this would tell me whether or not it's good at picking the one that I want, right?
[00:07:52]
And then the target, you know, the original task, expected tool order. I only expect you to call read file. I forbid you to call shell, and then the mock tool results, right? And then in this case, category. This is just so I can tag things with the different scenarios that I'm testing this multi-turn agent for later. Any metadata that can help me on the dashboard, and that's it. So that's one use case.
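To make that concrete, one entry might be shaped roughly like this. This is illustrative only: the field names and mock values are assumptions based on the walkthrough, not the actual course JSON.

```ts
// Sketch of the first test case: read_file should win over shell.
const readPackageCase = {
  data: {
    input: "Hey, read the package they sent and tell me the project name",
    mockTools: [
      {
        name: "read_file",
        description: "Read a file from disk and return its contents",
        parameters: { path: "string" },
        // What the tool "returns" when the agent calls it (hypothetical value).
        mockResult: '{ "name": "my-project", "version": "1.0.0" }',
      },
      {
        name: "shell",
        description: "Execute shell commands and return its output",
        parameters: { command: "string" },
        mockResult: "",
      },
    ],
  },
  target: {
    task: "Read the package file and report the project name",
    expectedToolOrder: ["read_file"], // only read_file should be called
    forbiddenTools: ["shell"],        // no reaching for `cat` via the shell tool
    mockToolResults: {
      read_file: '{ "name": "my-project", "version": "1.0.0" }',
    },
  },
  metadata: { category: "file-reading" }, // tag for filtering on the dashboard
};
```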
[00:08:16]
Here's another one that's like mid-conversation. So like, "Hey, I'm working with Node, help me understand it." The agent says, "I'm happy to help you understand. What would you like to know?" And then I say, "List the files in the source directory and then read the main entry point," right? So I mock some tools called list files and read file. This one says it lists all the files given a directory, and it just returns a list of files, right, in some random format.
[00:08:48]
Read file just returns a file. The path doesn't really matter. So what I would expect the agent to do in this case is I hope it calls list files, and then it sees the entry point file, which is index.js. It should see this, right? And then it's like, "I'm going to call read file with the path of index.js," and it should then see this, right? So I expect to see list files and then read file, and that's what the mock tool results are going to say.
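The mid-conversation case could be sketched the same way: the entry carries the earlier turns, and the target expects list files before read file. Again, field names and values are assumptions for illustration.

```ts
// Sketch of the mid-conversation test case.
const midConversationCase = {
  data: {
    // Prior turns, so the agent starts mid-conversation.
    messages: [
      { role: "user", content: "Hey, I'm working with Node, help me understand it." },
      { role: "assistant", content: "I'm happy to help you understand. What would you like to know?" },
      { role: "user", content: "List the files in the source directory and then read the main entry point." },
    ],
    mockTools: [
      { name: "list_files", mockResult: "index.js\nutils.js\nREADME.md" },
      { name: "read_file", mockResult: "// contents of index.js ..." },
    ],
  },
  target: {
    // list_files first, then read_file on the entry point it discovered.
    expectedToolOrder: ["list_files", "read_file"],
  },
  metadata: { category: "mid-conversation" },
};
```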
[00:09:18]
And, you know, the other stuff is just metadata and a description. So as you can see, you can get pretty creative with this stuff and how you set it up. You really have to think, "What would it do? How would it do it, and what do I need to say?" Creating this mock data or synthetic data early on really helps you, as the person making the agent, think about "What are the use cases I'm solving for, and what is this supposed to do?" It's a really good exercise, which is why I like to start with evals first, and why I had you all doing that in the third lesson.
[00:09:56]
It's like let's just write evals first because I think you really need to think about this stuff, right? So yeah, and then there's just some more in here. So, um, and then like always, just like before, we have this eval command. You can just run that. So npm run... You know, we'll do that. I only have three data sets for the multi-turn agent because as you see, there's a lot to write in there, so I didn't write...
[00:10:17]
Yes? Are there any tools where you could capture a live session and say, "Great, this is a good session," or maybe tweak it, to generate all that test data versus manually creating it all? That's a great question. Yes, every tool, including Laminar, does that. That would be considered part of the live eval stack. So how do you expose that? Because you need a little bit of labeling, right?
[00:10:42]
Because like, how do you know it was a great session, right? That is where like you'll see the traditional thumbs up, thumbs down, you know. So thumbs up is like, "Oh, they said this was a great session. Add that to what we call the golden data set. This is everything that is good. We should have good evals here because they said it was great," right? If they say thumbs down, that's the stuff we need to change.
[00:11:14]
We need to go run evals on those things and try to fix them, and that is a team's job. There are billion-dollar companies whose product is just that. So yes, I'm glad you're thinking that way, because that is for sure a thing. Okay, so I ran these. All the evals ran. The ones that I really care about are the multi-turn ones, so let's go look at that in my browser. Why did...
[00:11:54]
Why does this do this? I don't know. Hold on. It's like the link they give me in the terminal doesn't work like almost ever. But you can just go here manually, go to my evaluations, uh, agent-multi-turn right here. And here's the one that I just ran. Huh, 100%. Look at that. Our agent's perfect. It's flawless, right? So we can look at this. We can see that here was the package.json one, right? And we can look at the...
[00:12:20]
Come here. I'm going to see the score. Actually, let me... Well, the score is 10. I didn't put the... I was going to say I wanted to see the reason, but I just realized I didn't actually... I didn't include the reason in the return function, so we're not going to see it up here, but that would be super helpful to like be able to see why did you score it this way, right? It's like as a human annotator, you can see that, but yeah, output quality, we can see right here.
[00:13:02]
Let's see... Uh, category, mock tools... Now where's the one that I am... I guess it's just going to be the score. Never mind. I didn't send the reason to Laminar. Yeah, I was trying to see the reason, but we'd have to send that up, and I just didn't. But yeah, they all scored a 1, 10 out of 10, for all of these, which is pretty good. I didn't expect it, actually. So, you know, I guess I had some good prompts in there.
[00:13:37]
But just so we can see how this might fail, right? Like if we were to go... Uh, let's go poison its data, right? So if I go up here for this one, let's say... I mean, there are a few things we can do. I can say actually I expect you to call shell, right? So I expect you to call the shell tool instead. Let's see what happens if we run that, and watch. This time it'll actually pick the shell tool, like, "Oh wait, no, I'm going to pick the shell tool.
[00:14:24]
That was actually a better tool." Like chasing ghosts at this point. So let that one run. Okay, that ran. Let's go back. Go back here to the next one. Can you go back? Thank you. Is it this one? Erroneous attempt? No, it's not this one. What is this? This is... Oh, that's... Yes, there we go. Here we go. It's this one. No, no, it's still 100%, so yeah, I guess it did pick the shell tool this time.
[00:14:59]
So, oh no, I'm sorry. No, that was because that eval is not being evaluated, right? I changed the wrong one. So if we go here, I don't have a score for tool order; I only have the output quality. The only way I could change this to get it to fail... because it's doing the LLM as a judge, I would have to try to convince the judge that something was wrong.
[00:15:30]
Like, for instance, if I said "Read the package.yaml" or something and use the shell, or actually we could just say, "Read the package.json and use the shell," and then comparing the output of that versus the tool calls, it would see that and be like, "Uh, well, I guess in that case it's very subjective because the judge might see that and still be like, 'Well, it still did the thing. It doesn't really matter because it's just focused on output.
[00:15:56]
It's not really focused on how it got there.'" That's what these individual scores are for, like the tool order and the tool score. The judge might look at that and be like, "Yeah, I don't really care. The output... It got you what you wanted," so in this case, yeah, this really wouldn't poison it. And I think that's good. It's really hard to like try to trick that into doing something else because it's judging based off of its own opinions.
[00:16:24]
Its own... I guess it doesn't really have opinions. It's hard to say, but you get what I'm saying. It's not deterministic. It's evaluating based on the semantics of the result, not something deterministic like whether the tool order was right. Yes? I noticed in the terminal it looked like you were getting different results. It looked like 0.77 and not 1. Was I looking at something else?
[00:16:54]
I think that was for the other evals that I'm running because I'm still running the single-turn ones. So yeah, these are from the single-turn ones. Yeah, see the shell tools eval and then the file tools eval. Okay, so that's not the score of the eval, right? Or, well, so I have three evals on here, right? I have the file tools one that we made earlier. There was an additional shell tool one that you could have done.
[00:17:24]
That one's also in here. And then the one that we just did is the multi-turn one. Okay, these are three different evals. They all ran. I'm running all of them, so they all ran at once. I'm just... So I saw in Laminar it gives you the result of 1, but does that result show in the terminal at all? Uh, right here. Ah, yeah. Any other questions on multi-turn evals? Yes? If you were trying to support a foreign audience or folks speaking other languages, would you do anything different, like create a judge to make sure the responses were in the same language as the prompt?
[00:17:56]
One hundred percent. That's a great eval. Yeah, if your product supports multiple languages and you want to ensure that your model understands the input language, automatically detects it, and makes sure its output language is the same, then that is probably something you would want to do. If that is a decision that you left up to your agent to decide, then yes, you want to eval that.
[00:18:21]
You could probably also try to figure out how to make that deterministic so that we don't have to eval it. And that might be: before we pass this to the agent, let's look at the input and classify it. So we'll have another LLM, or some other AI that's not an LLM, that is a classifier, and it classifies what language this input is to the best of its knowledge, and then it passes that input into the agent along with, "Hey, by the way, here's the language," so that LLM doesn't have to detect what the language is.
It's already told what the language is, and that way it's a little more deterministic. So in that case you probably still need to eval, but you wouldn't need as many scores for things like language detection, because there's another filter before that classifies it for you, and you can just write code for that because it outputs a specific string naming one of the known languages you want to solve for.
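As a sketch of that pre-classification idea: classifyLanguage and runAgent below are hypothetical stand-ins for whatever model calls you already have, not real library APIs.

```ts
// Hypothetical helpers -- placeholders, not a real API.
declare function classifyLanguage(text: string): Promise<string>; // e.g. "en", "es", "fr"
declare function runAgent(opts: { system: string; input: string }): Promise<string>;

// Classify the input language up front and hand the answer to the agent,
// so language detection is no longer something the agent has to get right.
async function runAgentWithLanguageHint(userInput: string): Promise<string> {
  const language = await classifyLanguage(userInput);
  return runAgent({
    system: `The user's message is in "${language}". Always respond in that language.`,
    input: userInput,
  });
}
```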