Creating an Eval

Netflix

Lesson Description

The "Creating an Eval" Lesson is part of the full, AI Agent: From Prototype to Production course featured in this preview video. Here's what you'd learn in this lesson:

Scott codes an eval for the Reddit tool. The eval's task calls the runLLM method with the input. Since the word "reddit" is contained in the input, eval is expected to call the Reddit tool.

Join Now

Preview

Transcript from the "Creating an Eval" Lesson

[00:00:00]
>> Scott Moss: Let's make our reddit eval. So essentially all the hard work is done because I already made this run eval function. We just have to call it with some data and our task. So let's do that. So I'm gonna go to our code I'm gonna go into evals here.

[00:00:19]
And where did I put this? This is, it's on the root, okay. Stop going there, thank you. So here I'm gonna say reddit and typically I like to do a dot eval extension. And yet you actually will have to do that. If you want the script that I built to run it for you, you have to do dot eval.

[00:00:38]
So if you don't do that, you cell phone get ran.
>> Scott Moss: And really, all we wanna do here again is just call this run eval with our experiment, name our task, our data, and our scores. So let's do that. So I'm gonna say, runEval that's gonna import that I'm gonna give an experiment name.

[00:01:01]
I typically just give it the same name as the file. You can call it whatever you want. It's gonna take an object here. The first thing is gonna be the task. The task is an async function that always takes in the input as its argument. So that's really all you need, it's an input.

[00:01:19]
So what am I gonna do with this? I'm gonna call our, I have an LLM a runLLM function here. It does exactly what it sounds like. It runs an LLM given an input. So that thing takes in a messages like this so you can pass in some new messages.

[00:01:36]
We just wanna make a new message here that is role of user, and the content is gonna be whatever the input is.
>> Scott Moss: So run this LLM with this new message, essentially.
>> Scott Moss: The other thing is we need to pass the actual tool definition to this LLM. So how could it call, if we're testing whether or not it were able, it was able to call the reddit tool, we have to tell the LLM that it has a reddit tool.

[00:02:13]
So we want to be able to do that. So the way we can do that is we can pass in a tools array here like this. And it's always gonna be an array. And I already have the reddit tool definition. You can import that from source tools reddit, so I'm gonna pass that in here to the tools array.

[00:02:33]
So we have our task. The next thing is, we need our data, so I'm just gonna put one in here. So remember, the data is an object with the input and then the expected. So I'm gonna make it an object here. I'm gonna say the input. I'm gonna write a user input that I think should trigger the reddit tool to be called.

[00:02:51]
So I'm gonna say, find me something interesting on reddit. I mean, I'm literally mentioning the name reddit here. So I would hope that it would go look at reddit, right? So that's the input. What do I expect to get back? Okay, so the way the expect, the way the expectation works, the value of your expected has to match the shape of whatever this returns.

[00:03:23]
So runLLM, if we go look at it, it returns a message. So it returns an object. So our expected value should also return an object, because we need it to match whatever the task returns, cuz that's what it's checking against. If they aren't the same format, it can't check against them, right?

[00:03:42]
So whatever your tax returns, whatever the shape of that, if it's a string, then you're expected as a string. If it's an object, then you're expected as an object. In our case, it's a message. So which is an object with a roll tool calls content whatever whatever. So we need to make sure that we do that as well.

[00:04:00]
So the expected. In this case, I expect to give back a message of role assistant that has a tool call array and a function whose name is reddit. I expect to see that because I asked for reddit here, so I expect the AI to call a tool with reddit.

[00:04:18]
So I have a helper function here called create tool call message, that'll just do that for us. So I'm gonna use that, actually, I'll, I'll just write it out. So create tool call message, and what does it do? Yeah, it just takes in the tool name, so,
>> Scott Moss: Which is a string, and it's just gonna return that shape of the object that I talked about.

[00:04:44]
So role assistant, tool calls, type function, function, name, tool name, right? So nothing crazy, so let's do that. It's gonna turn the message, role assistant, there we go. Tool calls, there's an array of objects. Inside of this object, there is a function that has a name that is whatever the tool name you passed in.

[00:05:16]
And I think there, yeah, I need type function, there we go, type function. So now we have that function. Now, when I go down to expected, I can just say, yeah, create a tool call message with the reddit definition.name. So that's our one input. For every one data object you have in here, this will be called one, one time, right?

[00:05:47]
So if I have 30 in here, this will be called 30 times. I'll have 30 runs for this one experiment. And the score is the average of amongst all those runs, right? So we're just gonna put one, we're not going to get crazy. But as you can imagine, if these were real user ones, you're gonna find out real quick how good your system is or how good it isn't.

[00:06:03]
And then you'll be able to find those. You can take it to the next level like, okay, all the ones that got a zero. Find me the common words across all those user inputs so I can get a word cloud. And then now, I know what words I need to add to my prompts and my instructions so it can catch them next time, right?

[00:06:20]
You just got to get really creative with it, it's like some detective stuff. Okay, so now that we have our data, the next thing we want to do is our scores. So we can go down here, and we can say scorers, and we're going to import our tool called match score here, so let's do that.

[00:06:47]
Import tool called match scores. Bring that down here. Wait, why is this thing freaking out? I forgot a comma, there we go.
>> Scott Moss: So now you should have a file that looks like this. RunEval reddit,
>> Scott Moss: Task is this input, we have this user here. What the input is the content we're gonna check to make sure that the AI has access to this reddit tool.

[00:07:17]
The data we want to test is find me something interesting on reddit. We expect it to return a message that has a ToolCall to the same name, and we're gonna check it against a ToolCall score. So now let's run this, and hopefully this doesn't burn. [LAUGH] So to run this, you can do npm run eval, and then the name of that file.

[00:07:36]
So in this case, the name of my file is reddit, should be able to run ads. And what did it say? Did I misspell my file, maybe reddit? You know what, you have to put it in an experiments folder, that's why. Sorry about that. Make sure you put these experiments in an experiments folder, then it'll catch it.

[00:08:03]
I forgot I did that. Okay, so now that it's in the experiments folder,
>> Scott Moss: I believe it'll work now.
>> Scott Moss: Cool, and there it is. So our previous score was 1, our current score is 1, the difference is 0. It's saying that because I already have some scores in here.

[00:08:22]
So what you can do is if you go to the root of your repo, there's a results.json there. I have tons of stuff in here. I'm just gonna clear this out, okay, hold on. I'm gonna clear this out, and I'm just gonna put an empty experiments array in here like this.

[00:08:45]
So that way, mine is clear. That way, there's no comparison, it should always just be this first one should just be the first one. Yours will look different than mine cuz you didn't have any, so I'm gonna run that again. And yours probably looks something like this, where it said, experiment reddit, previous score is 0, current score is 1, the difference is +1.

[00:09:05]
Yours might look something like this.

Learn Straight from the Experts Who Shape the Modern Web

In-depth Courses
Industry Leading Experts
Learning Paths
Live Interactive Workshops

Get Unlimited Access Now