The "Handling Evals on Subjective Inputs" Lesson is part of the full, AI Agent: From Prototype to Production course featured in this preview video. Here's what you'd learn in this lesson:
Scott creates two more evals: one measures the dadJoke tool, and the other measures the generateImage tool. Since users may refer to an image as a "photo," the tool description is updated so the system handles both phrasings. This demonstrates the iterative process of improving accuracy.
Transcript from the "Handling Evals on Subjective Inputs" Lesson
[00:00:00]
>> Scott Moss: All right, let's make the other evals so we can see some more experiments going. So, I'm going to make a new experiment here in my experiments folder. Let's make one for the dad jokes. So I'll say, dadjoke.eval.ts, just like that. And because it's pretty much gonna be the same thing, I'm just gonna copy what we had here in the Reddit one, the whole file, and just paste it into the dad joke one.
[00:00:38]
And I'm just gonna swap out the Reddit tool definition for the dadJokeToolDefinition, and then switch up the inputs, obviously. So instead of Reddit, this will be dadJoke. Instead of this, this would be dadJokeToolDefinition.
>> Scott Moss: I'm gonna change my experiment name to dadJoke.
>> Scott Moss: And then I'm going to
>> Scott Moss: Change my input here to be, tell me a funny dad joke, and I'll get rid of this other one.
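For reference, a sketch of what an experiment file along these lines might look like is below. The runEval helper, the ToolCallMatch scorer, the runLLM signature, and the import paths are assumptions based on how the Reddit experiment is described here, not the exact code from the course repo; the dad_joke tool name matches what the tool gets renamed to later in the lesson.

```ts
// evals/experiments/dadjoke.eval.ts -- sketch; helper names and paths are assumptions
import 'dotenv/config'
import { runLLM } from '../../src/llm'
import { dadJokeToolDefinition } from '../../src/tools/dadJoke'
import { runEval } from '../evalTools'
import { ToolCallMatch } from '../scorers'

runEval('dadJoke', {
  // The task just asks the model which tool it would call for a given input.
  task: (input: string) =>
    runLLM({
      messages: [{ role: 'user', content: input }],
      tools: [dadJokeToolDefinition],
    }),
  // One input means the experiment score can only ever be 0% or 100%.
  data: [
    {
      input: 'tell me a funny dad joke',
      expected: {
        role: 'assistant',
        tool_calls: [{ type: 'function', function: { name: 'dad_joke' } }],
      },
    },
  ],
  // Scores 1 if the assistant message calls the expected tool, 0 otherwise.
  scorers: [ToolCallMatch],
})
```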
[00:01:31]
>> Scott Moss: So I hope my system will call the dad joke tool if someone says, tell me a funny dad joke. If not, then it's pretty bad. In fact, we should go sabotage our tool to make it not respond to dad jokes, just so you can see how we can improve it.
[00:01:48]
So let's try that after we run it, so you can see the process. We talk about evals, but then, like, what do you do when the evals aren't good? So we'll talk about how to improve that. But for now, let's run it.
[00:02:07]
So the way you run it, again, if you're on Windows, you do npx tsx evals/experiments/ and then the name of that file, so in this case, dadjoke. Make sure you have your import 'dotenv/config' at the top. If you're not on Windows, you don't have to do any of that.
[00:02:27]
You can just run npm run eval dadJoke. I'm gonna do it the Windows way because it also works on Mac, so that way everybody can see it. So I'm gonna run dadJoke. And of course, I got 100% because I literally said in the input, tell me a funny dad joke.
[00:02:44]
There's no way you can mess it up, given the instructions that we have. Let's go look at our dashboard to see how we can view those experiments. I'll just refresh this right quick. There should be another dadJoke entry in the dropdown here, and you can see the only one that I ran had 100%, so there it is.
[00:03:07]
>> Scott Moss: Now, let's sabotage our dad joke tool right now. It has some pretty good instructions in the prompt, I think; it says, get a dad joke. So it knows to get a dad joke. But what if I wrote a really bad description here, like, use this tool to get the weather? Literally a bad description.
[00:03:32]
But my eval still expects the input, tell me a funny dad joke, to call that tool. Do you anticipate the LLM calling this tool when it thinks the tool is for getting the weather and the user asks for a dad joke? Probably not.
[00:03:52]
So I would expect, when I run this, that it's gonna fail. Actually, it still did it. [LAUGH] I don't know why it still did that. Maybe it's the name. Let's try to sabotage it here. Let's call it weather, something like that. Let's see what happens. I did not expect that.
[00:04:28]
>> Scott Moss: Okay, yeah, I guess the name does have a lot of influence on that. And I think that's because I'm using 4o mini. 4o mini is really good. I think if you did that on 3.5, it would not get it. So let's say you had a poorly named tool and a poor description, but you still expected it to pick the right tool.
[00:04:48]
So this eval would tell you that. This eval would be like, wow, it didn't get it. So what do I need to improve here? Well, if we're detecting whether or not the AI is picking the right tool, and the only way to influence whether or not the AI can pick the right tool is the name of the tool that you give it, the description of the tool that you give it, and then some other instructions in the system prompt, well, guess what?
[00:05:13]
Those are the only things you can play around with to improve this. So once you see that score, you come in and go, okay, well, I'm gonna call this dad_joke, let's not call it weather, and then I'm gonna say the description is, returns a random dad joke.
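As a rough sketch, assuming the tool definitions follow an OpenAI-style function-tool shape (the exact object in the course repo may differ), the two levers being adjusted here are just the name and description fields:

```ts
// src/tools/dadJoke.ts -- sketch; the real definition in the repo may look different
export const dadJokeToolDefinition = {
  type: 'function' as const,
  function: {
    // Was sabotaged to 'weather' above; restoring a descriptive name fixes the eval.
    name: 'dad_joke',
    // Was sabotaged to 'use this tool to get the weather'.
    description: 'Returns a random dad joke.',
    parameters: { type: 'object', properties: {} },
  },
}
```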
>> Scott Moss: I would make my improvement.
[00:05:35]
If you want historical references for your improvements, you technically could just make another experiment in here and call it dadJoke V1 or V2, and make a whole other experiment, so you can see all the improvements you had over time. The other thing you could do, in our eval tool, I'm not doing it here, but this thing can also log the metadata associated with it.
[00:05:57]
So, what was the system prompt? What was the tool call definition? We can log all of that at that time, so you can go see over time what the tool definition, the tool name, and the system prompt were when you ran that experiment, cuz you're gonna be changing those things over time, so you can see what not to do, right?
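He isn't doing this in the course code, but a minimal sketch of logging that metadata could look like the following. The saveExperimentRun helper and the results.jsonl file are hypothetical; the point is just to persist the system prompt and tool definitions next to the score for every run.

```ts
// Hypothetical helper: append each experiment run, plus the prompt and tool
// definitions that produced it, to a JSONL file so history is never overwritten.
import fs from 'node:fs/promises'

type ExperimentRun = {
  experiment: string
  score: number
  ranAt: string
  metadata: {
    systemPrompt: string
    toolDefinitions: unknown[] // the exact tool JSON passed to the model
  }
}

export async function saveExperimentRun(
  run: ExperimentRun,
  file = 'results.jsonl'
) {
  await fs.appendFile(file, JSON.stringify(run) + '\n')
}
```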
[00:06:13]
And then you can actually take all of those and ask an LLM to write you the most optimized one: here are all the results of my evals, with the system prompts and the tool call names and definitions and the scores they got; write me the most optimized version that would give me the highest score. And it would write you the right name and the right description and the right system prompt to maximize that score.
[00:06:36]
So this is why it's very important to log this stuff and keep it. So, yeah, you would make that improvement. I'm gonna make that improvement, I'm gonna run this again on the same experiment, and of course I expect it to go up. So that is the process: running your evals, figuring out what in your chain, what in your LLM setup, has the influence to make a difference.
[00:07:02]
Make that change, run your eval again, see what happens, right? And then also adding more data. So now, if we go back to here, we go back to dadJoke, yeah, there was 100, 100, then it went to 0, then it went back to 100. I think this will only show the last 10 sets, the last 10 runs.
[00:07:21]
It won't show more than that, I don't think. Let's do one for generate image. So I'm gonna say, generateImage.eval.ts. It's very similar, so I'm just gonna copy it, bring it in here, paste it. Import from generateImage, select the dadJokeToolDefinition, and change that to generateImageToolDefinition. I'm gonna need a new experiment name.
[00:07:59]
Gonna call this generateImage, call it whatever you want. Cursor is already telling me that input's jacked, so I'm just gonna go ahead and take that suggestion, and I'm gonna say, generate an image of a sunset. So I expect, when a user says generate an image of a sunset, for this to actually call the generate image tool.
[00:08:19]
Now we're not actually running the functions. We're just seeing if the AI is telling us to run the function, right? So let's try that one. So now I'm gonna go here, I'm gonna run the generateImage eval. And of course I get 100% because I explicitly asked it to do that.
[00:08:42]
So, you might get a user saying... let's say you have some analytics and you anonymized them, so you can kinda see what people are saying, but you don't know who they are. And then you saw that someone said, take a photo of the sunset. Would you expect that to generate an image?
[00:09:07]
I don't know, it depends on what you want out of your app. If you want your app to be clever enough to know that taking a photo also means generating an image, and you didn't think of that ahead of time, you would have had to wait for a user to say something like that.
[00:09:23]
It would have had to fail, you would have had to collect it, and then you would have had to address it, right? I actually don't know what that's gonna do, so I don't know. If it fails, we'll address it. If it doesn't fail, let's see what happens. It did fail.
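Here is a sketch of where the generateImage experiment ends up, using the same assumed helpers as the dad joke sketch earlier; the generate_image name is also an assumption. The second input is the photographer phrasing that fails until the tool description is updated below.

```ts
// evals/experiments/generateImage.eval.ts -- sketch; helper names and paths are assumptions
import 'dotenv/config'
import { runLLM } from '../../src/llm'
import { generateImageToolDefinition } from '../../src/tools/generateImage'
import { runEval } from '../evalTools'
import { ToolCallMatch } from '../scorers'

// Small local convenience for the expected "call this tool" message shape.
const expectToolCall = (name: string) => ({
  role: 'assistant',
  tool_calls: [{ type: 'function', function: { name } }],
})

runEval('generateImage', {
  task: (input: string) =>
    runLLM({
      messages: [{ role: 'user', content: input }],
      tools: [generateImageToolDefinition],
    }),
  data: [
    // Passes from the start: the user literally asks for an image.
    { input: 'generate an image of a sunset', expected: expectToolCall('generate_image') },
    // Fails until the tool description explains that a photo means an image.
    { input: 'take a photo of the sunset', expected: expectToolCall('generate_image') },
  ],
  scorers: [ToolCallMatch],
})
```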
[00:09:39]
So it doesn't think taking a photo is generating an image, but I want my LLM to be that clever. So what do I do? How do I address this? How do I improve this? Because let's say, for my LLM, my agent, the number one users are photographers.
[00:09:55]
So they always say, take a photo. That's why that's happening. Those are my number one users. So instead of telling them, like, hey, stop saying take a photo, I need to make my LLM handle take a photo.
>> Speaker 2: How do you handle it when eval-ing things that aren't just a boolean, call this tool or not?
[00:10:12]
It could be, is this structured output valid? These kinds of things.
>> Scott Moss: We'll get into that right after this. There's so much of that. You'll see why I had to do this first before I show you that cuz it's just a can of worms, it's just crazy. But we had to do this part first.
[00:10:32]
So, yeah, I have an LLM that is on a photography app and most of our users are photographers, and they love to say, take a photo when they actually mean generate something. So instead of telling them to say, stop saying that, I should just tell my LLM to handle that, right?
[00:10:46]
So how do I do that? Well, it failed, so I'm gonna go here. I can put something in this description: hey, taking a photo is the same thing as generating an image. Maybe I'll start with that, if I can spell photo right, there we go. I'll start with that.
[00:11:08]
I'll start with that description. I'll add that, and let's see if that's good. So I'm gonna rerun this.
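A sketch of that one-line change, again assuming an OpenAI-style function-tool shape rather than the exact object in the course repo:

```ts
// src/tools/generateImage.ts -- sketch; the parameter schema is illustrative
export const generateImageToolDefinition = {
  type: 'function' as const,
  function: {
    name: 'generate_image',
    // The fix: spell out that "taking a photo" means the same thing as generating an image.
    description:
      'Generates an image from a text prompt. Taking a photo is the same thing as generating an image.',
    parameters: {
      type: 'object',
      properties: {
        prompt: { type: 'string', description: 'Description of the image to generate' },
      },
      required: ['prompt'],
    },
  },
}
```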
>> Scott Moss: And it did. It improved it. So now, I feel pretty confident that if some user came in and said, take a photo, that it will call the generate image function.
[00:11:26]
And this is the process. You do this over and over and over and over. And this is why it never stops. You do this forever, until eternity, because people will just find new ways, especially if you're a chat app. If you're a chat app, every day you just go see what weird stuff people are saying, and you try to address it.
[00:11:46]
That's it [LAUGH]. It's really hard, but also kinda hilarious and fun. Somebody tried to jailbreak our AI the other day, and I was like looking at the stuff that they were trying to say, and it was so hilarious. It's so funny, yes.
>> Speaker 3: So if you look at the dashboard for your image-
[00:12:07]
>> Scott Moss: Yeah. Let's look at the dashboard.
>> Speaker 3: So I noticed that this one only went down to 50% and then back up, as opposed to when you look at dadJoke, it was very binary, zero and one. What's-
>> Scott Moss: What's the difference there? Yeah.
>> Speaker 3: Yeah.
>> Scott Moss: So I'm pretty sure it's because I had different amounts of inputs, right?
[00:12:31]
So I think on the dad joke, I only had one input, so it was always 100% or always zero. But I think on generate image, I had two inputs, so it could be 50%.
>> Speaker 3: So it takes the average of the results.
>> Scott Moss: The average of the results, yeah.
[00:12:47]
>> Speaker 3: And so it's actually running two-
>> Scott Moss: Yeah, it's running two prompts.
>> Speaker 2: Yep.
>> Speaker 3: Against-
>> Scott Moss: Yep, and it gets the average of both of those.
>> Speaker 3: Gotcha.
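That averaging is simple enough to sketch in a couple of lines; the scores here are made up for illustration.

```ts
// With one input, the average can only be 0 or 1 (0% or 100%).
// With two inputs, one pass and one fail averages out to 0.5 (50%).
const perInputScores = [1, 0] // e.g. "generate an image..." passed, "take a photo..." failed
const experimentScore =
  perInputScores.reduce((sum, score) => sum + score, 0) / perInputScores.length // 0.5
```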
>> Scott Moss: Exactly. Yeah, I hope you guys see the power of just having evals in general, because if you didn't have them... So not only do you need the eval, you need to log what users are saying, and you need to save that somewhere so you can do this part. Otherwise, there's literally no way you'll ever improve that part
[00:13:14]
of your system. It's like, how would you have known that this is what people were saying? It's so random. So you have to be able to do that, but once you do, it's like, okay, the first level of improvement is prompt engineering, and that's what we've been doing. But there's many levels of improvement.
[00:13:30]
Prompt engineering is just the first thing you do.
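A minimal sketch of that kind of logging, with a hypothetical logUserMessage helper writing to a JSONL file; where and how you actually store messages depends entirely on your app.

```ts
// Hypothetical logger: append anonymized user messages to a JSONL file so that
// surprising phrasings (like "take a photo of the sunset") can be turned into
// new eval inputs later.
import fs from 'node:fs/promises'

export async function logUserMessage(
  content: string,
  file = 'user-messages.jsonl'
) {
  const record = { content, loggedAt: new Date().toISOString() }
  await fs.appendFile(file, JSON.stringify(record) + '\n')
}
```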