Lesson Description
The "Evaluators" Lesson is part of the full, AI Agents Fundamentals, v2 course featured in this preview video. Here's what you'd learn in this lesson:
Scott discusses evaluators, which score tool outputs against expected results, noting that deterministic JSON is easier to quantify than text. He demonstrates a tool selection score evaluator that compares expected and chosen tools to calculate precision.
Transcript from the "Evaluators" Lesson
[00:00:00]
>> Scott Moss: What we want to do now is make some evaluators. I already have one evaluator in here. An evaluator is essentially an assertion, right? It's the test that you would think of. It's the thing that turns something qualitative into something quantitative, right? So this thing looks at the output of something and the expected output of something, and it tries to give it a score.
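To make that concrete, here is a minimal sketch of the shape an evaluator could take; the type names are illustrative, not the course's actual code:

```typescript
// Illustrative sketch (hypothetical names, not the course's types):
// an evaluator takes an output and an expected value and returns a
// score, typically between 0 and 1.
type EvalResult = { score: number };

type Evaluator<Output, Expected> = (args: {
  output: Output;
  expected: Expected;
}) => EvalResult;
```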
[00:00:26]
That's what it's trying to do, right? So how do we do that? Well, tool calls themselves are structured outputs: when the LLM says call this tool, it always spits back a JSON object. If it's a JSON object, then we can write deterministic code against it. It's not spitting out a string, it's spitting out a JSON object, or more technically an array of JSON objects that say, here are all the tools I want you to call.
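For reference, a tool call payload roughly looks like this (assuming an OpenAI-style format; exact field names vary by provider):

```typescript
// Rough shape of a tool-calling response. The point is that it's
// structured JSON, not free-form text, so we can assert against it.
type ToolCall = {
  id: string;
  function: {
    name: string;      // which tool the model wants to run
    arguments: string; // JSON-encoded arguments for that tool
  };
};

// The model may request several tools at once.
type ToolCalls = ToolCall[];
```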
[00:00:49]
So that should be pretty easy to quantify, because we can write code against a JSON object. Now, if we were testing the output of the AI itself, that's where multi-turn is different, because multi-turn is not going to return an object. In our case, multi-turn returns text. It returns the results of the task that you told it to do, not an object. In that case, we would not be able to do this.
[00:01:11]
We would need another LLM or a human to evaluate that output, because it's text, it's language. We can't score language without intelligence, so we need some intelligence to score that language. But in the case of "did this tool get called, yes or no?", that's deterministic. Those are just objects we can test for. So this one right here that I already have, tool selection score, gives you a score on whether you picked the right tools, right?
[00:01:43]
So for instance, I look at the expected tools that I said this piece of data should have, right? If I go to this data, I can say the expected tools is just this run command, right? Then I look at the ones that the LLM actually selected, okay? And then I see how many hit, right? If the hits are one for one, you're going to get a 1, right?
[00:02:15]
If the hits aren't one for one, there's this thing called an F1 score. I don't know exactly how it works under the hood, but I've used it enough to find it pretty accurate. It gives you a score based on that precision. It's not a one-for-one thing, but it scales nicely from 0 to 1. So it gives you a score based on how many hits you got correctly, comparing the tool calls that were expected against the tool calls that were actually made, right?
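One way an F1-style tool selection score could be computed is sketched below; the function name and inputs are hypothetical, not the course's actual implementation:

```typescript
// Hypothetical sketch of a tool selection scorer: compare the tools
// the model actually called against the tools we expected, and fold
// precision and recall into an F1-style score between 0 and 1.
function toolSelectionScore(expected: string[], selected: string[]): number {
  if (expected.length === 0 && selected.length === 0) return 1;
  if (expected.length === 0 || selected.length === 0) return 0;

  const expectedSet = new Set(expected);
  const hits = selected.filter((tool) => expectedSet.has(tool)).length;

  const precision = hits / selected.length; // how many picks were right
  const recall = hits / expected.length;    // how many expected tools were picked

  // F1 is the harmonic mean of precision and recall.
  return precision + recall === 0
    ? 0
    : (2 * precision * recall) / (precision + recall);
}
```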
[00:02:43]
So this is technically called a scorer, but we're calling them evaluators, right? In our notes, I have examples for two more. We don't have to write those, I think it would take a little too long and I want to keep it moving, but we don't need them. I just wanted to show you different ways to score stuff, so I'll go over them briefly, but we're not going to write them.
[00:03:12]
So I basically have one called tools selected; in this case it just checks whether you actually selected a tool, and it tracks the order as well. The other one checks whether you avoided the right tools. So the target would have something on it called forbidden tools, tools I expect you not to pick. This is great for a negative case, in the sense of: tell me a joke about programming.
[00:03:47]
Well, I'm going to give the agent access to all these tools, but I expect the agent not to call any of them for this prompt, because I think this is a negative prompt. This is something we don't want to handle, right? So that's what that scorer is for. That scorer is checking: okay, take all the tools you said it should not have selected, and match them against the tools that it did select.
[00:04:17]
If any one of those is selected, it's a fail, it's a 0. It's a 0 or a 1; there's no F1-ish score here. If you selected any tool that I told you not to select, you fail. That's an immediate 0. Does that make sense? It's pretty simple, it's three lines of code, nothing crazy. As you can see, you can get really creative with these scorers. And this is just scratching the surface, and this is just single turn.
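A minimal sketch of that forbidden-tools check, with illustrative names rather than the course's actual code, could look like this:

```typescript
// Sketch of the pass/fail forbidden-tools check described above:
// if the model called any tool it was told to avoid, the score is 0;
// otherwise it's 1.
function toolAvoidanceScore(forbidden: string[], selected: string[]): number {
  const forbiddenSet = new Set(forbidden);
  return selected.some((tool) => forbiddenSet.has(tool)) ? 0 : 1;
}
```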
[00:04:44]
Multi-turn stuff is insane, absolutely insane. So again, this is someone's job: come up with metrics, figure out what matters, create them, update them, and as the agent gets more abilities, tools, and possibilities, revisit the scores. Are they still relevant? Are there new ones we need to introduce? It is literally science and art and a bit of guessing, like a lot of this stuff really.