AI Agent: From Prototype to Production

Eval Multiple Tools

Scott Moss

Superfilter AI


The "Eval Multiple Tools" Lesson is part of the full, AI Agent: From Prototype to Production course featured in this preview video. Here's what you'd learn in this lesson:

Scott creates an eval to test all the tools. The eval is populated with several data items and run. Depending on the score, modifications to the prompt may be required to achieve a 100% success rate. After creating the eval, some additional metric tools are discussed.


Transcript from the "Eval Multiple Tools" Lesson

[00:00:00]
>> Scott Moss: Let's make one more. This one's gonna be called all tools, and it's gonna be when we give the agent all the tools at once, cuz right now, we've only been giving it one tool to pick. What if we gave it all the tools to pick from and then checked to see if it still picks the right tool?

[00:00:19]
Given that it can pick many tools, I wanna determine if it can pick the right tool when given more than one option, right? Which is more real world. So let's do that. And I think this is mostly all the same code as well. Let me just double check.

[00:00:37]
I'm trying to be better at being consistent with the notes that I have here.
>> Scott Moss: Okay, yeah, it's mostly all the same code, so I'm just gonna copy what we already have. I'll just take the dad joke one, go to allTools, paste that in, like that. I need to import the rest of these tool definitions.

[00:01:05]
>> Scott Moss: So I got that.
>> Scott Moss: The next thing I need to do, I think I have a helper in here. No, it's the same thing. So really all I gotta do here is now I just need to pass in all the tools to the array. Don't worry about this movie search one.

[00:01:22]
We haven't made that yet. And then I need to make some inputs for that same scorer, though. So let's do that. I'm gonna say allTools is all three of those. To my array down here, I'm just gonna pass in allTools like that. Change the experiment name from whatever you had it to something new.

[00:01:49]
I'm calling mine allTools, and then from here it's just like, okay, let's come up with some inputs. I expect when I say, tell me a funny dad joke, that you'll call the dad joke one. Great, what else? I expect that when the user says, take a photo of Mars, that it will call the generate image tool definition.

[00:02:18]
And I expect, if I ask what is the most upvoted post on Reddit, that it'll call the reddit tool definition.
>> Scott Moss: And however many more you wanna do. Again, ignore the movie one. We didn't do that one yet, I forgot to take that out.
>> Scott Moss: Okay, let's run allTools. And let's see if this thing is smart or not.
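For reference, the allTools eval described above might look roughly like the sketch below. The helper names (runEval, runLLM, toolCallMatch) and import paths are stand-ins for the course repo's own helpers, not its exact API.

```ts
// A minimal sketch of the "allTools" eval, assuming hypothetical helpers
// (runEval, runLLM, toolCallMatch) that mirror the earlier single-tool evals.
import { runEval } from './evalTools'
import { runLLM } from '../src/llm'
import { toolCallMatch } from './scorers'
import {
  dadJokeToolDefinition,
  generateImageToolDefinition,
  redditToolDefinition,
} from '../src/tools'

// Give the agent every tool at once instead of just one.
const allTools = [
  dadJokeToolDefinition,
  generateImageToolDefinition,
  redditToolDefinition,
]

runEval('allTools', {
  task: (input: string) =>
    runLLM({
      messages: [{ role: 'user', content: input }],
      tools: allTools,
    }),
  data: [
    { input: 'tell me a funny dad joke', expected: dadJokeToolDefinition.name },
    { input: 'take a photo of mars', expected: generateImageToolDefinition.name },
    {
      input: 'what is the most upvoted post on reddit',
      expected: redditToolDefinition.name,
    },
  ],
  scorers: [toolCallMatch],
})
```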

[00:02:50]
>> Scott Moss: Oof, looks like two out of three passed. One of them failed. Let's go see, which one do you think failed?
>> Speaker 2: Generate image.
>> Scott Moss: Generate image, that's what I think too. I'm gonna run it again. Sometimes I like to run them many, many, many times and you get the average of all those runs but this was pretty consistent.

[00:03:08]
Look at that, it's always failing on the same one, so here's what I can go do. I'm gonna go look at the results, I'm gonna go all the way down here to see where it failed. That's a one on, which one is this? Reddit. And this is a zero on generate_image.

[00:03:27]
Yeah, it failed on generate_image. Even though we said, hey, take a photo means call this, it's failing. So then we would wanna go back and address this. Maybe this description wasn't good enough, so I'm gonna change the whole thing.

[00:03:50]
I'm gonna say, use this tool with a prompt to generate or take a photo of anything. Smaller prompt, but it kind of tricks the AI into thinking that it's doing something it's not, right? I could even change the name to be like, generate image or photo or something like that.

[00:04:19]
And that might help it, but let's start with that and see what happens. Or actually, I'm tripping, I need to put this down here. I was putting it in the wrong spot, that's why, hold on. Let's put that here, and then here it's just like, the prompt to use to generate an image or take a photo.
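In other words, the fix is to the tool's description and to the description of its prompt parameter. A rough sketch of the updated definition, assuming a zod-based tool shape (the repo's actual structure may differ):

```ts
import { z } from 'zod'

// Hypothetical shape of the updated tool definition; the key change is the
// shorter, more direct wording in the two descriptions.
export const generateImageToolDefinition = {
  name: 'generate_image',
  description:
    'Use this tool with a prompt to generate or take a photo of anything.',
  parameters: z.object({
    prompt: z
      .string()
      .describe('The prompt to use to generate an image or take a photo.'),
  }),
}
```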

[00:04:53]
>> Scott Moss: Cool, let's try that.
>> Scott Moss: Boom, there we go, self-healing, it's good. I'll run it again just to be sure. There we go, looks like that was really good. So let's go check out our dashboard, look at our visualization. Look at that, that's a pay raise. You show that to your boss, like, look.

[00:05:21]
We went from here two months ago, and now we're here. That's a pay raise, you could have statistical evidence of why you deserve more money with this chart. So there you go.
>> Jamie: Can you talk more about, like, real world scenarios that you get into with this?
>> Scott Moss: Yeah, let's do it.

[00:05:44]
Let's dive into something beyond just what we're doing. And we're only going to talk about it, we're not going to implement any of this stuff, okay? So let's do that. We're gonna go to this repo that I'm using called autoevals, so that's what we're actually importing. And here in autoevals, let me go to the root of this thing.

[00:06:10]
Here's a lot of the metrics and scores that it has. So I talked about LLM as a judge. Here are a lot of them, I can talk about some of them. So the battle one is essentially like getting the result of one thing from your system versus the result from an internal one and seeing which one's better.

[00:06:31]
And then if yours is better, then you get a higher score. Closed QA is basically testing whether or not the LLM, whatever model you're using, can answer the question based off of the data that it's already trained on, without having to use external data. You would basically be testing whether or not the LLM already knows the answer to something without you having to give it more context with RAG or a system prompt.

[00:06:56]
It should just know it. Like if I ask GPT-3.5 something about the Olympics in 1990, I would expect it to know that without me having to tell it, because it should have been trained on that data. That's closed QA. So it's checking whether or not this LLM knows the information without you having to tell it, which is useful, because now you know what you should tell it.
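Calling the ClosedQA scorer from autoevals looks roughly like this; the question and criteria wording here are made up for illustration, so check the package docs for the exact options:

```ts
import { ClosedQA } from 'autoevals'

// Ask the judge model whether the answer meets the criteria using only what
// the model already "knows", with no RAG context supplied.
// Requires an OPENAI_API_KEY in the environment by default.
const result = await ClosedQA({
  input: 'Which city hosted the 1992 Summer Olympics?',
  criteria: 'The answer correctly names the host city without needing extra context.',
  output: 'The 1992 Summer Olympics were hosted in Barcelona, Spain.',
})

console.log(result.score) // 1 if the judge says the criteria is met, 0 otherwise
```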

[00:07:19]
Humor, I have no idea what that is, but I'm guessing it detects if this thing is funny or not. [LAUGH] If people want to do that. Factuality is exactly what it sounds like, it's checking to see if the response is factual, if it's real, based off of some context.
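Factuality is one of the simpler scorers to try. Its usage follows the autoevals pattern of passing input, output, and expected, roughly like this:

```ts
import { Factuality } from 'autoevals'

// LLM-as-a-judge: compare the model's output against an expected answer.
// Requires an OPENAI_API_KEY in the environment by default.
const result = await Factuality({
  input: 'Which country has the highest population?',
  output: "People's Republic of China",
  expected: 'China',
})

console.log(result.score)               // 0-1 factuality score from the judge
console.log(result.metadata?.rationale) // the judge's explanation, when available
```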

[00:07:38]
Moderation, exactly what it sounds like. Some of these other things, I don't know. SQL, I guess it detects if this is legit SQL, if it's executable SQL, not broken SQL. Kind of scary. Then you have the RAG ones. This is where it gets really, you know what? I gotta go to another website to show you this one, actually.

[00:07:59]
So remember something called Ragas. Ragas is a Python framework whose literal only purpose is to help you eval your RAG-based systems. I was going to do that, I was going to implement some of this today, and I got, I don't know, I got three hours into it and I realized I was just

[00:08:22]
scratching the surface, and it wasn't worth it. It was going to be the whole course just talking about this. So I didn't do it, but it's worth talking about. Ragas is really cool. So if you go into their metrics, how do I go into their metrics?

[00:08:37]
Here we go, their website. None of those things are clickable, by the way. They look like they are, but they're not. [LAUGH] So if we go into their metrics, where are they? There we go, available metrics. I really like the way they break it down.

[00:08:52]
So they have some good ones on here. Context precision, and as you can see, this is what I'm talking about, it's starting to get deep, you see this math already. Context precision is actually really good. This one, basically, I'll show you the code. It basically takes a user input, what your LLM would respond with, and then what your vector database would have retrieved, and it will tell you how precise your retrieval mechanism is.

[00:09:23]
Did your retrieval mechanism find the appropriate pieces of data to answer this question, given the response that your LLM gave? And they actually break it down in a mathematical way to tell you how they do that. So if this looks scary to you, it's because it is.

[00:09:45]
But autoevals has this in TypeScript, so we could use it right now. You have it, it's in your repo. But I decided not to use it because it can get pretty hairy. Here's an example of how to do it with a reference. So a reference is like, you already ran the retrieval ahead of time and you already know what the reference is, so you don't need to run the retrieval again.

[00:10:09]
So you could just put it in here. So this is like, I'm just testing the retrieval part directly. I already ran them ahead of time. I don't need to run it again. So here's one and these two use an LLM to determine that precision. There's an LLM inside of this to tell you what's happening and I can show you that code.
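A with-reference context precision check, the flavor described here, might look roughly like this in TypeScript. Treat the export name and argument shape as assumptions based on the Ragas-style scorers that autoevals ports; verify against the package before relying on it.

```ts
import { ContextPrecision } from 'autoevals' // assumed export; check the package docs

// Reference-based check: retrieval was run ahead of time, so we pass the
// retrieved chunks and a known-good answer instead of re-running retrieval.
const result = await ContextPrecision({
  input: 'When was the first Super Bowl played?',
  context: [
    'The First AFL-NFL World Championship Game was played on January 15, 1967.',
  ],
  output: 'The first Super Bowl was played on January 15, 1967.',
  expected: 'January 15, 1967',
})

console.log(result.score) // how precise the retrieved context was for this question
```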

[00:10:29]
And then it can also do it without an LLM, where it just compares these two. So there is a lot going on here. There's noise sensitivity, context entities recall, faithfulness, like there's a lot of stuff going on here. And again, there's probably a handful of people in the world who are just up to date on all this stuff.

[00:10:52]
And if it's not the creators of this, it's probably people who contribute to this. So this is not something you need to be up to date with and understand, but you need to know that there are systems like this that exist, that you can pull from and learn when you need to learn.

[00:11:05]
But I promise you, every day that I wake up, there's a new white paper, like a better eval metric, a new standard on how to eval things, right? For instance, if you're making a coding agent, something that can be a copilot, you're gonna use SWE-bench, right? SWE-bench is something that you would run your coding agent through to detect how good it is at writing code.

[00:11:29]
And this is the standard for coding agents. And there's tons of these standards. There's tons of these metrics and scores that you use depending on what you're trying to test. This is why I said this is someone's full time job. This is why this is a team of people's job.

[00:11:43]
It's just to do this all day, which is also why I said, if you're looking for a skill to learn that's valuable, it is this skill. In my opinion, this is extremely valuable.
