Check out a free preview of the full AI Agent: From Prototype to Production course
The "Evals" Lesson is part of the full, AI Agent: From Prototype to Production course featured in this preview video. Here's what you'd learn in this lesson:
Scott introduces evals, which test and score responses from LLMs and agents. Datasets are required to help an eval understand the quality of a response. Datasets can be curated or provided by users who continuously rank the responses they receive from an agent. This kind of measurement is an essential component of production agents, ensuring a high level of quality.
Transcript from the "Evals" Lesson
[00:00:00]
>> Scott Moss: Let's dive into it. So evals, honestly, this could be a course by itself, and it would last a week, and still wouldn't be that good. [LAUGH] It's just really hard stuff, but it's necessary, you have to do evals. So what are evals? TLDR, and I know people who are in data science or machine learning or LLMOps, when they hear this, they're gonna cringe, but TLDR, evals are essentially tests for your LLMs, right?
[00:00:29]
So that's the best way I can describe it, but not exactly, you use them for testing your LLMs. Imagine writing a test for a function, and the function returned a date, but the way the function got the date was using Math.random, and the test was supposed to test what the date was gonna be.
[00:00:53]
How would you write tests for that? You really couldn't, and that's what it's like testing an LLM, because it's non-deterministic. There's a bunch of randomness associated with it, so you can't really write deterministic tests against the output. So that's where evals come in. Evals not only help us test the outputs of our systems, but also the accuracy and the quality, and not just the output but all the moving parts. Once we talk about RAG, there's retrieval, there's hallucination.
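To make that concrete, here is a minimal sketch of what an eval loop can look like in TypeScript. This is not the course's actual harness; the case shape, the scorer, and the `system` function are placeholders. The point is that instead of asserting exact equality like a unit test, each case gets scored by a tolerant scorer, and the run is summarized as an average.

```ts
// Minimal eval loop (hypothetical sketch, not the course's actual harness).
// Instead of asserting exact equality like a unit test, each case is scored
// with a tolerant scorer, and the run is summarized as an average.
type EvalCase = { input: string; expected: string };

type Scorer = (expected: string, actual: string) => number; // 0..1

// Tolerant scorer: full credit if the expected text appears in the output.
const containsScorer: Scorer = (expected, actual) =>
  actual.toLowerCase().includes(expected.toLowerCase()) ? 1 : 0;

async function runEval(
  cases: EvalCase[],
  system: (input: string) => Promise<string>, // your agent or LLM call
  score: Scorer = containsScorer
) {
  const results: Array<EvalCase & { actual: string; score: number }> = [];
  for (const c of cases) {
    const actual = await system(c.input);
    results.push({ ...c, actual, score: score(c.expected, actual) });
  }
  const avg = results.reduce((sum, r) => sum + r.score, 0) / results.length;
  console.log(`average score: ${avg.toFixed(2)} over ${results.length} cases`);
  return results;
}
```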
[00:01:28]
There's so many things that you can test, the moving parts of your agent-based system. And I would say a lot of the research that's happening today in this space is around evals, it's around new metrics that people are discovering. How to discover, how to plan, how to execute these things, how to improve once you run evals.
[00:01:51]
This is a team of people, this is their job to do this, so I also recommend, if you're looking for a new skill to pick up to be relevant in the workforce of tomorrow, this is the skill, in my opinion. This is one of those skills that I think, if you knew it very well, yeah, you would be highly valuable on a team, because everybody's gonna be using LLMs.
[00:02:14]
And I promise you, if your team doesn't know how to use LLMs and you go grab OpenAI and you do a wrapper around it, you put it in production, it's gonna be bad. It's gonna be really bad, and you're gonna have to figure out how to improve it, and this is the next thing you're gonna have to do.
[00:02:28]
So they're gonna want someone who can do this already, because there really aren't any all-in-one platforms that will do this for you and improve it, so this is a very critical skill, in my opinion. Unlike testing, yeah, cool, maybe you don't like writing tests in React. Yeah, you probably could still get a job. No one's gonna hire you just because you can write good Jest tests, unless you're QA. But evals, no, that would be your whole job.
[00:02:58]
That would literally be your whole job is just to do this. All right, enough ranting about that, so I talked about the challenges of evaluations. Yeah, it's non-deterministic for the most part, so how do you write tests against that? And then let's talk about the different types of evaluations.
[00:03:14]
So you have what I call automated evaluations, which are basically, you ever use ChatGPT or Claude or anything and they have a thumbs up on a response? You'd be like, cool, this was good, or this was bad, and you can explain why. The reason they have that is because what you're doing is labeling data for them, so they can use that in an evaluation that they have set up on ChatGPT.
[00:03:43]
They're evaluating it, but the best way to evaluate something is to have some data, and you'll see in a minute, where you can say, here is the input, and here was the expected output, and here's the actual output. And with those three things, you can check that against tons of different metrics to determine whether or not the system is doing good.
[00:04:05]
So the part that they're missing is the expected output, and that's what the thumbs up and thumbs down is. If you put thumbs down, then you didn't expect that, if you put thumbs up, that is exactly what you expected. So they can use that to help them create data sets for their evaluation cuz that's probably the hardest part, I think or the most time-consuming part, I think, not the hardest part.
[00:04:26]
The hardest part is coming up with metrics, so, different types of metrics. This is where all the science is happening. I read some of this stuff every day, and I have no idea what any of it is, it's just beyond me, it's research-scientist-level stuff. But for the most part, there are things there that you can grok and understand, and use to help you check things.
[00:04:51]
So you have things like response quality metrics, these are things that measure coherence, relevance, toxicity, classifiers. So, for instance, you might have an LLM, a smaller LLM that you use to test things, and then in production you go to something bigger. So you wanna check whether the smaller LLM generates answers to the questions and whether or not they're relevant before you go to a bigger LLM, or even vice versa.
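One common way to implement a relevance metric is LLM-as-judge: ask another model to grade the answer. The sketch below assumes the OpenAI Node SDK; the judge model name and the grading prompt are placeholders, not anything prescribed by the course.

```ts
import OpenAI from "openai";

const openai = new OpenAI();

// LLM-as-judge relevance metric (sketch). The judge model name and the grading
// prompt are placeholders.
async function relevanceScore(question: string, answer: string): Promise<number> {
  const res = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content:
          "You grade answers. Reply with only a number from 0 to 1 for how relevant the answer is to the question.",
      },
      { role: "user", content: `Question: ${question}\nAnswer: ${answer}` },
    ],
  });
  // Parse the judge's reply; fall back to 0 if it isn't a number.
  return Number(res.choices[0].message.content) || 0;
}
```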
[00:05:24]
So it's like checking the relevance of a question. Then there's task completion verification. This is a really good one, this is just ensuring that your agent went all the way through a workflow and did a thing. For instance, if we were generating an image, I would wanna know that, given this input, it went all the way through the right steps that I thought it was gonna do.
[00:05:42]
It went through an approval, the approval said yes, it went through, and the next thing it did was call the generate image tool function. It generated the image based off this output, and then it responded back with this, so the whole flow, right? I wanna be able to know that it did that based on this input that I gave it, and I can come up with different inputs.
[00:06:01]
Whether they're user-defined ones that I log or synthetic ones that I made, which is something you'll probably start with. And tool accuracy is just a smaller step within that, just being like, did you call the right tool at the right time and interpret the outputs correctly?
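Here's a rough sketch of both checks, using made-up trace and result shapes: task completion verification walks the recorded steps and confirms the approval, the expected tool call, and a final response all happened; tool accuracy is just the fraction of cases where the expected tool was called.

```ts
// Made-up trace shape for illustration; your agent's logs will look different.
type Step = { type: "approval" | "tool_call" | "response"; name?: string };

// Task completion verification (sketch): did the agent go through approval,
// call the expected tool, and end with a response?
function completedWorkflow(trace: Step[], expectedTool: string): boolean {
  const approved = trace.some((s) => s.type === "approval");
  const calledTool = trace.some(
    (s) => s.type === "tool_call" && s.name === expectedTool
  );
  const responded = trace.at(-1)?.type === "response";
  return approved && calledTool && responded;
}

// Tool accuracy (sketch): across all cases, how often was the right tool picked?
function toolAccuracy(
  results: { expectedTool: string; calledTool: string }[]
): number {
  const correct = results.filter((r) => r.calledTool === r.expectedTool).length;
  return correct / results.length;
}
```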
[00:06:17]
So something that I've realized a lot in production is, the more tools you give one LLM, the worse it gets at picking the right tool. So then there's different strategies on how to break that up. Do I have multiple LLMs, each with a group of tools? Do I have an intent classifier that sits in front of this LLM, and that classifier's job is to decide what tool to pick so the LLM doesn't have to, it can just run it?
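One possible shape for that intent-classifier strategy, with made-up group names, tool names, and a stubbed classifier, just to show the routing idea:

```ts
// Routing sketch for the intent-classifier strategy. Group names and tool
// names are made up; classifyIntent would itself be a small LLM call or a
// cheap classifier, passed in here so only the routing shape is shown.
type ToolGroup = "image" | "email" | "calendar";

const toolGroups: Record<ToolGroup, string[]> = {
  image: ["generate_image", "edit_image"],
  email: ["send_email", "search_inbox"],
  calendar: ["create_event", "list_events"],
};

async function route(
  input: string,
  classifyIntent: (input: string) => Promise<ToolGroup>
): Promise<string[]> {
  const group = await classifyIntent(input);
  // Only this smaller set of tools gets handed to the main LLM.
  return toolGroups[group];
}
```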
[00:06:43]
There's so many things you can do there. And then even if the tool was decided correctly, and it was executed, and you fed the answer back to the LLM, it still might get the answer wrong for some reason. If you have a tool that gets the current time, and it returns the current time, and the LLM still gives you an answer with the wrong time, then, yeah, you would wanna fix that part.
[00:07:07]
That would be something you would wanna eval, like, why is the question-answering part of my system really bad when the retrieval and the tool accuracy are really good? So what's going on there? You would have to take that part out and run an eval for it.
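A sketch of isolating that one step: feed a fixed, known tool result into a hypothetical `answerWithToolResult` function, which stands in for your "interpret the tool output and reply" step, and check that the reply actually reflects it.

```ts
// Isolating the answering step (sketch). answerWithToolResult is a hypothetical
// stand-in for the part of your agent that takes a tool's output and replies.
async function evalToolInterpretation(
  answerWithToolResult: (question: string, toolResult: string) => Promise<string>
): Promise<number> {
  const question = "What time is it?";
  const toolResult = "14:05";
  const answer = await answerWithToolResult(question, toolResult);
  // If retrieval and tool accuracy look fine but this fails, the answering
  // step is the part that needs work.
  return answer.includes(toolResult) ? 1 : 0;
}
```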
[00:07:20]
So it really is like doing science, and that's the best way I can describe it. Dataset creation, I talked about it a little bit, at the end of the day, there's really just, I don't know, there's two main ways, there might be a third way. You either create synthetic data, so you just make up the data.
[00:07:38]
Some of this stuff is actually really easy to make up, and we'll be making up our data today. Cuz, for instance, if I know my agent can generate images, and I know that if I ask it to generate an image, it should select the generate image tool, then I can just make datasets that are inputs of users asking to generate some type of image, in whatever phrases I want.
[00:07:58]
I can even ask an AI to come up with those inputs for me and put them in there, right? I could also have an AI generate inverse inputs, give me some inputs that a user might say that has nothing to do with generating an image, so I can eval that my agent doesn't call this tool, right?
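A sketch of generating those synthetic inputs with the OpenAI Node SDK (the model name and the prompts are placeholders): one call produces inputs that should trigger the image tool, the other produces inverse inputs that should not.

```ts
import OpenAI from "openai";

const openai = new OpenAI();

// Synthetic input generation (sketch); the model name and prompts are
// placeholders. Positive inputs should trigger the image tool, inverse
// inputs should not, so both behaviors can be evaluated.
async function generateSyntheticInputs(shouldTriggerImageTool: boolean) {
  const res = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "user",
        content: shouldTriggerImageTool
          ? "Give me 10 different ways a user might ask to generate an image. One per line."
          : "Give me 10 user messages that have nothing to do with generating an image. One per line.",
      },
    ],
  });
  const lines = (res.choices[0].message.content ?? "").split("\n").filter(Boolean);
  return lines.map((input) => ({
    input,
    expectedTool: shouldTriggerImageTool ? "generate_image" : null,
  }));
}
```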
[00:08:16]
So those are examples of synthetic data, where you kinda know, so you just make some of them. The other way is obviously getting data from users in real-world scenarios. So if it's a chat app, or if it's just an input bar that you have that's a one-off thing, you log those, you capture those inputs, and then you can use those as a golden dataset.
[00:08:38]
And then if you offer that ability for a user to give feedback on the output, whether or not this was good or not, now you have the expected output to go along with it, very similar to what I said with ChatGPT.