AI Agent: From Prototype to Production

What to Measure with Evals

Scott Moss

Superfilter AI


The "What to Measure with Evals" Lesson is part of the full, AI Agent: From Prototype to Production course featured in this preview video. Here's what you'd learn in this lesson:

Scott discusses what evals should measure. Quality metrics can help assess how well tasks are performed. Measuring also adds safety and reliability to the application by ensuring the system operates within acceptable bounds.


Transcript from the "What to Measure with Evals" Lesson

[00:00:00]
>> Scott Moss: This is the part that I struggle with the most. And it makes sense, because this is the part that's most valuable. If deciding which metrics to use and how to apply them is what you were good at, you would be employed for a long time, because this is the hard part, I think.

[00:00:15]
So, yeah, none of this really matters if you don't know what you're measuring. And again, it's so funny, people are getting really creative with how they measure stuff. But at the end of the day, there are basically three different types of measurements that I've seen.

[00:00:32]
There are LLM-based measurements, where you use an LLM at the end. It's called AI-as-a-judge, or LLM-as-a-judge, where you use some form of an LLM, with some prompt or a multi-turn exchange, to grade for some type of metric. So it might be, let's say you have an LLM whose job is to take in some messages and summarize them.

[00:00:55]
How would you measure that? Well, a really good metric would be, I wanna take the output, the summary, and I wanna take all the messages that I gave it. And then I wanna use another LLM to look at those messages and this summary to see if there's some semantic similarity between them.

[00:01:11]
And if so, score it from zero to one, and that'll give me a number on whether or not this was semantically related. So that's like a semantic similarity score, or something like that. And that's just one metric, you might have many of those metrics.
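
As a rough illustration of that kind of LLM-as-a-judge metric, here's a minimal sketch in TypeScript using the OpenAI Node SDK. The judge prompt, the gpt-4o-mini model choice, and the semanticSimilarityScore name are assumptions for the example, not code from the course.

```ts
// Minimal LLM-as-a-judge sketch (not the course's implementation).
// Assumes the OpenAI Node SDK and an OPENAI_API_KEY in the environment.
import OpenAI from "openai";

const client = new OpenAI();

// Ask a judge model to rate how semantically similar the summary is
// to the source messages, returning a score from 0 to 1.
export async function semanticSimilarityScore(
  messages: string[],
  summary: string
): Promise<number> {
  const response = await client.chat.completions.create({
    model: "gpt-4o-mini", // assumed judge model
    temperature: 0, // keep the grading as deterministic as possible
    messages: [
      {
        role: "system",
        content:
          "You are an evaluator. Given source messages and a summary, " +
          "reply with ONLY a number between 0 and 1 indicating how " +
          "semantically similar the summary is to the messages.",
      },
      {
        role: "user",
        content: `Messages:\n${messages.join("\n")}\n\nSummary:\n${summary}`,
      },
    ],
  });

  const raw = response.choices[0]?.message?.content ?? "0";
  const score = Number.parseFloat(raw);
  return Number.isNaN(score) ? 0 : Math.min(Math.max(score, 0), 1);
}
```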

[00:01:26]
You might also wanna check, well, I specifically told this LLM that when it summarizes, it should include entities, so names, places, things like that. So I wanna check for entities. So I'm gonna make another metric that's another LLM, that takes these chat messages, parses out the entities, and sees if they're in the summary or not, right?

[00:01:47]
And then, if so, I think there should have been four entities in here. If there are four, that's a one. If there's only one, then maybe that's a 0.25, however you wanna grade it, it's not really a science. [LAUGH] So it can get pretty tough. Luckily for us, smarter people have created a lot of these metrics and frameworks that we can use, and they have really good documentation, so you can kind of play around with them.
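
Here's a sketch of that entity check, stripped down to simple string matching (in practice the extraction step might itself be another LLM, as described above). The function name and example entities are made up for illustration.

```ts
// Entity-coverage metric sketch: what fraction of the expected entities
// (names, places, etc.) made it into the summary? 4 of 4 => 1.0, 1 of 4 => 0.25.
export function entityCoverageScore(
  expectedEntities: string[],
  summary: string
): number {
  if (expectedEntities.length === 0) return 1;

  const found = expectedEntities.filter((entity) =>
    summary.toLowerCase().includes(entity.toLowerCase())
  );

  return found.length / expectedEntities.length;
}

// Example: only "Alice" appears in the summary, so the score is 0.25.
entityCoverageScore(["Alice", "Bob", "Paris", "Acme Corp"], "Alice went home.");
```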

[00:02:14]
I also found that it's very helpful to use something like Claude or ChatGPT as a pairing partner to think about these metrics and to go through and try to figure out what is the right thing to test here. So, there are a lot of options, and there really are no wrong answers on what you need to test.

[00:02:34]
But what I can tell you is, if you do not write evals and pick the right metrics, your AI product is gonna be bad. [LAUGH] It's gonna be really bad and you're not gonna know why. And that's the difference between writing a test in your code and an eval.

[00:02:51]
Because a test is binary, it passed or it failed. An eval is a range, it's like, what threshold is this within, right? What is the accuracy of this, right? Is it 50%? If your agent is a financial agent, and it's 50% accurate at doing its job, I doubt anyone's ever gonna use that product, because it only gets it right 50% of the time when dealing with my money.

[00:03:16]
I don't think I want that. I would imagine for something financial, it has to be, minimum, 90%, and even then, that's probably too low for anything financial. How do you know what that is, though? You have to measure that. You have to figure out what you're measuring. And you have to know the difference, so that way you can improve those things, all right?
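
To make the test-versus-eval distinction concrete, here's a hypothetical sketch in a Vitest style: the regular test is pass/fail, while the eval averages a metric over a dataset and checks it against whatever threshold you've decided is acceptable (the 0.9 here just echoes the financial example). The helpers and data are stand-ins, not the course's code.

```ts
import { describe, it, expect } from "vitest";

// Hypothetical stand-ins, just to make the sketch self-contained.
const parseAmount = (s: string) => Number(s.replace("$", ""));
const dataset = [
  { question: "How much did I spend on coffee?", expected: "$42" },
  { question: "What's my checking balance?", expected: "$1,250" },
];
// Pretend scorer: in practice this would call your agent and a metric.
const scoreAgentAnswer = async (_example: (typeof dataset)[number]) => 0.95;

describe("test vs. eval", () => {
  // A regular test is binary: the output either matches or it doesn't.
  it("parses the amount", () => {
    expect(parseAmount("$12.50")).toBe(12.5);
  });

  // An eval is a range: average a metric over a dataset and compare it
  // to the threshold you decided is acceptable (e.g. 90% for finance).
  it("stays above the accuracy threshold", async () => {
    const scores = await Promise.all(dataset.map(scoreAgentAnswer));
    const accuracy = scores.reduce((sum, s) => sum + s, 0) / scores.length;
    expect(accuracy).toBeGreaterThanOrEqual(0.9);
  });
});
```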

[00:03:32]
Cool, here are some practical implementation strategies. This stuff doesn't really matter for the most part, everyone kinda does it differently, but at the end of the day, you basically do offline evals, and you do evals when you deploy, that's essentially it. So we're gonna do the offline evals. An offline eval is like, you just run it from your computer.

[00:03:54]
We're gonna run this eval, it's gonna give me a report. I'm gonna make some changes, and we'll run the eval again. It gives me a report, I make some changes, and you just keep doing that, and then you deploy your changes to your code like you normally do.

[00:04:04]
But you can also do them inside of a GitHub Action or something like that, just like you would do any other CI-based test. You could do evals there as well, but you probably will just be doing a lot of offline evals, I would imagine. But there's no wrong answer, I know some people that have a chat app, and every X hundred messages they get, they run the evals again off of those new data sets just to see what has changed.
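
A bare-bones version of that loop might look something like the sketch below: run a labeled dataset through a scorer, print a report, and exit nonzero if the average drops below a threshold, so the same script works locally or as a CI step inside a GitHub Action. The file name, dataset shape, and scoreCase stub are all hypothetical.

```ts
// run-evals.ts -- hypothetical offline eval runner; also usable as a CI step
// (e.g. `npx tsx run-evals.ts` inside a GitHub Actions job).
type EvalCase = { input: string; expected: string };

const dataset: EvalCase[] = [
  { input: "Summarize: Alice met Bob in Paris.", expected: "Alice and Bob met in Paris." },
  // ...more labeled cases
];

// Stand-in scorer: in practice this calls your agent plus one or more metrics.
async function scoreCase(c: EvalCase): Promise<number> {
  return 0.8; // placeholder score between 0 and 1
}

async function main() {
  const scores = await Promise.all(dataset.map(scoreCase));
  const average = scores.reduce((sum, s) => sum + s, 0) / scores.length;

  console.log(`Ran ${scores.length} cases, average score: ${average.toFixed(2)}`);

  // Fail the run (locally or in CI) if quality dips below your threshold.
  if (average < 0.7) {
    console.error("Eval score below threshold, failing the run.");
    process.exit(1);
  }
}

main();
```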

[00:04:31]
Stuff like that. It's really good to be able to do that, because if users are annotating the outputs of your LLM and telling you whether or not that's what they expected, you can have some type of alerting system for when you hit a threshold of, wow, a lot of people said this is not what they expected.

[00:04:47]
So now you know you have a degradation in your AI system and it's not really good. So you might get notified about that, and you can just go see, oh, they're all asking this question, let's just add something to the system prompt to address that. Boom, fixed it, right?
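
One rough sketch of that kind of alerting: track the share of recent user annotations marked as "not what I expected" and fire a notification once it crosses a threshold. The Feedback shape, the notify stub, and the 30% threshold are assumptions for illustration.

```ts
// Hypothetical feedback record: users annotate whether the LLM output
// was what they expected.
type Feedback = { outputId: string; expected: boolean };

// Stand-in notifier (Slack, PagerDuty, email, etc.).
function notify(message: string) {
  console.warn(message);
}

// Alert when the "not what I expected" rate over recent feedback crosses
// a threshold, signalling a degradation worth investigating.
export function checkFeedback(recent: Feedback[], threshold = 0.3) {
  if (recent.length === 0) return;

  const badRate = recent.filter((f) => !f.expected).length / recent.length;
  if (badRate >= threshold) {
    notify(`${(badRate * 100).toFixed(0)}% of recent outputs were flagged as not expected`);
  }
}
```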

[00:05:01]
And you would not have known that if you didn't see that. So you have to have those evals put in place. I talked about human feedback, it's free data. At the end of the day, you're trying to get a golden data set, right? A golden data set is a data set that's already labeled for you.

[00:05:19]
It says, here's the input, here's the expected output. That's a golden data set. You can synthesize that, but getting it from a user is really tricky. So if you can get that, you're good, right? And then in some scenarios you don't get it from users, like, for instance, with our system, we were trying to train it on how to determine whether an email is important to someone or not.

[00:05:42]
And we were able to create that data set to test with, for instance, my own email. Going through Gmail, I would put a star on every email that I thought was important to me. And then I would export all my emails, and when I exported them, each one had the Important label on it, or, I think, the Starred label on it, if it was starred.

[00:06:01]
That's a golden data set. It was annotated that these are the important emails, and the ones that don't have the label aren't important. So I was able to train an LLM on those. So it was basically free. But you're not always gonna be able to do that. So those are just examples where it's like, wow, I'm glad I got that.
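
In other words, a golden data set is just inputs paired with the labels you expect. Here's one hypothetical shape for that starred-email example; the field names are invented for illustration, not taken from the actual export.

```ts
// A golden data set: inputs paired with the labels you expect.
type GoldenExample = {
  input: { subject: string; from: string; body: string };
  expected: { important: boolean };
};

// Built from an email export where starred messages carried an
// Important/Starred label -- free annotations from normal usage.
const goldenDataset: GoldenExample[] = [
  {
    input: { subject: "Invoice overdue", from: "billing@vendor.com", body: "..." },
    expected: { important: true },  // was starred in the export
  },
  {
    input: { subject: "Weekly newsletter", from: "news@example.com", body: "..." },
    expected: { important: false }, // not starred
  },
];
```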

[00:06:17]
So, yeah, getting the data is hard. And this is why it takes time to improve an AI system, not because it's like, you need all the time in the world to implement this code. It's just because, you need to collect the data, and collecting the data takes a lot of time, especially if you don't have a lot of users, or a lot of users giving you feedback.

[00:06:32]
So sometimes you just need to collect enough data to see what's going wrong so you can improve it. And yeah, that's pretty much the strategy that I see today as far as people getting things out: they'll prototype something, kinda like what we did in the other course, in a day, maybe two days.

[00:06:49]
They'll ship it either to some folks internally or some beta testers who know that the thing is gonna be busted. If they were to eval it right now, it'll probably be below 50% on everything. It'll be really bad. They put it out there, and then they just collect the feedback.

[00:07:05]
And in the meantime, they're writing the evals. They collect that feedback, they feed it back into the eval system, and then they start improving it. And once they get past some threshold that they're comfortable with, let's say, I don't know, 70%, they open the door so more people can use it.

[00:07:20]
And then they collect even more feedback. And then once they feel comfortable, when they get it to, I don't know, 90%, then it's like, cool, let's get this out to everybody, right? So that's typically how I see things happening for a lot of people. And depending on how many users you have and your feedback loop, that could take a month, it could take three months, it could take a year, who knows?

[00:07:36]
>> Speaker 2: So you said, running them locally, how do you manage the cost? I'm assuming, if you're running tests against real LLMs, there are costs associated with that, so what's your kind of strategy around that?
>> Scott Moss: There is none, you just gotta pay for it. [LAUGH] I think the only time you wanna manage costs is when users are hitting it, cuz then it's not in a controlled environment.

[00:07:59]
>> Speaker 2: Yeah, sure.
>> Scott Moss: But when it comes to evals, the only time you're worried about cost is when you're trying to reduce the cost of the user-facing systems. So that is a constraint that you add. Like, for instance, if I know that when my users use my agent, they're gonna be using GPT-4o mini, but I'm evaling with GPT-4o, it doesn't really make sense.

[00:08:23]
I should probably eval with GPT-4o mini as well, which reduces the cost of my evals. But for the sake of reducing costs for evals, no, there's just no way around it right now. I mean, obviously, you can use local stuff, but at the end of the day, you wanna be evaling the exact same systems that your users use.
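
One way to keep that constraint honest, as a sketch: share a single model constant between the production agent and the eval harness, so you're always evaling the same system your users hit. The module layout and names here are just an illustration.

```ts
// config.ts -- hypothetical shared config so the eval harness and the
// production agent always run against the same model.
export const AGENT_MODEL = "gpt-4o-mini";

// agent.ts (production) and run-evals.ts (offline evals) both do:
//
//   import OpenAI from "openai";
//   import { AGENT_MODEL } from "./config";
//
//   const client = new OpenAI();
//   const res = await client.chat.completions.create({
//     model: AGENT_MODEL, // never a different model just for evals
//     messages,
//   });
```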

[00:08:44]
So yeah, there really is no way around it. But it's really not too bad. I mean, we run evals, we have hundreds of thousands of records, and, I don't know, it maybe costs me a few bucks a month to run these evals all the time. So it's not too bad, but then again, I'm not at scale, [LAUGH] and maybe some other company with way more users than me might be paying hundreds of thousands of dollars a month on evals.

[00:09:11]
>> Speaker 2: [INAUDIBLE] they can afford it by that point.
>> Scott Moss: Right, they can afford it at that point. And, I think, the amount of data that you process kinda, I mean, it eventually tails off. But I think there is some type of correlation, it's not linear, between how many users you have and how much data you're processing in your evals, right?

[00:09:30]
So it does have some type of correlation there. But I think orgs that are really serious about using LLMs in production probably spend 20 to 30% of their time just doing evals. It's a significant part of their time. Otherwise, you have people that are just doing it all on vibes, which is the current meme right now.

[00:09:50]
It's like, yeah, that looked good when I ran it ten times. That looked good. So I'm gonna deploy it and just be done with it.
