AI Agents Fundamentals, v2

Understanding Evals

Scott Moss
Netflix
AI Agents Fundamentals, v2

Lesson Description

The "Understanding Evals" Lesson is part of the full, AI Agents Fundamentals, v2 course featured in this preview video. Here's what you'd learn in this lesson:

Scott explains single-turn evals, which track metrics from one agent pass, highlighting their importance for testing non-deterministic AI. He also contrasts offline and online evals, emphasizing their role in guiding improvements and informed decisions.

Preview

Transcript from the "Understanding Evals" Lesson

[00:00:00]
>> Scott Moss:Let's talk about eval. So, single turn eval, so we know what turns are, right? It's when you take the AI slop that the LM generates and feed it back to itself. So that's a turn, right? A single turn eval is just us evaling just one pass. So what are the things that we want to track in just one pass of an agent, not like a full run, not like we give you a task, you do your thing, how many steps it takes, how many tool calls it takes.

[00:00:27]
And then we get the result and we're evaluating the result. No, we'll do that later, but for this one it's just like on a one turn basis, what are the things that we want to look at? Is this the official thing that people do? They call them single turn evals? No, no there is no standard on evals. I'm just gonna tell you right now there are no standards. There are like opinions and things that we've collectively done.

[00:00:50]
But there are no standards. This is what I'm teaching you right now about evals is what I've done for agent evals is I like to eval on a single turn and all the things that matter to me on that single turn, and then I like to eval the whole system on a full turn, right? You could think of like single turn stuff is like unit testing almost and then like a full run would be like an end to end test. That's like the best way I can describe it, OK.

[00:01:17]
So why do we need evals and what are they? So, basically, we know what tests are, right? Tests are is code that tests other code and that code is deterministic given this, well, unless you have like some even if you have like mutative code that like in the case of JavaScript interacted with closures, and if you do shame on you, shame. Like, what are you doing? Like. You have closures, if you have closures in a global scope, go look in the mirror, right?

[00:01:48]
Just reflect, reflect, reflect a little bit. But I guess closures in the class is fine, but unit test is, you know, they're testing code with other code because we know the results given a certain input, we expect these outputs. We can predict what the output is gonna be, it's not that hard, it's A or B or C or D like the even if it's a lot of options, they're still finite. They're finite options, we can solve for them.

[00:02:12]
The more options that this thing can return, break your function up, that's too many options, but still you can still think of them, they're finite. That's a unit test, right? Um, end to end test is like I wanna know how all this works together. This button is clicked, this function is called, this API is hit, the server does this, the response is that, it's a full run, right? OK, it's very similar to agents, it's just that the code the output is nondeterministic.

[00:02:38]
OK, so if you're testing something that's always gonna be the same, then it's easy to write tests for it. If I ask the LLM the same thing 3 times in a row. Am I gonna get back the exact same thing with the same model? No. Anybody know why? Because, uh, LLMs are nondeterministic. Do you know why LMs aren't deterministic? Is there a seed. There is a seed you can, you can avoid that though. There is an option to like, you can reuse the same seed, yeah.

[00:03:16]
Oh, there's temperature, there's, uh, yeah, I mean I could put temperature on zero. Yeah, well, that should make it deterministic kind of. In theory, that's what that means. It means cut down the randomness. But they're still nondeterministic, right? It's actually funny, I thought I knew until just like a couple of weeks ago and someone came up with a research paper, I forgot what lab did a research paper, and they figured out, hey, how come these things aren't deterministic even when we give it the same seed and the temperature on zero, they should be deterministic, like what the hell is going on here?

[00:03:53]
And it actually has, if I remember correctly, it's on a kernel level on the GPU. It has to do with like how the math is done. And how different the result of the math is depending on the load in which the GPU is handling. So like if you hit the same GPU and you were the only request for that GPU and you had temperature zero, everything the same, you will most likely get a deterministic result every single time.

[00:04:26]
But because you're hitting a GPU that's basically also responding to other requests. There's something about that on the kernel level that affects the output that you're gonna get. It introduces a bit of randomness that was not part of an algorithm. It's an infrastructure related thing. It's an electricity related thing. And I was like, that is very fascinating. So that is actually why you get different um outputs when you think you might get something non-different and then, you know, some models also like to sprinkle in a little RNG in there just to, just to mess with you, you know, um, but outside of that, yeah, that's what the research paper said and I thought that was pretty fascinating.

[00:05:07]
It's on a kernel level and it has to do with load. Um, I should have linked that paper, uh, it was pretty fascinating, but um, yeah, they're not deterministic, so like, we can't really test them. What we can do is we can try to take this qualitative output from an LLM and quantify it. And then just measure it, right? It's like you don't test analytics, you measure analytics and you watch it, you see what happens.

[00:05:39]
Does it go this way or does it go that way? It's a metric. It's not, uh, it's not a boolean, it's not yes or no. Although some of the scores that we'll see can be deterministic and can be that way because of structured outputs, but it's something that you need to measure and something that you need to watch. The best way I can relate that to is like let's say you did um. Actually, snapshot testing is a really good one.

[00:06:04]
So snapshot testing is where like you take a picture of your web app, right? Visually and of the DOM, right? And then somebody introduces a change, that change should not change the UI. There should not be one single shift in pixels. So what you do is you take another screenshot, you put them on top of each other, right? And you see what pixels don't match, right? And the same thing with the DOM, you compare those virtual DOMs to each other which are just objects and you see if something is different on the measurements.

[00:06:31]
That's a snapshot testing. We do that tons of that at Netflix, tons of like thousands of those, right? They're really effective. But it's like something that you measure it's like you can set thresholds on like how different this should be and you can also say like oh that's expected, so here's the new baseline. The old snapshot test is old. I actually did introduce a change to this UI so this is the new baseline.

[00:06:54]
So I don't care if that if that failed because I actually added a new button and that should shift these pixels should be different. That's expected. That's a like an eval, right? You're evaluating the discrepancy between the previous run and the next run, you're not checking to see if something passed or failed because in eval you're comparing something you're comparing this thing versus this thing.

[00:07:20]
It's usually the previous thing versus the current thing and what is that difference? And typically you don't want that difference going down. You want it to actually. The size of the difference doesn't matter. You just want the latest one to go up, right? Like, and, in fact, if it's bigger, that means you just had a really huge breakthrough and something you did, whether you just upgraded to a new model or you unlocked, you know, big brain theory and made a new agent architecture and now like your, your scores are so high, whatever you did, that's even better, but like it's the comparison.

[00:07:53]
So an evaluation is a comparison between something. You know, versus what you're currently doing, whereas a test it's just like, yes or no, true or false. Does that make sense? Like the eval will have tests in it. We will have what you would consider test in the eval and in fact you can actually use something like Vitest, which is a testing framework that I use a lot in my JavaScript projects, kind of like jest but way better to actually run evals, um, and I was gonna do it, but like, there was no GUI for it, so I was like, yeah, I wanna impress them.

[00:08:28]
I want them to like me, so I'm gonna use something that like has a GUI, so I went with that, yeah. OK, um, and, you know, because you're measuring against something else, you can look at regressions. You can get that confidence in deployment. You can feel pretty confident like, hey, we ran these evals. You know, on thousands and thousands of simulations, real user input, fake synthetic data that we came up with, golden data sets, negative data sets, all these things, and our evals are sitting pretty at like 90%.

[00:09:01]
So I feel confident in production that this thing's gonna do well. Whereas if this thing was sitting at 40%, I would not be confident, right? It's like test coverage. That test cover says you only tested 30% of your code. I know you're scared, right? I know this keeps you up at night, right? It's like that. So it's like it's not a guarantee, but it's like, well, at least we're accounting for something, you know what I mean?

[00:09:24]
And then, you know, debugging because part of evaluation, it's kind of separate but very similar, it's like tracing. So we've been logging everything to kind of see what's going on, but now we're gonna add some tracing to it so we can actually see what's actually happening on a step by step basis and we can like follow that tracing and go from there, right? So, basically without evals, you are literally flying blind.

[00:09:46]
There's no way you can make a better agent without evals, you just can't. I've tried it, uh, and trust me, there's like a whole thesis, there's like a whole thing of like you know, based off vibes, like that's where vibe-coding came from. Vibe-coding came from like people just making agents off vibes, like, uh, I typed in 3 prompts, it worked pretty good. Ship it, right? And it's like, one of the customers types the exact same thing and it dies, and you're like, damn, that worked on mine.

[00:10:11]
Like I can't tell you how many times when I was running my company where we were just going off vibes and like I would do something and it was perfect. It was like, oh my God, we figured out how to make the perfect agent. And like one of my co-founders would try to like, this is shit. I'm like, what'd you type in? And they're like, the thing you gave me. And I'm like, no, that worked for me. I did it 10 times in a row, it worked.

[00:10:34]
Uh, it didn't work for me the first time. I'm like, there's something wrong with your computer. There's something wrong with your computer, use my computer, it's you, right? And then, no, it was just the non-determinate nature of the thing and like I wasn't measuring it, so I didn't know that there was something specific, uh, that I wasn't catching, right? So. Um, anyway, offline versus online evals.

[00:10:56]
The other thing with evals is for the most part until you get to like a production environment, you're gonna be running evals locally offline, right? You're not, these aren't, I guess that's very similar to test, right? Like where you just, you run them locally on your on your computer, you run them in a CI environment, they're not like things that they're not live. They're not like some data came in and we now need to run an eval right now.

[00:11:21]
Like you could do that with online evals, but offline evals are very similar to test. You just run them locally. In fact, most of the time you're probably gonna run them locally. You wouldn't even need a CI unless you really were like. I don't know, you want to block on regressions or something, you could, you could use a CI, but for the most part you'll just run them on your machine, uh for the most part, um.

[00:11:42]
Just like tests, very simple. This is what we're gonna be making today. We are not gonna be doing online evals. Online evals is like, all right, we got a whole system, you use a chat app and it's like a thumbs up, thumbs down on the output. Right? They're like, give me a thumbs up or a thumbs down if you like this output, like right next to the chat block that that thing generates, right? OK. That's for their online evals, right?

[00:12:07]
Or at least that's for collecting data for a data set that they will then eval later. When you put thumbs down, they take that output plus your input and they go add it to the data set of being like this was bad, right? And then they either have like some human-in-the-loop person who's like an expert that looks at it or they have some other LLM that looks at it or they just start evaluating it and then when they see it fail, they'll be like, uh.

[00:12:29]
Why did that fail? We have to come up with a hypothesis on why this thing failed. Is it because they asked something that our agent isn't supposed to do and we expect that to fail and and that's the case, screw them. Why are they asking that? Let's change our marketing and make sure people aren't asking for that cause they're using it wrong. Or was it supposed to do this and it still failed? And if that's the case, why did it fail?

[00:12:56]
And then that's where you start implementing changing the prompts around, the descriptions, uh, maybe it's an architecture thing, maybe it's, uh, we don't even have tool coverage for that or whatever, you gotta like start like experimenting with it. But then you gotta eval those changes to make sure that it does actually make that thing, that input output pair pass, right, so. That's how it is. You're making experiments against data and online eval is like you doing that as things come in, right?

[00:13:30]
So that's way advanced and for most people you don't really need that and typically you would use like an LLM as a judge, which we will be using today. Uh, to have an LLM judge. Um, whether or not something is good or bad, which is also nondeterministic. Uh, but yeah, this is an online evals is good for like catching real world edge cases. It is super expensive, so you don't want to run it on every input, but maybe you sent like if you're like Gemini, you're Google, you get millions of hits a day, multiple millions, hundreds of millions, whatever, you could sample less than 1% of that to go towards like evals live, right?

[00:14:02]
And that's enough for you to like have a decent sized data set where you have like 100,000 engineers working on evals, and that's all they do. Right, yes. When you see failure, are there certain cases of like system prompt whatever you find like this is the cause that like will most likely bite you, or is it pretty all over the board? The more variables you have, the more all over the board it'll be, right?

[00:14:26]
Like if all, if the only inputs you have in our case is a simple tool loop of like, we have tools, each has descriptions. Their inputs have descriptions, um, we have a system prompt. And then we have chat history. Those are, and then we have the model itself, right? And then obviously I guess you have hyperparameters like temperature and stuff like that, but let's just say throw that out. That's 5 variables.

[00:14:48]
So those are 5 different possible things that you could change to influence the results of your evals. Now let's say you have something more complex. Sub-agents, agents that talk to other agents, OK, multiply that. And then you have like a different reasoning logic. Uh, we're not implementing a tool loop, we're doing like chain of thought or we're doing React, which is not to be confused with React JS but like reason and act where we plan steps before we execute and then we have a planner and an executor.

[00:00:00]
OK, that shit gets complicated, so the more of those variables you have, the more things you gotta like, you know, do this to see if the evals go up, right? It's kind of guessing and it's kind of not because it works. All you have to do is know all the different things you can change and how to change them. But you won't know to change them and what to change them for if you don't have the data in the first place.

Learn Straight from the Experts Who Shape the Modern Web

  • 250+
    In-depth Courses
  • Industry Leading Experts
  • 24
    Learning Paths
  • Live Interactive Workshops
Get Unlimited Access Now