Lesson Description
The "Multi-Turn Evals" Lesson is part of the full, AI Agents Fundamentals, v2 course featured in this preview video. Here's what you'd learn in this lesson:
Scott explains multi-turn evaluation, where the agent runs with message history and tools to judge outputs. He highlights its role in assessing complex tasks, user experience, and using language models to evaluate unstructured outputs.
Transcript from the "Multi-Turn Evals" Lesson
[00:00:00]
>> Scott Moss: Let's move on to multi-turn evals. So again, that's the agent. Everything from here on is just improving it through prompt engineering, tool coverage, and architectural changes. That's it. That's the agent. So the foundation is done. Everything from here is what I would say is the fun part, at least for me. The evals are pretty fun. I would say the most fun I have making agents would probably be the tools, like figuring out the really cool tools to make and stuff like that.
[00:00:27]
That's pretty fun. But that's mostly it. And then, you know, we'll have some other stuff at the end, but that's an agent. So we already know about single-turn evals; let's talk about multi-turn. The single-turn ones were more about tool selection, you know. Did you pick the right tool given this prompt? Did you pick the right set of tools, in the right order, in the case of parallel tool calling?
[00:00:53]
You know, that's essentially how we eval that, because we just want to make sure that the agent, on a single-turn basis, can decide on the right action to take given the current conversation, which might just be one prompt from a user. It might be a full conversation that we simulated with a messages array, but whatever it is, we just want to evaluate what its very next decision is, on a step-by-step basis, right?
[00:01:22]
That's the single turn. Multi-turn is: let the agent run on a loop completely. We're not looking at the individual steps. We're going to give it a message history, we're going to give it a set of tools, we're going to let it run completely on a loop, and then we want to judge the output based off of the prompt and the tools. So that's the theory behind that, right?
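To make that concrete, here's a rough sketch of what a multi-turn eval harness can look like. None of this is the course's actual code: `runAgent` and `judgeOutput` are placeholders passed in for whatever agent loop and judge you've built.

```ts
// A minimal sketch of a multi-turn eval: run the whole agent loop, then judge
// only the final output. `runAgent` and `judgeOutput` are hypothetical
// stand-ins for your own agent loop and LLM judge.
type Message = { role: "user" | "assistant" | "tool"; content: string };

type MultiTurnCase = {
  messages: Message[]; // the conversation history we start the agent with
  tools: string[];     // names of the tools the agent is allowed to call
  criteria: string;    // what the judge should look for in the final output
};

async function runMultiTurnEval(
  testCase: MultiTurnCase,
  runAgent: (
    messages: Message[],
    tools: string[]
  ) => Promise<{ output: string; toolCalls: string[] }>,
  judgeOutput: (output: string, criteria: string) => Promise<number>
) {
  // Let the agent run its full loop: pick tools, read results, repeat until done.
  const { output, toolCalls } = await runAgent(testCase.messages, testCase.tools);

  // Judge only the end result, not each intermediate step.
  const score = await judgeOutput(output, testCase.criteria);
  return { output, toolCalls, score };
}
```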
[00:01:45]
That's just how I've done evals. These are just two of the many different types of evals that I typically do. But usually when I'm developing an agent, especially something that's very sophisticated with tools and stuff, I start off with these two things. It's just my baseline. So, you know, why does multi-turn even matter? Well, as you saw and as you just did, we built a loop.
[00:02:12]
At the end of the day, if you're using an agent, do you care what order the agent did things in to get the job done, as long as it got the job done? Probably not. I mean, there are some edge cases where it's like, oh wait, don't run that one. That one looks at my email. I don't want you looking at my email. So there might be some things you don't want it to do, but as long as it got the job done, didn't take too long, and wasn't expensive (if you're tracking costs), you only really care about the output.
[00:02:43]
So the single turn stuff is a good indicator that it can reach an output more efficiently. So that's more, you know, for us. The multi-turn stuff is also for us developers, but it's also very user facing. This helps us eval what a user might experience because they put something in, they wait, they wait, they wait, and then they get something out. So we're really evaluating what it would be like for a user to use this system.
[00:03:07]
So that's why it matters, because agents don't work on single turns, right? They receive a task, call a tool, process the result, figure out what to do next, call another tool or respond, and they keep doing that until they're done. So we need something to check that whole system, right? Some of the things that a single-turn eval would miss in the case of a full agent would be: did the agent pick the first tool correctly, but then pick the wrong second tool?
[00:03:37]
Because on a single turn there's only one chance to pick one tool (unless it's parallel); it can't pick another one because we're only testing one step. So we want to see: does it get it correct over time, right? We can check that. Does the agent get stuck in a loop, right? Like, it picks a tool, we send back a result, it sees the result, it asks for the same tool, we send back the result, it asks for the same tool.
[00:03:59]
It's just stuck in a loop forever and ever. That can happen. We want to check for that. Does the agent misinterpret tool results? So we'll have a tool, we'll send something back. Let's say the agent has a tool called read image, right? And we send the image back as a base64-encoded string. Does the agent understand that? Can it interpret that or not? We want to know if this agent can understand base64-encoded images, or whether we need to figure out another way to give the agent access to images through something more native.
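That stuck-in-a-loop check is one you can do deterministically. Here's a sketch (not from the course) of flagging a run where the agent kept asking for the same tool with the same arguments:

```ts
// Flag a run where the same tool was called with identical arguments several
// times in a row, which usually means the agent is stuck in a loop.
type ToolCallRecord = { name: string; args: string };

function looksStuckInLoop(calls: ToolCallRecord[], threshold = 3): boolean {
  let repeats = 1;
  for (let i = 1; i < calls.length; i++) {
    const sameAsPrevious =
      calls[i].name === calls[i - 1].name && calls[i].args === calls[i - 1].args;
    repeats = sameAsPrevious ? repeats + 1 : 1;
    if (repeats >= threshold) return true;
  }
  return false;
}
```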
[00:04:33]
Does it give up too early? As in, you know, the prompt was very vague, the tools are super granular, so the agent has to do a lot of work to figure out what tools to call, in what order, in what steps, and that work was just so exhausting that it gave up. It was just like, you know what? I don't want to do this anymore. Or it'll lie and say, I can't do this, or, it's impossible. We want to see that, right?
[00:04:55]
And believe me, it happens. Or the agent just doesn't know when to stop, which is something we used to see three years ago, in the early days of people trying to make autonomous agents: it would just keep going forever and ever and ever. It literally wouldn't know that it had all the information it needed to answer the question, and it would just keep asking for more information, AKA more tool calls.
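The usual guard against that runaway behavior, which Scott gets to next, is a hard cap on turns. A minimal sketch, with `callModel` and `executeTool` as placeholders for your own LLM call and tool runner:

```ts
// Cap the agent loop at a maximum number of turns so it can't run forever.
type LoopStep = { toolCall?: { name: string; args: unknown }; text?: string };

async function runCappedAgent(
  messages: { role: string; content: string }[],
  callModel: (messages: { role: string; content: string }[]) => Promise<LoopStep>,
  executeTool: (name: string, args: unknown) => Promise<string>,
  maxTurns = 20
): Promise<string> {
  for (let turn = 0; turn < maxTurns; turn++) {
    const step = await callModel(messages);

    // No tool call means the agent thinks it's done: return its answer.
    if (!step.toolCall) return step.text ?? "";

    // Otherwise run the (possibly mocked) tool and feed the result back in.
    const result = await executeTool(step.toolCall.name, step.toolCall.args);
    messages.push({ role: "tool", content: result });
  }

  // Hit the cap: stop instead of letting the agent run away.
  throw new Error(`Agent exceeded ${maxTurns} turns without finishing`);
}
```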
[00:05:20]
So it would just keep going and going and going. So a lot of the early AGI experiments were definitely capped at a max number of turns, right? Like, put a max turn limit in here of, say, 20 steps. After that, stop, because it will just run away. So these are things that we want to eval, right? You can't test that on a single turn. Now, because we are evaluating outputs, there's a question of how, unless we're doing structured outputs. We didn't talk about those, but they are exactly what they sound like: we can have the LLM output JSON.
[00:05:48]
I guess technically we have already done structured outputs. A tool call that comes back from an LLM is a structured output. It's not text, it's an array of objects, right? That's structured. It's always going to be that schema. You can tell the agent to respond that way too. You can give it a schema the same way we give a tool a schema. We can give an agent a schema and say, hey, the response that I expect from you should be this schema.
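For reference, this is roughly the shape a tool call comes back in from a chat-completions-style API: an array of objects with a fixed schema, not free text. The field names follow the OpenAI format, but the id, tool name, and arguments here are made up:

```ts
// An assistant message containing tool calls: structured data, not prose.
const assistantMessage = {
  role: "assistant",
  content: null, // no text; the model is asking for tools instead
  tool_calls: [
    {
      id: "call_abc123", // made-up id for illustration
      type: "function",
      function: {
        name: "search_files", // hypothetical tool name
        arguments: '{"query": "quarterly report"}', // arguments as a JSON string
      },
    },
  ],
};
```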
[00:06:14]
So unless it's that, as in it's just human language, how do we eval that? How do we eval human language? The only way to eval human language is with intelligence. I mean, obviously you can do pattern matching with regexes and things like that, but LLMs are nondeterministic. If you're testing things with regexes, I would imagine the prompt you gave the agent, or the assistant prompt you gave the agent, is screaming at it: you must return this word every single time. Because otherwise, how are you going to make that pattern matching succeed?
[00:06:48]
At some point you're going to have misses, your eval is not going to be good, so you're not going to have good results. So really, the only way to reliably test outputs that aren't structured is to have some sort of intelligence. That intelligence is either a human looking at it, like a subject matter expert, or another LLM. That's it. So enter LLM as a judge. You will hear this a lot. LLM as a judge is exactly what it sounds like.
[00:07:22]
We're going to use an LLM to judge the output of our LLM, right? It's just LLMs all the way down. It always has been. So we'll create a judge that understands the task, the rules of our eval, and the scoring rubric and what to look for, and it just outputs a score for us, right? In our case it's very easy because our agent is just a single LLM on a loop, so it's super easy.
[00:07:49]
Typically what you do is have the judge be a more powerful model than the LLM you're using in your loop. Now, if you're already using the most powerful model, then you would still use the most powerful model, or you could have multiple judges that collectively produce scores and you average them out. There's no wrong answer here. You can do whatever you want.
[00:08:09]
You make it up as you go, right? There's no wrong answer here. It's a science, but it's also very situational. You're just trying to get some baselines here, right? So yeah, we could do things like, you know, given this task and the tool results, is this correct? We can have the LLM score for factuality. Like, we put a bunch of context in the messages array, we ask our agent a question based off that context, and then the judge gets the context we gave it, the question we asked, and the answer that the LLM spit out.
[00:08:39]
Is that factually correct? Is the information it stated actually in the information that we gave it? So we could do that, you know. Does this even make sense? Does this make logical sense? Or, you know, is it lying? Did the agent accomplish the goal based off of the conversation that we gave it, right? So it's pretty flexible, you know. Semantic understanding, that's the intelligence part that I talked about, and the most important thing is that it handles variation. That's why we have to do it this way.
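A factuality judge like that is mostly prompt construction. Here's a sketch; the wording and the 0-to-10 scale are illustrative, not the course's rubric:

```ts
// Build a factuality-judge prompt from the context the agent saw, the user's
// question, and the agent's answer.
function buildFactualityPrompt(context: string, question: string, answer: string): string {
  return [
    "You are judging an AI agent's answer for factual accuracy.",
    `Context the agent was given:\n${context}`,
    `Question the user asked:\n${question}`,
    `Answer the agent gave:\n${answer}`,
    "Is every claim in the answer supported by the context above?",
    "Reply with a score from 0 (fabricated) to 10 (fully supported) and a one-sentence reason.",
  ].join("\n\n");
}
```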
[00:09:10]
Now, obviously this thing has limitations, because LLMs cost money: they cost money in tokens, in electricity. So if you're running thousands of evals and each one of them has an LLM as a judge, you're racking up costs. You know, evals might become your most expensive line item, more expensive than the users that are actually using your product. So you have to figure out some system. I think this is where open source models, hosted on either your infrastructure or some other infrastructure, are a really good candidate for LLM judges, because those things typically are so much cheaper on a token basis, you know, and faster depending on the hardware you have.
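The back-of-the-envelope math is worth doing before committing to a judge on every eval. Every number below is a made-up placeholder; swap in your real eval count, token sizes, and model pricing:

```ts
// Rough cost estimate for LLM-as-a-judge evals. All figures are hypothetical.
const evalsPerRun = 1000;          // hypothetical: eval cases in one run
const tokensPerJudgeCall = 3000;   // hypothetical: prompt + response tokens per judge call
const pricePerMillionTokens = 5;   // hypothetical: USD per 1M tokens for the judge model

const costPerRun =
  (evalsPerRun * tokensPerJudgeCall * pricePerMillionTokens) / 1_000_000;

console.log(`~$${costPerRun.toFixed(2)} per full eval run`); // ~$15.00 with these numbers
```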
[00:09:48]
So it really just depends, but there are different ways around that. Yeah, and then obviously gaming is a big one. I wanted to talk about that one because if you pick a really smart model, well, these models are trained with reinforcement learning, as in goal achieving. They want to achieve a goal. So if they're really smart, they'll figure out ways to achieve that goal. So you have to be careful in how you create these judges, because if you give them a hint about the goal you're trying to achieve, they might just give you that, because they can.
[00:10:17]
It's like, damn, you know, actually the easiest way for me to achieve my goal is just to lie to you, so I'll just do that. It's so much easier, and now you need a judge for your judge. You don't want that, right? So you have to be careful. It's kind of like that thing they have in product management, the Mom Test, where you're trying to interview customers about using your product, but you don't want to bias them, so you have to figure out a way to ask the right questions without influencing them.
[00:10:45]
It's kind of like that with LLM as a judge. You've got to get them to do the thing, but you don't want to tip them off or push them one way. It's a little weird. As for making it more reliable: like I talked about, structured outputs, we already did it, since tool calling is a type of structured output. And think about it: we're using the judge because it's looking at text that we can't judge ourselves without a human, so having the judge spit out more unstructured text would defeat the purpose.
[00:11:24]
We need to quantify the qualitative output of our agent. The best way to do that is to have the judge spit out some numbers, and we can do that with structured output by giving it a schema. And like I said, use a stronger model. So, the data strategy for multi-turn: it's not too different from what we did on the single turn. It's the same thing. You have an input, which is what the user is saying.
[00:11:46]
You have a list of available tools that the agent can use. You have some mock tool results, so we mock out the tools and what they will return. This matters more now because we're doing multi-turn: we will be executing the tools, but not the real tools that we made. I still don't want our evals touching the file system and things like that; they would take forever and it would just be annoying.
[00:12:04]
So we're still going to mock them out. In single turn they were never executed, because all we cared about were the descriptions; on a single turn we don't feed the results back into the agent, so we never needed to execute those tools. But in multi-turn we do need to feed the results back into the agent so it can go to the next step. So those tools will be executed, and that's why we're going to mock them out in our case.
[00:12:26]
And in the case of another agent, maybe you don't need to mock them out because the tools are very inexpensive, but in the case of file systems and shell scripts and the internet, we're mocking that out. I'm not about to sit here and wait for that, right? And then you also want the expected behavior, so what should happen, right? And then the evaluation criteria, so how to judge success. As you guessed, because we're using LLM as a judge, this would most likely be a prompt, like in the sketch below.
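Pulled together, one multi-turn eval case could be shaped something like this. The field names and the scenario are made up for illustration; the point is that each case carries its input, tools, mocked results, expected behavior, and judge criteria as plain data:

```ts
// A sketch of a single multi-turn eval case as data.
type MultiTurnEvalCase = {
  input: string;                            // what the user says
  availableTools: string[];                 // tools the agent may call
  mockToolResults: Record<string, string>;  // canned result per tool name
  expectedBehavior: {
    expectedToolOrder?: string[];           // tools we expect, in order
    forbiddenTools?: string[];              // tools the agent must not touch
  };
  evaluationCriteria: string;               // prompt text handed to the LLM judge
};

const example: MultiTurnEvalCase = {
  input: "Summarize the TODOs in my project and save them to notes.md",
  availableTools: ["list_files", "read_file", "write_file"],
  mockToolResults: {
    list_files: '["src/index.ts", "src/agent.ts"]',
    read_file: "// TODO: add retry logic\n// TODO: handle rate limits",
    write_file: "ok",
  },
  expectedBehavior: {
    expectedToolOrder: ["list_files", "read_file", "write_file"],
    forbiddenTools: ["send_email"],
  },
  evaluationCriteria:
    "Score 10 if the response accurately summarizes the TODOs and confirms they were saved; " +
    "score 0 if it invents TODOs or claims success without writing the file.",
};
```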
[00:12:55]
You would prompt the judge: this is what you need to look for. If you see this, it's a 10. If you see this, it's a 0, right? And figure that process out. We talked about manipulating the messages array. So, for instance, if you want to eval something mid-conversation, well, in your data, instead of just having the one message that the user says, we can go ahead and prime the messages with a conversation that already happened, right?
[00:13:22]
Make sure we leave off on a user message and then have the agent take over and see what happens, right? So we can prime it to be like: let's have a conversation where the user said this, the agent said this, the agent asked for this tool call, here's the fake response, the agent asked for this tool call, here's the fake response, the agent responded with this, the user asked a follow-up question, go.
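As data, that primed history is just a messages array that ends on a user turn. The conversation below is invented to illustrate the shape, and the role names follow the common chat format (a real API would also want tool-call ids, omitted here):

```ts
// A messages array primed with a conversation that "already happened",
// ending on a user follow-up so the agent takes over from there.
const primedMessages = [
  { role: "user", content: "Find my latest invoice" },
  { role: "assistant", content: "Let me look that up." },
  { role: "tool", content: '{"id": "inv_42", "amount": 120}' }, // fake tool result we hand-wrote
  { role: "assistant", content: "Your latest invoice is #42 for $120." },
  { role: "user", content: "Great, can you email it to my accountant?" }, // the follow-up we want to eval
];
```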
[00:13:42]
Now we're evaling follow-up questions, right? So we can fake all of that if we want. Now we're like, all right, I want an eval for this because we've noticed that after a lot of people ask for something like this, there's always a follow-up question, and the agent always fails there. So let's set that up with this messages array, right? So it's like a mid-conversation thing. Mocking tool results, I mean, is pretty simple.
[00:14:06]
You just put a results field in there with whatever you want, a string. And then expected behavior. It's very similar to what we had for single turn, so we have the tools. In this case, because it's multi-turn, we can have an expected tool order. Forbidden tools, we have that as well; these are tools you should not use, in the case of something like a negative prompt that we don't want to handle. And then output quality, which would be for the judge, right?
[00:14:36]
So in this case we would say: hey judge, here was the original task, this is what the agent was supposed to do, and here are the mock results. Here's the response. Does the response make sense given this task and given the results that it got, right? And so the judge can look at that. So we can combine all those evaluators, average them out, and get some score, right?
So some of these are deterministic, like, you know, tool order and tools avoided, but for the semantic stuff we need the judge.
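One way to wire that up, sketched here with a made-up 0-to-10 scale and equal weights, is a couple of deterministic scorers plus a judge callback, averaged into a single number:

```ts
// Combine deterministic checks with a semantic judge score. The scale,
// scorer names, and equal weighting are assumptions for illustration.
type RunResult = {
  toolCallsMade: string[]; // tool names in the order the agent called them
  finalOutput: string;
};

// Deterministic: did the expected tools happen in the expected order?
function scoreToolOrder(run: RunResult, expectedOrder: string[]): number {
  const relevant = run.toolCallsMade.filter((t) => expectedOrder.includes(t));
  return JSON.stringify(relevant) === JSON.stringify(expectedOrder) ? 10 : 0;
}

// Deterministic: did the agent stay away from forbidden tools?
function scoreToolsAvoided(run: RunResult, forbidden: string[]): number {
  return run.toolCallsMade.some((t) => forbidden.includes(t)) ? 0 : 10;
}

// Semantic: delegate to an LLM judge that returns a 0-10 number
// (for example, via structured output so it can't ramble in prose).
async function overallScore(
  run: RunResult,
  expectedOrder: string[],
  forbidden: string[],
  judge: (output: string) => Promise<number>
): Promise<number> {
  const scores = [
    scoreToolOrder(run, expectedOrder),
    scoreToolsAvoided(run, forbidden),
    await judge(run.finalOutput),
  ];
  // Simple average; weight the judge more heavily if that fits your use case.
  return scores.reduce((a, b) => a + b, 0) / scores.length;
}
```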