Lesson Description
The "Single-Turn Eval Executor" Lesson is part of the full, AI Agents Fundamentals, v2 course featured in this preview video. Here's what you'd learn in this lesson:
Scott explains creating evals using data files with input-output pairs to test AI tool selection and improve tool descriptions. He walks through making mock tools and a single-turn executor that uses conversation history for dynamic evaluation.
Transcript from the "Single-Turn Eval Executor" Lesson
[00:00:00]
>> Scott Moss: All right, let's make our first evals, and I'm going to walk and talk at the same time on this because there's a lot to cover. It's not too bad; I think it makes sense once you do it. So in the data file we have some data that I didn't want us to have to write, because we're not going to sit here and write JSON. Remember I said we can make fake data, right? This is the data that we make.
[00:00:25]
This is the data that I made, the data in my head. I'm like, I think these are the things that I want to eval against, right? Given that our agent's going to have a file system, web search, a bash tool, essentially a fake version of code execution, what are some things that I think people might ask for? So I wrote some input-output pairs for that, right? That's what I did in these files.
[00:00:48]
You can take a look at them; they try to solve different use cases. So I broke them up by multi-turn, which we'll get to later, and single turn. For a single turn, I'm just checking that it selects the right tool, or even the right tools, plural, in the right order, based off the prompt that I gave it, and I'm going to eval that. That's what I'm trying to score. I'm just like, if somebody asked you this, you better pick this tool, given the descriptions and the things I gave you.
[00:01:14]
And I would expect you to pick this tool, then this tool, then this tool, and if you don't, why? Given this prompt, right? Like if I literally wrote a prompt that said, hey, read this file, then write this file, then delete this file, I would expect you to run those three tools in that order because that's what the prompt said. And if you don't, that would probably tell me that my tool descriptions are bad, so I need to go fix those, right?
[00:01:39]
So that's what this data is for; you can read through them. But yeah, we have this data prop, and this is going to get passed to some executor. We'll talk about the executor in a minute. The target is basically what we expect it to be, right? So when this prompt is passed in, I'm going to give the LLM access to this tool called run command, which is a bash tool that we're going to make; we're mocking it out for the eval.
[00:02:08]
I expect the run command tool to be called. And this category thing is just so I personally know that this is a secondary use case, not a golden use case where I'm like, we for sure need to solve this. This is more like a secondary use case where, if our agent figured this out, that's really cool and I'm glad that it did, but this was pretty generic: can you start the development server?
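As a rough sketch, one of these entries might look like the object below. The exact field names are mine to illustrate the shape being described, not necessarily what's in the course repo:

```ts
// A hypothetical single-turn eval entry; field names are illustrative.
const entry = {
  prompt: "Can you start the development server?",
  tools: ["run_command"],             // mock tools the executor will expose
  expectedToolCalls: ["run_command"], // the target: what we score against
  category: "secondary",              // a secondary use case, not a golden one
};
```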
[00:02:33]
Without enough context, I don't know if it would have known to use the terminal command. You didn't tell it anything else, you didn't tell it what the command was. You didn't add any context into the agent that says, here are all the commands you can run. So if it tried to do a terminal thing, sure, but I'm not really expecting it to, so I'm going to say this is a secondary thing.
[00:02:56]
This is not like a primary use case that I think our agent should solve for, but it's cool to measure, to see that our agent is smart enough to handle secondary things. This would indicate that we might want to build some features around better context gathering to help with these secondary things, and maybe we can make these secondary things primary things because the context gathering got so good.
[00:03:22]
It gets complicated, but there we go. Okay, so we want to make some executors. What is an executor? An executor is basically just, it's kind of like what we put in the run file, that's an executor, it's a runner, right? It takes in one of these pieces of data and it does the thing. In this case, we're only doing a single turn with all these different variables and it's going to execute it and it's going to get a result.
[00:03:50]
It's literally an implementation of the runner, right? You can think of this as just a variation of the runner that we're making, and that runner will change depending on what prompt we give it, what system prompt we give it, what tools we give it access to. It's just a different variation of the runner. That's what an executor is. I generalized them, and we really only need two of them.
[00:04:14]
We need a single-turn executor; this is an executor we can use for single-turn evals. Then we have a multi-turn executor; this is an executor that we can use on a full agent task, multiple turns, multiple loops, right? So let's start with the single-turn executor. What we're going to do is go to evals, excuse me, we're going to go to executor. Make sure I have my notes pulled up here.
[00:04:43]
And like I said, it's very similar to the run function, so we need to import a lot of that stuff. We're going to import generateText; we're going to import, let's see, stepCountIs, we're going to use that; and import the tool helper. Because right now we don't have any of these tools, and realistically you probably don't want to run these tools in an eval, because I'm only evaluating whether you picked the right tool.
[00:05:11]
I'm not actually going to execute these tools, so I don't need to import any real tools. That's why we're writing the evals now, even though we don't have any of the tools that we're evaling for. I don't need the tools to be done; I'm just evaling on the descriptions of what the tools will be, and that can help me change them in the future. The tool itself doesn't have to be done. It's like TDD: you're writing a test for a function you haven't made yet, but that doesn't mean you can't write the test, because you know what it should do.
[00:05:40]
But in this case I'm never going to execute the tool. I don't want my eval to have to interact with the file system and be slow like that. That is a unit test that I can write for that execute function; I don't need AI to test that functionality. That's a unit test, so just write a unit test for that code. I just need the AI to be evaled on the thing it's supposed to do, which is picking the right tool.
[00:05:59]
So in that case, I don't actually need the tool. I just need the things that the AI sees, which is just the description of the tool, right? So we're just going to mock those out, if that makes sense. Right, so we got that, and then I'm just going to import this type right here called ToolSet, like this, and this is from 'ai'. The next thing we want to do is just import openai from, oh, I have so many things open.
[00:06:36]
There we go, from there. And then z from zod. Cool. So we already have these types; I'm just going to move them down here, I don't know why I had them up there. There we go. These are just types that are going to help us. And the first thing we're going to do is make these mocked-out tool definitions, right? We've already made a tool before, right? We had a date time tool, and we know that there's a description, right?
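Put together, the import block looks roughly like this. This assumes AI SDK v5-style names (stepCountIs, ToolSet); adjust to whatever version you're on:

```ts
// Imports for the single-turn executor (a sketch of what was just typed).
import { generateText, stepCountIs, tool, type ToolSet } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";
```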
[00:07:01]
There's a schema, there's a function that runs; this is a tool definition, right? We just need to make some fake ones. These are not the real ones we're going to use, but they represent the descriptions that we want to eval, the ones we might use if the evals are good, right? Depending on what our strategy is: is it hill climbing, where we always start with this, or is it starting with some data from somewhere, right?
[00:07:29]
But this is where we start, right? So in that case, I'm going to say toolDefinitions. This is just some mock stuff here, and I'm just going to make this an object. The tools that we're going to need are read file, we're going to have that; and pretty much anything related to file system stuff, so we got read file, write file, and list files, I believe, is what we're going to have.
[00:08:00]
We're going to have delete file. And then what else do we have here? Oh yeah, we're going to have run command, which is like run the shell command, run the bash script, whatever, right? So we're going to mock all these up; we're going to make these, right? They just need what the tool function will take in: in this case, a description and an input schema. The execute function is always optional.
[00:08:37]
You don't even have to pass this in. You don't need that. It's always optional. Okay, so for read file, let's give it a description. What would you tell the LLM about this tool to help it select this tool, right? So I would say read the contents of a file. That's it. I think in the notes I have read the contents of a file at the specified path. Sure, let's do that because you got to pass in a path, sure.
[00:09:03]
But that's also room for improvement, so we got the description there, right? And then we got the parameters. This will just be a Zod schema, just like you would with the input. And then you probably already guessed it based off the description, it's going to take in a path, which is going to be a string, and then I can give this a description of being like the path to the file that you want to read.
[00:09:41]
I think it's describe, maybe. Yeah, there we go. So that's our mocked tool definition, right? We do the same thing for the other ones, so I'm just going to copy that. Write file, same thing. You don't have to put what I'm putting, you can put whatever you want, it's your eval, but for write file I'm going to say: write the given content to the file at the given path. Right, and I'll say path, the path to the file you want to write to, and then I'll say, what's the second one that this thing takes?
[00:10:55]
Oh, content. And I'll say this is also a string. Let me spell describe right; yeah, I'm tripping. There we go: the content you want to write to the file. Right. List files, very similar: list all the files in a directory. Boom. Pretty simple. It just takes one argument, which is the directory in which you want to list the files. Pretty simple. Delete file, also really simple: delete a file at the given path.
[00:12:08]
The path to the file that you want to delete. Pretty easy. And run command, I mean, you can get creative with this. I'm going to say: executes a shell command and returns its output. So yeah, pretty damn scary. Command: the shell command to execute. That's it. There's no wrong answers here. You can describe these however you want. You could put A, B, C, D; your evals are going to be bad, but that's okay, that's a baseline.
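Here's a sketch of what two of these mocked definitions look like; the other three follow the same description-plus-Zod-schema pattern:

```ts
// Mocked tool definitions: only what the LLM sees, no execute function.
const toolDefinitions = {
  read_file: {
    description: "Read the contents of a file at the specified path",
    parameters: z.object({
      path: z.string().describe("The path to the file you want to read"),
    }),
  },
  run_command: {
    description: "Executes a shell command and returns its output",
    parameters: z.object({
      command: z.string().describe("The shell command to execute"),
    }),
  },
  // write_file, list_files, and delete_file follow the same shape.
};
```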
[00:12:50]
You know, you'll get a raise if you show that, hey, your evals went from this to this, right? So that might actually be a good strategy to start off with. Okay, so these are just our mocks. And now what we want to do is actually make the single-turn executor. So we'll say const singleTurnExecutor; it's an async function. Again, this is very similar to the run function, except it's going to take in that data object, which is going to be the eval data type, and remember, that's one of these.
[00:13:25]
It's just going to take in one of these, or actually, I'm sorry, one of these, since I did this. Right, so if we hover over that and go to it, it's just this: a prompt, an optional system prompt, some tools, optional config for the model, right? Just all the things you might want to change across some experiment, given the simple nature of our system. If you had a more advanced system, you would have more things you would want to change there, or you'd have to go make an evaluator for every single variation.
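A sketch of that shape, with field names as assumptions based on what's described here:

```ts
// Assumed shape of one eval data entry; names are illustrative.
type EvalData = {
  prompt: string;        // the user input to test
  systemPrompt?: string; // optional system prompt
  tools: string[];       // which mock tools to expose, by name
  model?: string;        // optional per-entry model override
  temperature?: number;  // optional sampling config
  category?: string;     // human annotation like "golden" or "secondary"
};
```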
[00:13:56]
This is just my way of making one evaluator that can take in options, and then in your data set you can pass in those options. But that also makes it hard to pass in live data, because the live data wouldn't have those options. So there are trade-offs. For a live eval, this probably wouldn't work, because the live data wouldn't have, oh yeah, this is a negative one or a golden one, and here's the model to use. It wouldn't have all that, you know.
[00:14:23]
Because none of that stuff is in the wild. This is stuff that I put there. This is human annotated. So that's why it's like you might have a human pass over the data when it's coming in live before it goes to an eval, and a lot of eval tools have the ability for you to have humans review stuff, right? That's a group of people's job. I mean, AI is built on human review, let's be real. Okay, cool. So we have that.
[00:14:52]
And the first thing we want to do is build up a conversation, so we can say messages equals buildMessages, and we pass in data. So what does this do? This is going to introduce something new that we otherwise wouldn't cover until we get to the agent loop, but because we're doing it now, I'm going to tell you about it. So far we've only been passing in a prompt to generateText, right?
[00:15:26]
Like if I go to run, we've only ever been passing in a prompt. Yeah, that's cool, but if you want to pass in a conversation, you wouldn't use prompt, you would use messages. Okay. Messages is an array of objects, and these are in order from first to last, as in the first is the oldest message and the last is the latest message. It's a conversation between at least two parties, usually the assistant and the user. So let's go look at a ModelMessage type so you can see it; of course, when I click on that, it explodes into a bunch of different subtypes.
[00:16:04]
Okay, at its simplest, it kind of looks like this. There's a role; the role can be user, which is a user, or assistant, which is the AI. And another one is the tool role, for tool calls, right? There are others, and other models support others, but that's the gist of it. It's an object that has a role property on it and a content property on it. And that content is the content of the message.
[00:16:30]
It's usually a string. If you have a model that's multimodal, it might not be a string; it might be an object with attachments, file attachments, right? And if this was a tool call, the content would be the name of the tool and the ID of the tool; if this was a tool call response, the content would be the results of that tool call, right? So it's not always a string, but for the case of AI and humans talking, it's most likely a string unless you're doing structured outputs.
[00:16:58]
So yeah, messages, that's how you can pass a conversation, and what's cool about it is you can just put whatever you want in there; it's just an array of objects. So if I want to fake a conversation history, I'll just put those objects in there, and when the agent sees it, it sees a conversation history. So if I want to prime my eval up to a certain point, like if I want to see whether my agent picks the right tool after the user has asked it three times.
[00:17:23]
So I just faked that in the conversation history. I literally put a user message followed by a fake AI response, followed by a user message, followed by a fake AI response, followed by a user message and then I want to eval that AI response right there. I want to eval what it says after that, right? I can just put whatever conversation state that I want in that messages array to prime it to be whatever I want, right?
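As a quick sketch, a primed history might look like this; the messages themselves are made up for illustration:

```ts
import type { ModelMessage } from "ai";

// A fabricated history: the model only sees these objects, so it treats
// them as a real conversation. The last message is from the user, so the
// model will respond to it.
const primedMessages: ModelMessage[] = [
  { role: "user", content: "Can you delete old-notes.txt?" },
  { role: "assistant", content: "I couldn't find that file." },
  { role: "user", content: "It's in the docs folder, try again." },
];
```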
[00:17:53]
And this is where online evals are great because you could just take someone's conversation and just plug it in there. I'm like, all right, let's eval that. What happened? Right, so it's pretty good. Now, at the same time, I don't use this in production. Not because I think it's bad, I think it's great. I personally don't use a traditional tool loop. Not because it's not good, like I think it solves 90% of the problem, but I think there's a lot of value in creating your own tool loop.
[00:18:21]
But we could talk about that later. For the most part, that's fine. So anyway, let's get back to our executor. We're going to call buildMessages, and it does exactly what it says: it takes the prompt from that data and puts it in a message format. You can see it puts the system prompt in there, and then it automatically takes that prompt that's on that data object and formats it as a user message.
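A minimal sketch of what buildMessages does, assuming the EvalData shape from earlier:

```ts
import type { ModelMessage } from "ai";

// Turns an eval entry into a messages array: optional system prompt first,
// then the prompt formatted as a user message (so the model will respond).
function buildMessages(data: EvalData): ModelMessage[] {
  const messages: ModelMessage[] = [];
  if (data.systemPrompt) {
    messages.push({ role: "system", content: data.systemPrompt });
  }
  messages.push({ role: "user", content: data.prompt });
  return messages;
}
```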
[00:18:40]
If you give an LLM a message array where the last message is of type user, it will respond as if you typed something and hit enter. If the last message was type assistant, it's not going to respond to that. It's going to think it's waiting on you to respond, because it said something last, so you should be saying something, so it won't respond. Does that make sense? If the last message is type user, it's always going to respond to that.
[00:19:35]
If the last message is anything but that, it will not respond, because why would it? You didn't ask it something. So, quick little hack. So we're going to do that. We'll make a tool set, so we'll say tools; it's a ToolSet, an empty object. And then from here, we're just going to go ahead and generate this mock tool set, so I'll say for toolName of data.tools, there we go. And then from here we just want to get the definition from the toolDefinitions.
[00:20:15]
We pass in the tool name like that. And this thing is, or sorry, what do you want? Let's see. Yeah, oh my God, TypeScript sometimes, man. Okay, I will type you if you want to be typed. I will do that. And I'm going to put any. Yeah, get out of here, I'm not doing all that. It's too much work. This is what I use AI for, but I turned it off for the course, so I'm not doing all that.
[00:20:58]
What are you talking about? Get out of here. So yeah, then basically, if there is a definition, we need to actually create a tool. So we'll say tools at that toolName, and it's going to equal a tool from the tool helper, and we're just going to pass in our description, which will be that definition's description. Yep, and then we're going to pass in our inputSchema, which should be that parameters field.
[00:21:35]
Is that what I call it? Yeah, parameters. Cool. So we're just dynamically making tools based off those definitions. Nothing new, we've seen this before. You could put an execute function, right, but we don't need that. Okay. And then we're going to generate some text, so we'll say text, or, well, I guess in this case we want toolCalls since we're testing tool calls, so toolCalls equals await generateText.
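In code, that loop looks roughly like this; the keyof cast is a sketch standing in for the typing he waved off with any:

```ts
const tools: ToolSet = {};
for (const toolName of data.tools) {
  const definition =
    toolDefinitions[toolName as keyof typeof toolDefinitions];
  if (definition) {
    // No execute function: we only care which tool gets selected.
    tools[toolName] = tool({
      description: definition.description,
      inputSchema: definition.parameters,
    });
  }
}
```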
[00:22:12]
Here we want to pass in a model, you know, this is where you could just have evals that just try out different models because you're like, we're comfortable with what we're doing, but some new thing came out, so let's test this new model, you could do that, but I have it set to where the data itself can supply the model if it wants, so you can pass that in or we can just default to something. I'll just default to mini, like that.
[00:22:46]
You could just do that. Pass in the messages array, right? Pass in our tools that we just made. I'm going to say stopWhen, and then I'll say stepCountIs(1), just to force it. Even though I think it already does this by default, I don't know, I'm just going to force it. And because I'm using o1-mini, which I think doesn't do temperature, oh no, this won't actually break on this model.
[00:23:11]
I think you'll get a warning from the AI SDK saying like you can't do temperature on a reasoning model and O1 is a reasoning model. So instead what I'll do is I'll just say if you did supply the temperature and you added the model, then like you kind of already know what you're doing, but otherwise put undefined. So I put undefined and this whole field just gets omitted in the API call, so OpenAI never sees it in the first place, right?
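The call ends up looking something like this. It's a sketch: the default model name is my assumption from the transcript, and the temperature logic follows the "only if you supplied your own model" rule he describes:

```ts
const { toolCalls } = await generateText({
  // Per-entry model override, else default to a mini model (assumed name).
  model: openai(data.model ?? "o1-mini"),
  messages,
  tools,
  stopWhen: stepCountIs(1), // force a single turn
  // Only pass temperature if the entry supplied its own model too;
  // undefined is dropped from the JSON body, so OpenAI never sees it.
  temperature: data.model ? data.temperature : undefined,
});
```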
[00:24:05]
When you JSON stringify an object, anything that's undefined gets omitted, so that will get omitted over the wire in HTTP. So then I get my tool calls, which would be, let's say, calls equals toolCalls.map. We want to go over all these tool calls and convert each one to something like this: toolName is tc.toolName, and then args, if there are args on the tool call, give me them, otherwise just give me a blank object.
[00:24:45]
Cool. And then we'll get the names: toolNames equals toolCalls.map, get our tool call, and then we'll just get the toolCall.toolName, like that. And we just want to return this really cool object that has our tool calls, our tool names, and, I'll talk about this in a minute, but basically, were any tools selected, true or false. So toolNames.length greater than zero, because if there are tool names in here, that means you picked a tool.
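And the tail end of the executor, as a sketch; AI SDK v5 exposes a tool call's parsed arguments as `input` (older versions called it `args`):

```ts
const calls = toolCalls.map((tc) => ({
  toolName: tc.toolName,
  args: tc.input ?? {}, // parsed arguments, or a blank object
}));
const toolNames = toolCalls.map((tc) => tc.toolName);

return {
  toolCalls: calls,
  toolNames,
  // Quick boolean to score on: did the model pick any tool at all?
  selectedTools: toolNames.length > 0,
};
```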
So I just want to know quickly, did you pick a tool? Yes or no? Because that's something I can quickly score on. It's just a little helper method here. There's really nothing crazy here, it's just some meta code around the run function. We just made it more dynamic. I'm not really doing anything crazy here, it's just a dynamic run function that takes in some objects, right? So we got that.