AI Agents Fundamentals, v2

Coding a User Message

Scott Moss
Netflix

Lesson Description

The "Coding a User Message" Lesson is part of the full, AI Agents Fundamentals, v2 course featured in this preview video. Here's what you'd learn in this lesson:

Scott walks through building a multi-turn executor with mocks, structuring messages, and collecting tool calls and results for evaluation. He emphasizes experimentation and fine-tuning to optimize prompts and model performance.


Transcript from the "Coding a User Message" Lesson

[00:00:00]
>> Scott Moss: That's the system prompt. In a chat app, the agent doesn't typically just send you a message out of nowhere; you have to send a message first. So typically the first message in the messages array after the system prompt is a user message. So we'll say user. And then from here we'll just say content. Here's the task that the user was given. Right. The task was target.originalTask.

[00:00:37]
And then we want to add a bunch of other context in here, right? So we want to say: that was the original task, and here are the tools that were called, right? We can put that in here as well. We have to stringify this because everything has to be a string, so we'll output a tool call order, so you got that. We also want to give you the tool results for each single tool.

[00:01:07]
So we'll do that as well, and we probably want to JSON stringify that as well. And we just want to do the target.mockToolResults. And then we want to show you the agent's final answer, which would just be the output.text. Right? And then, just in case you don't know what you're doing, you know: evaluate if this response correctly uses the tool results to answer the task.
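
For reference, here is a rough sketch of what that user message could end up looking like. The variable names (target, output) and property names (originalTask, toolCallOrder, mockToolResults) are assumptions reconstructed from this walkthrough, not necessarily the course repo's exact code:

```ts
// Hypothetical content for the judge's user message; names are assumptions.
const judgeUserContent = [
  `Original task: ${target.originalTask}`,
  `Tool call order: ${JSON.stringify(target.toolCallOrder)}`,
  `Tool results: ${JSON.stringify(target.mockToolResults)}`,
  `Agent's final answer: ${output.text}`,
  'Evaluate if this response correctly uses the tool results to answer the task.',
].join('\n\n');

messages.push({ role: 'user', content: judgeUserContent });
```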

[00:01:55]
There are so many ways we can represent this. We could put this in the system prompt and see if that works better. We could convert this to XML; some models prefer that over JSON stringifying. We could try to clean this up and put some better spacing in here. Like, you'd have to experiment. OK. So once we get that result, we can return the result.
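
As one example of that kind of experimentation, the same context could be rendered with XML-style tags instead of JSON.stringify blocks; the tag names here are made up for illustration:

```ts
// Same information as above, just formatted with XML-ish tags.
const xmlStyleContent = [
  `<task>${target.originalTask}</task>`,
  `<tool_call_order>${JSON.stringify(target.toolCallOrder)}</tool_call_order>`,
  `<tool_results>${JSON.stringify(target.mockToolResults)}</tool_results>`,
  `<final_answer>${output.text}</final_answer>`,
  'Evaluate if this response correctly uses the tool results to answer the task.',
].join('\n');
```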

[00:02:31]
It's always a .object. When you do generateObject, that object is going to be in the shape of the schema that you gave it, so there's a .score. Right, so we could say object.score; because it's 1 through 10 and we want 0 through 1, we'll divide by 10. Just another LLM call. Pretty self-explanatory. The next thing we need to do is make our executor, because right now all we have is the single turn with mocks.
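
A hedged sketch of what that judge call could look like. The walkthrough only confirms generateObject, the .object property, a 1-through-10 score, and dividing by 10; the model id, the schema details, and the surrounding function are assumptions:

```ts
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

// Hypothetical judge scorer; model id and schema shape are assumptions.
async function judgeResponse(systemPrompt: string, judgeUserContent: string) {
  const { object } = await generateObject({
    model: openai('gpt-4o-mini'), // assumed model id
    system: systemPrompt,
    prompt: judgeUserContent,
    schema: z.object({
      score: z.number().min(1).max(10), // the judge scores 1 through 10
      reasoning: z.string(),
    }),
  });

  // The eval wants 0 through 1, so divide by 10.
  return object.score / 10;
}
```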

[00:02:59]
We need to make it multi-turn with mocks, because a single turn isn't an agent. So we're essentially just going to do something very similar to what we just did in the run function, but we're not going to stream, we're not going to do any of that stuff. But we do need to, you know, have our loop that does the thing, and collect all the information that we need so we can report it.

[00:03:22]
We're not going to make the loop from scratch though, because we can just use the Vercel AI SDK to do the looping, but we'll have to collect the tool calls so we can show them to the judge in the format that we want. So inside this executor file we're going to do just that. So we'll say export const multiTurnWithMocks, async; it's going to take in just the data, which is just going to be a multi-turn eval data.
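
Roughly, the executor's outer shape as described. MultiTurnEvalData is an assumed name and shape for one eval case, based on the fields mentioned in this lesson:

```ts
import type { ModelMessage } from 'ai';

// Assumed shape for one multi-turn eval case.
type MultiTurnEvalData = {
  prompt?: string;
  messages?: ModelMessage[];
  maxSteps?: number;
  mockToolResults: Record<string, unknown>; // mock returns keyed by tool name
};

// Sketch of the executor; the body is filled in over the rest of the lesson.
export const multiTurnWithMocks = async (data: MultiTurnEvalData) => {
  const tools = buildMocksTools(data); // mocked tools, sketched below
  // ...messages, generateText, and result collection follow
};
```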

[00:04:00]
And from here we just want to build the tools. I have this helper function called buildMocksTools, like this. And it's going to do exactly what it says it's going to do: it's going to make sure that these tools are returning and that they're formatted with the correct tool function, right? So, same thing we used for the single-turn version: make sure the tool format is there, and in this case the execute function returns whatever the mock return for that tool is. We're going to set that mock return up, and it will most likely be inside of our data, so we can control that mock on a per-eval basis.
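
A rough idea of what a helper like buildMocksTools could do, assuming the mock returns live on the eval data keyed by tool name; the real helper in the course repo may be structured differently:

```ts
import { tool } from 'ai';
import { z } from 'zod';

// Hypothetical mock-tool builder: each tool keeps the normal AI SDK tool shape,
// but execute() just returns the canned result from the eval data, so the mock
// is controlled per eval case.
function buildMocksTools(data: MultiTurnEvalData) {
  return Object.fromEntries(
    Object.entries(data.mockToolResults).map(([name, mockResult]) => [
      name,
      tool({
        description: `Mocked ${name} tool`,
        inputSchema: z.object({}).passthrough(), // `parameters` in older AI SDK versions
        execute: async () => mockResult, // always return the canned mock result
      }),
    ])
  );
}
```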

[00:04:47]
And then we want to go ahead and create our messages, so let's do that. Got our model messages type here; this is going to be an array. And if you already have a conversation that you want this executor to start with, we can say that. So if, like, hey, you already have some messages, that's fine, we'll start with those, or we'll just make one right here. We are going to make one.

[00:05:13]
Then, you know, this is where you probably want to get a little more sophisticated with your eval system so you can try to, like, reuse code and not have to write things twice. But, you know, at least the system prompt is using the same one as our runner, so that's the same. And then we need to go ahead and kick it off with a user message here, right, and the content of that, which will just be data.prompt.
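
In code, that message setup might look something like this, where SYSTEM_PROMPT stands in for whatever shared prompt the runner already uses:

```ts
import type { ModelMessage } from 'ai';

// Reuse an existing conversation from the eval data if there is one, otherwise
// seed with the shared system prompt and the task as the first user message.
const messages: ModelMessage[] = data.messages ?? [
  { role: 'system', content: SYSTEM_PROMPT },
  { role: 'user', content: data.prompt! }, // "!": evals are expected to always set prompt
];
```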

[00:05:39]
Putting an exclamation to tell TypeScript that that's always there. And then from there we just want to generate text. So we'll say result equals await generateText, get our model. Let's, you know, do something very similar to what we had with the runner, but you can also pass it in; that way we can do different model things, test this model versus that model. Otherwise I'll just do GPT-4 Mini, which is what I've been consistently using.

[00:06:18]
So we can do that, passing our messages. Passing our tools. Oops, you need to spell that. And then, because we don't want to build this loop ourselves, we'll just say stopWhen, then we can say stepCount, and we want this to be controllable by the data as well, so we can say if you passed in a maxSteps, then we'll honor that. Otherwise, you know, pick a number, 20.
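
Put together, the call could look roughly like this. stopWhen with stepCountIs lets the AI SDK run the tool-calling loop; the model id shown here is an assumption:

```ts
import { generateText, stepCountIs } from 'ai';
import { openai } from '@ai-sdk/openai';

// Let the AI SDK run the multi-step tool loop; cap it at maxSteps (or 20).
const result = await generateText({
  model: openai('gpt-4o-mini'), // assumed; could also be passed in via the eval data
  messages,
  tools,
  stopWhen: stepCountIs(data.maxSteps ?? 20),
});
```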

[00:06:50]
Whatever is going to match the data that you're sending and the tools that you have that you think the steps would allow. OK. And then from here it's mostly just doing a lot of collection so we can pass to the judge the right format of things that it's expecting. So in this case, what we will want to do is get the tools that were called in the order that they were called, the results of those tools, and the final result of everything, right?

[00:07:22]
We want to collect all that, and that way we can show it to the judge in the format that it wants. So we'll say allTools here. This is just going to be an array of strings. We want to get the steps, so we'll say result.steps. We get access to the steps property on the generateText result if we do the internal loop with the AI SDK.

[00:07:56]
So for each step, we want to get the calls, right? So we're going to say, OK, stepToolCalls, because it might be parallel; one step might have multiple tool calls, so we want to collect all of those. We can say step.toolCalls, and in this case, if there aren't any, we'll just knock this out. And then for each one of those, we want to get the information from this tool call, so we want to say allToolCalls.push, we're collecting all of them, so put that tool name in there.

[00:08:41]
And then we're just going to return this object here, which is just a tool name, which is tc.toolName, and then the arguments, if there are any arguments, so tc.args. There we go. We're just going to collect all of those. We have to do this because we didn't write our own custom loop, so we have to go back and inspect all the steps and get the data ourselves.
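
Here is a sketch of that first pass over the steps, collecting the tool calls. The argument field is named args in AI SDK v4 and input in v5; shown here the way the walkthrough lands on it:

```ts
// Record every tool name in call order, plus per-step detail for the judge.
const allToolCalls: string[] = [];

const steps = result.steps.map((step) => {
  // one step can contain multiple (parallel) tool calls
  const stepToolCalls = (step.toolCalls ?? []).map((tc) => {
    allToolCalls.push(tc.toolName);
    return { toolName: tc.toolName, args: tc.args ?? null }; // `input` in AI SDK v5
  });

  // ...tool results and the step's text are collected below
  return { toolCalls: stepToolCalls };
});
```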

[00:09:33]
So now that we have that, the next thing we want to do while we're still here in the steps is get the step results. And to do that, we'll say step tool results. So what did this tool return? Yeah, I'll just call it toolResults, to be consistent with the notes. If there were any results, get the tool results, tr for short.

[00:10:20]
For each one of these, what I want to do is just return an object with a tool name and the result of that tool. So tr.toolName, and then the results, if there are results; otherwise we can just say null. And then lastly, inside of this map, we just want to return this object here, which is just going to be: here are all the tool calls for this step, which would be stepToolCalls.length greater than 0.

[00:10:59]
Sure, if there are, then go ahead and put those in there; otherwise we don't care. So we got the tool calls, we got the tool results for this step. Oops, there we go. Tool results for this step would just be the same thing, but for the results, so it'll be stepToolResults.length greater than 0. If so, the stepToolResults. Otherwise undefined.

[00:11:40]
And lastly, the text: if this was a final step, there wouldn't be any tool calls; there would just be text here, because the agent was done, so click that. Cool, so all that is in the steps map. And then outside of that, what we want to do is just get an array of all the tools that were used, essentially, so toolsUsed would be a new array. We can make a new Set here, and that way it's, like, deduplicated.
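
Continuing inside the same steps.map callback as in the sketch above, the tool results and the step's text get collected the same way, and fields are only included when the step actually has them. The result field is named result in AI SDK v4 and output in v5:

```ts
// Still inside the steps.map callback:
const stepToolResults = (step.toolResults ?? []).map((tr) => ({
  toolName: tr.toolName,
  result: tr.result ?? null, // `output` in AI SDK v5
}));

return {
  toolCalls: stepToolCalls.length > 0 ? stepToolCalls : undefined,
  toolResults: stepToolResults.length > 0 ? stepToolResults : undefined,
  // on the final step there are usually no tool calls, just the answer text
  text: step.text,
};
```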

[00:12:33]
ToolCalls, like that. Oh wait, what did I call it? Oh, I called it allTools. Oops, I'll be consistent: allToolCalls. There we go. And then lastly, we just want to return the text, the result of this whole process, not just the step but the whole agent, all the steps that were involved, all the tools that were used, and the tool call order, which should be allToolCalls.
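
And the executor's return value, sketched. The property names are assumptions, but the idea is the final answer text, the per-step detail, a deduplicated list of tools, and the raw call order, which is what the judge's message expects:

```ts
// Everything the judge needs, collected from the whole run.
return {
  text: result.text,                     // the agent's final answer
  steps,                                 // per-step tool calls, results, and text
  toolsUsed: [...new Set(allToolCalls)], // deduplicated tool names
  toolCallOrder: allToolCalls,           // every call, in order
};
```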

[00:13:08]
Cool. So now we have everything that the judge is going to need to inspect all of this information and give us a score. What we just did is just some process that I came up with; it's not some standard, and it's not necessarily the best way. Every time I go look up inspiration on how to eval agents, I've never seen the same thing twice, so there are just a lot of ways to do this. That's because people are using different technologies, different frameworks, different methodologies, different metrics, different evaluators; everything is just different. But the only thing that matters, and that's consistent, is that you're running experiments against some non-deterministic system, and there's a bunch of stuff in your tool bag that you can change to get the results closer to what you feel more comfortable with, right? It's prompt engineering, it's fine-tuning, it's changing models, it's hyperparameters, it's, you know, the reasoning framework, like what's happening in the loop, context management. All these are the different things that you can change to get different results.
