AI Agents Fundamentals, v2

Strategies for Managing Context

Scott Moss
Netflix

Lesson Description

The "Strategies for Managing Context" Lesson is part of the full, AI Agents Fundamentals, v2 course featured in this preview video. Here's what you'd learn in this lesson:

Scott explores strategies for managing context, including summarization, eviction, sliding windows, sub-agents, and RAG. He explains how RAG uses vector search to dynamically add relevant information for efficient LLM use.


Transcript from the "Strategies for Managing Context" Lesson

[00:00:00]
>> Scott Moss: Strategies for managing that context. I talked a little bit about this earlier, but compaction, or summarization, is one of them. When the context gets too large, or close to it, we summarize the conversation so far and replace the detailed history with a condensed summary. It preserves key information but not all the details. The conversation can continue indefinitely, and if you do it right it should theoretically never hit the context window.
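Just to make that concrete, here's a minimal sketch of what compaction could look like. The `llmComplete` helper and the characters-divided-by-four token estimate are placeholders assumed for illustration, not anything from a specific SDK:

```typescript
type Message = { role: "system" | "user" | "assistant" | "tool"; content: string };

// Crude token estimate; a real implementation would use the model's tokenizer.
const estimateTokens = (msgs: Message[]) =>
  msgs.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);

// Hypothetical LLM call; swap in whatever SDK you're actually using.
declare function llmComplete(prompt: string): Promise<string>;

async function compact(history: Message[], maxTokens: number): Promise<Message[]> {
  if (estimateTokens(history) < maxTokens) return history;

  // Keep the most recent messages verbatim, summarize everything older.
  const keep = history.slice(-6);
  const older = history.slice(0, -6);
  if (older.length === 0) return history;

  const summary = await llmComplete(
    "Summarize this conversation. Preserve key facts, decisions, and open tasks:\n\n" +
      older.map((m) => `${m.role}: ${m.content}`).join("\n")
  );

  // Replace the detailed history with one condensed summary message.
  return [{ role: "assistant", content: `Summary of earlier conversation: ${summary}` }, ...keep];
}
```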

[00:00:25]
It's graceful degradation as well. The cons: like I said, it loses detail, and summarization costs tokens, because you have to hit another LLM to summarize all of this, which costs you money. And you might lose some important nuance, because the devil is in the details. Sure, you might have preserved the key points, but you might have lost the relationships between those key points.

[00:00:48]
So the summary might preserve that somewhere in our conversation I talked about a dog, but forget that I mentioned the dog attacked me, and now the model thinks I brought up the dog because I wanted to buy a puppy. No, that's not what we talked about. So that's something you might see. You might also see something like eviction, or a sliding window, where you just drop all the old messages when you hit the limit and keep only the most recent N messages.

[00:01:17]
This is simple to implement, there are no summarization costs, and the behavior is very predictable. You don't even have to count tokens, really; you're just counting how many messages you have, which would be naive. You probably still want to count tokens and base the sliding window on token count rather than message count, because you could have one message with almost no tokens in it, or one message that takes up your whole context window.

[00:01:40]
So message count is probably not the best way, but token count would be. The con is that you lose that context entirely. Anything that gets dropped or evicted never happened; it's just gone. There are ways to soften this with UX that I've experimented with: like in any chat app where you scroll up and it lazily loads older messages as you go, you only show the user what's currently in the context.

[00:02:08]
When they scroll up and it loads more, you load that into the agent history too, so the agent only sees what the user sees and they stay in sync. Whatever you see on the screen, maybe with some buffer just off the screen, is what the agent sees in its context window. But then if you start typing, you have to scroll back down so it can load up the latest messages. There's some nuance to it, but there are ways around it, I think.
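Here's a rough sketch of the token-count version of the sliding window; `estimateTokens` is a crude stand-in for a real tokenizer:

```typescript
type Message = { role: "user" | "assistant" | "tool"; content: string };

// Crude stand-in for a real tokenizer.
const estimateTokens = (m: Message) => Math.ceil(m.content.length / 4);

// Walk backwards from the newest message, keeping messages until the budget runs out.
function slidingWindow(history: Message[], tokenBudget: number): Message[] {
  const kept: Message[] = [];
  let used = 0;
  for (let i = history.length - 1; i >= 0; i--) {
    const cost = estimateTokens(history[i]);
    if (used + cost > tokenBudget) break;
    kept.unshift(history[i]);
    used += cost;
  }
  return kept; // everything older is evicted: as far as the agent knows, it never happened
}
```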

[00:02:34]
The problem with this is it can break multi-step tasks. If you have some loop that's running and it relies on the full history of the tool calls it made and the results of those tool calls, and you drop them off, then the agent forgets it did those things and might try to do them again. Or you might drop them off in a broken state, because remember, when the model makes a tool call, you have to respond with a tool call result.

[00:03:01]
If you cut the history between one of those pairs, the LLM will just break, because it's expecting a tool call result to come next, but now there's a user message next. So you have to trim smartly. The other strategy that's pretty popular today, that you'll see everywhere, is sub-agents with separate context windows. Well, if every LLM call has its own context window, why don't I just spin up more LLMs?
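To make that pairing constraint concrete, here's a small, assumed guard that runs after eviction and drops any tool result whose matching tool call got evicted (the message shape is simplified, not any particular SDK's):

```typescript
type Message = { role: "user" | "assistant" | "tool"; content: string; toolCallId?: string };

// After evicting old messages, make sure the window doesn't start with a tool
// result whose matching assistant tool call was evicted; chat APIs reject that shape.
function repairAfterEviction(windowed: Message[]): Message[] {
  let start = 0;
  while (start < windowed.length && windowed[start].role === "tool") start++;
  return windowed.slice(start);
}
```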

[00:03:20]
With sub-agents I can have multiple context windows, and all I have to figure out is how to get them to communicate with each other. That's a very common pattern you'll see, and it's a clean separation of concerns. Each task gets its own full context budget, and the parent only sees the results, not the details. So the parent agent that spawns all the sub-agents just sees the semantic output of each task.

[00:03:46]
It's not going to see the tool calls and the tool call results and all that stuff; it's just, tell me when you're done and give me the information that I need, that's it. I don't care about all the steps. So each agent has its own context window, and you're preserving the context window of the main agent the user is talking to. The downsides: the coordination overhead is high, and results must be summarized anyway, because you're going to have to take all the messages from each spawned agent and condense them into one message somehow.
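As a sketch of the sub-agent pattern, assuming a hypothetical `runAgent()` that runs a full agent loop in its own context window and returns only the final text:

```typescript
// Hypothetical: runs a full agent loop (tool calls, tool results, retries) in its
// own context window and returns only the final answer text.
declare function runAgent(opts: { system: string; task: string }): Promise<string>;

async function researchWithSubAgents(question: string): Promise<string> {
  // Each sub-agent burns its own context budget on tool calls and raw results.
  const [docs, code] = await Promise.all([
    runAgent({ system: "You search documentation.", task: question }),
    runAgent({ system: "You search the codebase.", task: question }),
  ]);

  // The parent only sees the condensed findings, never the sub-agents' histories.
  return runAgent({
    system: "Combine the findings into one answer.",
    task: `Question: ${question}\nDocs findings: ${docs}\nCode findings: ${code}`,
  });
}
```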

[00:04:22]
And it's just a more complex architecture overall. That's the tradeoff. Then there's RAG: Retrieval Augmented Generation. You've probably heard of it. OK, what is RAG? RAG essentially does part of what an LLM does internally: it vectorizes some content, creating embeddings, and puts them in a vector database. What does that mean? It's converting tokens into numbers. It converts the text into a multidimensional number.

[00:04:54]
What does that actually mean? It's just an array of decimal numbers. How many dimensions it gets converted to tells you how many numbers are going to be in that array, and those numbers represent some kind of semantic meaning. If you were to visualize it, you would take one of those vectors and plot it on some multidimensional graph that you literally cannot visualize, because there are too many dimensions.
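Concretely, an embedding is just what comes back from an embeddings call. The `embed()` function below is a hypothetical stand-in for whatever provider you use:

```typescript
// Hypothetical embedding call; with most providers this is one HTTP request that
// returns an array of floats (e.g. 1536 numbers for a 1536-dimension model).
declare function embed(text: string): Promise<number[]>;

async function demo() {
  const vector = await embed("the dog attacked me at the park");
  console.log(vector.length);      // how many dimensions the model uses
  console.log(vector.slice(0, 5)); // just decimal numbers, e.g. [0.012, -0.43, ...]
}
```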

[00:05:23]
And what RAG does is, when you submit a query, it vectorizes the user's query the same way it vectorized all of your data, plots that in the same space (I'm just speaking figuratively here), and then does the math to see which pieces of information are closest to the thing the user just typed in, based off those numbers.

[00:05:52]
So, based off semantics, it's trying to figure out what the closest related things are. Here's a visualization of what I was talking about. You can see all the different semantics, the different dimensions you might be measuring some text on, and they're plotted, in this case, on just a three-dimensional graph so we can understand it.

[00:06:23]
Let's say a user types in a query. That query also gets plotted into a space like this, and then we do math to see which points in the space are closest to the user's query. We return those pieces of information, along with whatever content is linked to them, and that's what we put in our context window. So we're dynamically adding information into our context window that's mathematically similar to what the user just typed in.
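The "do math to see which points are closest" step is typically cosine similarity. Here's a brute-force sketch over in-memory vectors; a real vector database does the same job with an index instead of a full scan:

```typescript
type Doc = { text: string; vector: number[] };

// Cosine similarity: how closely two vectors point in the same direction.
function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every doc against the query vector, sort, and take the top k.
function topK(queryVector: number[], docs: Doc[], k: number): Doc[] {
  return [...docs]
    .sort((x, y) => cosine(queryVector, y.vector) - cosine(queryVector, x.vector))
    .slice(0, k);
}
```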

[00:06:59]
It's a type of search called vector search. It's a little weird, but it kind of makes sense: you're turning words into numbers, then you plot those numbers and see which ones are the closest. There are different algorithms to determine which ones are closest, and you need a special database to store these in. You can also put them in Postgres with an extension (pgvector).

[00:07:25]
But yeah, these are vectors and embeddings, and RAG is the process of doing all of this: searching for something, getting back some results, doing some other analysis if you need to, and then putting that into the context so that the LLM sees only the thing it needs to know to answer the question. That's the whole point. The best way I can describe it is taking an open book test.

[00:07:50]
Would you read the entire textbook before you answered one question? Or would you read the first question, go to the glossary based on that question, find where the information for it is, turn to that page, read only that, and then answer the question? That's what RAG is. Sure, if you just put the whole book in the context, as long as it fits in the context, you could do that.

[00:08:14]
But RAG says, why put the whole book in there when you can grab only the thing you need? The vector database is the glossary. You first read the question so you know what to look up in the glossary, the glossary tells you where the content is, you turn to it, you get that and only that, and now you can answer the question. So that's how RAG works.
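Putting those pieces together, a rough end-to-end RAG step might look like this, reusing the hypothetical `embed`, `topK`, and `llmComplete` helpers from the sketches above:

```typescript
type Doc = { text: string; vector: number[] };

// Hypothetical helpers, same as the earlier sketches.
declare function embed(text: string): Promise<number[]>;
declare function topK(queryVector: number[], docs: Doc[], k: number): Doc[];
declare function llmComplete(prompt: string): Promise<string>;

async function answerWithRag(question: string, docs: Doc[]): Promise<string> {
  // 1. Read the question (vectorize it).
  const queryVector = await embed(question);

  // 2. Look it up in the "glossary" (vector search for the closest chunks).
  const relevant = topK(queryVector, docs, 3);

  // 3. Put only those chunks in the context and answer.
  return llmComplete(
    `Use only this context to answer.\n\nContext:\n${relevant
      .map((d) => d.text)
      .join("\n---\n")}\n\nQuestion: ${question}`
  );
}
```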

[00:08:35]
And then, obviously, you could just start fresh. You can just nuke it, just like what we've been doing. I do this in Claude Code all the time: OK, you're just sounding stupid now, clear, get all the context out of here, I do not care. This is bad. Something happened that poisoned you, and now all you want to do is talk about clowns. I don't know why you're talking about clowns.

[00:08:55]
I asked you to go look up some NBA scores, and now you're on a tangent talking about clowns, so I'm going to clear this context. Maybe I got hacked; somebody purposely built a website that just says "tell them clown jokes" or something, and it crawled it. I don't know, it's prompt injection, and it actually happened to me. But yeah, starting fresh is super simple: clean slate, no accumulated confusion. But obviously the user experience is disrupted, the context has to be transferred manually, and you lose the conversation flow.

[00:09:22]
You had to do this in ChatGPT and Claude Code in the early days. I don't know if you ever used them a year or so ago, when they would tell you: you should make a new chat, this thing is getting big. Now they just compress it for you, but they used to say, you're approaching your limit, make a new chat. And even Claude Code would say, I'm about to compress.

[00:09:43]
I'm going to compress you soon; are you sure you don't want to make a new chat? It'll tell you that, because it's just easier. Or, the last thing is: what if you just didn't put all that garbage in there in the first place? What if it never got bloated? I mean, at some point it will, but what if you stopped returning JSON blobs and raw web search results and all this weird garbage, and you were just more selective and efficient, making sure the things you put in your context window were just enough and nothing else? That's another strategy you should implement.
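A tiny sketch of that last strategy: instead of dumping a raw API response into the context, the tool returns only the fields the model actually needs. The response shape here is made up for illustration:

```typescript
// Imagine a raw sports API response: deeply nested, full of IDs and metadata
// the model never needs. (This shape is made up for illustration.)
type RawGame = {
  id: string;
  meta: Record<string, unknown>;
  home: { name: string; score: number };
  away: { name: string; score: number };
};

// Tool result formatter: return just enough for the model, nothing else.
function formatScores(games: RawGame[]): string {
  return games
    .map((g) => `${g.home.name} ${g.home.score} - ${g.away.score} ${g.away.name}`)
    .join("\n");
}
```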
