Lesson Description
The "Model Token Limits" Lesson is part of the full, AI Agents Fundamentals, v2 course featured in this preview video. Here's what you'd learn in this lesson:
Scott explains building a custom compaction system, covering token counting, usage limits, and context window management. He discusses recursive compaction, potential data loss, and strategies for balancing performance and detail.
Transcript from the "Model Token Limits" Lesson
[00:00:00]
>> Scott Moss: The next thing we want to do is start working on compaction, because something like the web search tool will introduce tons of tokens. Now, this is assuming OpenAI doesn't already do this, because the web search tool is on their end. They might already have figured this out, and what they return to our agent might already be compacted. I would understand if they didn't return raw JSON results to our LLM.
[00:00:28]
I would also understand if they did. So regardless, just to be responsible, we're going to build our own compaction system, because we'll need it even without the web search tool. Eventually somebody's going to have a long, chatty conversation and you'll want to compact it, so we still need it. So we're going to implement that, and the first thing we need to do is count tokens and know what the limits are.
[00:00:51]
So there are a lot of ways to do that. We're going to do it the simple guessing way, so it's not 100% accurate, but it's close enough, okay? Because the exact way is just too slow and too resource-intensive, respectfully. So let's go fix some of this stuff. If we go into our code and into this context folder that we have up here in agent, there should be a model limits file in there.
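(The counting code itself isn't shown at this point, but the "simple guessing way" is typically a characters-per-token heuristic, roughly four characters per token for English text. A minimal sketch, with hypothetical names rather than the course repo's actual helpers:)

```ts
// Rough token estimate: ~4 characters per token is a common rule of thumb
// for English text. Not exact, but avoids running a real tokenizer, which
// is the slow, resource-intensive option Scott mentions.
export function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Rough total for a whole conversation by summing each message's content.
export function estimateConversationTokens(
  messages: { content: string }[]
): number {
  return messages.reduce((sum, m) => sum + estimateTokens(m.content), 0);
}
```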
[00:01:22]
If you check out this model limits file, it's already got some stuff. It's got a default threshold of 80%. I've already put down the model limits of the models that I've been using, so GPT-5: the input token limits, the output limits, and then the total context limits. I forgot I put those in here; I should have looked for those earlier. And there are default limits, just in case it's a model that we didn't put in our registry here.
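(The file itself isn't reproduced in the transcript, but based on what's described, an 80% default threshold, per-model limits for GPT-5, and a fallback, it might look something like this. The names and the fallback numbers are assumptions; the GPT-5 figures match OpenAI's published model page at the time of recording, but double-check current values:)

```ts
// Sketch of what the model limits file might contain.
export const DEFAULT_THRESHOLD = 0.8; // compact once usage hits 80%

export interface ModelLimits {
  inputTokens: number;   // max tokens the model accepts as input
  outputTokens: number;  // max tokens the model can generate
  contextWindow: number; // total context limit
}

export const MODEL_LIMITS: Record<string, ModelLimits> = {
  // Figures from OpenAI's GPT-5 model page: 400,000 total context
  "gpt-5": {
    inputTokens: 272_000,
    outputTokens: 128_000,
    contextWindow: 400_000,
  },
};

// Fallback for models missing from the registry (illustrative numbers)
export const DEFAULT_LIMITS: ModelLimits = {
  inputTokens: 100_000,
  outputTokens: 16_000,
  contextWindow: 128_000,
};
```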
[00:01:49]
So if you want to change your models, you'll have to make some new entries; otherwise it'll just default to this. You'll have to go online and figure out what those limits are. And there are some utility functions to help us do that, but we need to implement these two down here. We have isOverThreshold, which checks if the token usage exceeds the threshold, right? And then we need calculateUsagePercentage: what is the usage percentage so far, given the total tokens and the context window limit of this model, right?
[00:02:22]
So we need to implement these two, so let's do that. For isOverThreshold, it's pretty simple. We just want to return whether the total tokens are greater than the context window times whatever the threshold is. Basic math. So if the context window is 400,000, what is that times 0.8? Then, are the total tokens greater than 80% of that, right? Whatever the threshold is, that's what we want to check.
[00:02:47]
In fact, I might even just say greater than or equal to, really. And then calculateUsagePercentage, also not that difficult. We're just going to take the total tokens divided by the context window token limit and multiply that by 100. That tells us what percentage of the context window we've used already, right? Given all the total tokens, input and output, in this messages array, and given the total context window limit of this model, which for the ones we're using is 400,000, what is the percentage we've used so far?
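(Put together, a sketch of the two helpers as described might look like this, using greater-than-or-equal as Scott suggests; the signatures are assumed rather than taken from the course repo, and DEFAULT_THRESHOLD is from the registry sketch above:)

```ts
// Does current usage meet or exceed the compaction threshold?
// With a 400,000-token window and a 0.8 threshold, this trips at 320,000.
export function isOverThreshold(
  totalTokens: number,
  contextWindow: number,
  threshold: number = DEFAULT_THRESHOLD
): boolean {
  return totalTokens >= contextWindow * threshold;
}

// Percentage of the context window used so far.
export function calculateUsagePercentage(
  totalTokens: number,
  contextWindow: number
): number {
  return (totalTokens / contextWindow) * 100;
}
```

(So against GPT-5's 400,000-token window, 200,000 total tokens reports 50% usage and doesn't trip the 0.8 threshold; 320,000 does.)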
[00:03:37]
That way we can show the user, like, hey, this is how much you've used so far, right?
>> Student: What would happen if you changed those numbers, like removed some zeros from them? What do those thresholds represent?
>> Scott Moss: The 0.8, this one? Or do you mean the limits, the input limits, output limits?
>> Student: What are those numbers?
>> Scott Moss: Yeah, yeah, those are the token counts that we saw. So for instance, GPT-5, right?
[00:04:14]
GPT-5's token context is 400,000, right? So that's what I have in here, 400,000. It's just what the provider told me.
>> Student: Oh, gotcha. So yeah, if you lowered that, would it just make your...
>> Scott Moss: If I lowered that, then I would never hit the real context limit, because, you know, OpenAI says it can go up to 400,000, but if my system thinks it can only go to 40,000, then I'll never hit the limit. It doesn't change anything on OpenAI's side.
[00:04:43]
I mean, this number does matter on our side. If I made it 4 million, yeah, I'd get an error really quick, because I thought it was 4 million but OpenAI says it's only 400,000. So I'm just putting the limits that I saw on their website in here so I can do my math based off of those limits. These are not numbers that I came up with. I read them from their model cards.
[00:05:08]
This is what their numbers were. There are npm packages that already have these listed out for you, and they keep them up to date when new models come out. I just don't want to get burned later in the course if somebody kills that package on npm, or it ends up being the next thing creating a worm on someone's computer because somebody hacked npm. So I was like, I'm just not going to do that.
[00:05:28]
I'm just going to hard-code it. I don't want those problems. But yeah, I didn't come up with those numbers; that's what these models say. Every model has different limits. I just put them in here so I can do my math. And then the threshold is the number that I came up with. I just put 80%. If you're like, I like flying close to the sun, put 90%, or even 99%, you know, whatever you want to do.
[00:05:54]
When you want to compact, you can change that number and everything changes, right? It's up to you. Any other questions?
>> Student: When you end up compacting things, presumably you're compacting something you've already compacted, right? As you recurse over time, does that cause additional degradation?
>> Scott Moss: Yeah, yeah, I think that's one of the things I put in the notes: you compact one time, sure, you get one summary and lose a little detail.
[00:06:19]
You know, 300 chat messages later, now you're compacting again, which includes the previous summary plus this new summary you're about to make. Now you've got to merge those summaries. Yeah, it's going to get sloppy really quick. So summarization is not the best way, I think. I think there are better ways to do it, but it's really hard to get right; no single solution is perfect. Most agents are just like, I'm about to compact.
[00:06:47]
If you don't like that, you can just clear it, which is what Claude does. They're just like, hey, I'm about to compact, just letting you know, and it's going to get bad when I do. And usually that's when I'm like, yeah, no worries, clear it, or stop and start a new one. But if I really need what's in there, I might say, can you just put down everything we've talked about so far in this markdown file?
[00:07:07]
And it'll put it in a markdown file, not word for word, but kind of like a summary. That way when I start again I'm like, hey, go read this markdown file first. I had a previous conversation with another Claude; I want you to check that out, and then let's talk. So there's no perfect solution.