Lesson Description
The "Token Limits & Context Windows" Lesson is part of the full, Practical Prompt Engineering course featured in this preview video. Here's what you'd learn in this lesson:
Sabrina highlights the importance of understanding tokens and the context window. Tokens are roughly 0.75 words on average. Since LLMs do not have a "memory," the full conversation is passed with each prompt so the LLM can "remember" what the conversation is about. Long conversations run the risk of filling the context window, which can lead to hallucinations.
Transcript from the "Token Limits & Context Windows" Lesson
[00:00:00]
>> Sabrina Goldfarb: We have a couple more slides to get through before the super fun prompting stuff, but this one is a really important one. I've been using the words "words" and "tokens" interchangeably up until this point, but I'd like to dial that back and make sure I'm using them correctly from here on. So, tokens are roughly 0.75 words, but not always. That ratio doesn't hold for code, and it doesn't hold for punctuation. But the same way that we use words to understand what we're saying, LLMs use tokens to understand what they're saying.
[00:00:39]
So, our words are broken down into tokens, each with an ID, and then ingested by the LLM, and that's how it understands what we've been saying. And like I said, tokens are roughly 0.75 words, but not always. Something really cool and interesting: if I write JavaScript with a capital J and a capital S, it produces different token IDs than all-lowercase javascript. I just think that's pretty cool, because there's no way we're ever going to learn every single token. There actually used to be a tokenizer tool on the OpenAI Playground where you could type in text and see the token IDs that came out of it, which is how we know that's true.
[00:01:25]
Unfortunately, that's been taken down, and I'm hoping someone will put it back up at some point, for at least one model, and then I'll happily add it to the resources for this class, because I think it was so cool to see the different tokens for the same exact word. You would also often see that the space in front of a word gets folded into the token. So this is not going to be a class about figuring out exactly how every word breaks down into tokens; just know that tokens are to the LLM what words are to us.
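If you want to poke at tokenization yourself, OpenAI publishes an open-source Python library, tiktoken, that exposes the encodings some of their models use. Other providers tokenize differently, so treat this as a rough sketch of the idea rather than the exact behavior of any particular model:

```python
# pip install tiktoken  (OpenAI's open-source tokenizer library)
import tiktoken

# cl100k_base is one of OpenAI's encodings; other providers use their own tokenizers.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["JavaScript", "javascript", " JavaScript", "What color is the sky?"]:
    token_ids = enc.encode(text)
    # Decode each ID individually to see where the splits happen,
    # including leading spaces getting folded into tokens.
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{text!r:>26} -> ids={token_ids} pieces={pieces}")
```

Running something like this is the quickest way to see that capitalization, spacing, and punctuation all change which tokens you get.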
[00:02:01]
It's also important for us to know that LLMs technically have no memory, right? And you might be like, wait a minute, Bri, of course LLMs have memory, because I can have a conversation all day and that LLM knows exactly what I was talking about a little bit ago, right? Technically, that's not the case. LLMs have no memory. The way they "remember" things is through their context window. We are sending cumulative tokens, meaning our entire input and output history, every single time we send a chat message to an LLM.
[00:02:39]
So let's say that I'm talking to Claude, and I send Claude "What color is the sky?" and Claude answers "Blue," and then I ask, "But what about when it's the sunset?" When I send that second message, it doesn't go alone; it gets sent as cumulative tokens.
[00:03:07]
It has "what color is the sky" and "blue" also attached in the background. Just because we don't see it doesn't mean it's not also there. So we just have to keep this in mind because LLMs have something called context windows, which are the maximum number of tokens that the model can remember. Now, models used to have fairly small context windows, and they're going to sound pretty huge because they're still in the thousands of tokens, right?
[00:03:35]
Now, models used to have fairly small context windows, and they're still going to sound pretty big because they were in the thousands of tokens, say 2,000 to 4,000. But now models can potentially have millions of tokens in their context windows. And we're going to talk a little bit later about how well they actually utilize those context windows, because you might be surprised by that. But just know that when we talk about the context window, it's the maximum number of tokens that the model can remember, right?
[00:03:57]
It's the amount of text we can stuff into that cumulative token history and send back to the LLM so it has the whole memory of the conversation. One really important thing to notice: when you hit the limit, the oldest context drops off silently. We will not be notified of this, right? If you're using a model that only has a 4,000-token limit and you go past it, what you talked about earliest in the conversation will start to fall off, oldest first.
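Different tools handle that truncation differently, but the basic mechanism looks something like this sketch, which uses the rough 0.75-words-per-token figure from earlier as a stand-in for a real tokenizer; the numbers and helper names here are made up for illustration:

```python
def estimate_tokens(text: str) -> int:
    # Very rough heuristic from the lesson: 1 token is about 0.75 words.
    return max(1, round(len(text.split()) / 0.75))

def trim_to_fit(history: list[dict], max_tokens: int = 4000) -> list[dict]:
    """Drop the oldest messages until the estimated total fits the window."""
    trimmed = list(history)
    while trimmed and sum(estimate_tokens(m["content"]) for m in trimmed) > max_tokens:
        trimmed.pop(0)  # oldest first, and silently: the user is never told
    return trimmed

history = [
    {"role": "user", "content": "Please don't burn down my home."},  # the instruction you care about most
    # ... thousands of tokens of wiring questions and answers later ...
    {"role": "user", "content": "OK, now walk me through wiring the new breaker panel."},
]

# Once the conversation outgrows the window, that first message is the first to go.
history = trim_to_fit(history, max_tokens=4000)
```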
[00:04:38]
And this is really important, because say I'm using the model to help me run a new electrical system through my home, and the most important thing I tell it is the very first thing I send: "Please don't burn down my home." Then all of a sudden I've run out of context window, and Claude has forgotten that I ever said "please don't burn down my home." Claude might be a little more willing to try some things that could burn down my home.
[00:05:07]
And I'm definitely joking, but also not joking, in the sense that we really can lose important context, right? So if I'm having a long conversation about the code I'm writing and I start thinking, why are you drifting off when the first thing I said was what my feature should be? Well, maybe that context has actually fallen off. Maybe you need to bring it up again. What if you attach a file to your context, back in your history, one you always want to reference, and then you add a bunch of tokens later?
[00:05:41]
Will the file still be part of your context, or is that going to drop off too? Fantastic question. It actually depends on the tool you're using, which is really interesting. Something like Copilot, Cursor, or Claude Code will probably behave very differently than plain chat, because in chat, if you've copied and pasted your code in once, that context is eventually going to fall off.
[00:06:09]
If you're using this in Copilot or Cursor, I can't say for sure, but it's likely trying to reread that file as long as you've kept it attached. So you're going to have to check and experiment. But this is something that's really important for us to consider with codebases too, right? I have a feeling that at least somebody here, or somebody watching at some point, will have done the @codebase thing in Cursor or Copilot to add their whole codebase and say, "Just write this code for me.
[00:06:43]
Here's my whole codebase," right? But we have to remember when we're doing that, we are taking up so much context window, especially imagine if you're in a monorepo, right? You're likely filling up that whole context window if you're using a smaller one of these models. So please, when you are trying to add code, try and limit what you're sending to the LLM as much as you can. If I only need to add a test file and, you know, maybe a frontend file or something like that, just add those two.
[00:07:18]
Please don't add the whole folder or the whole repo, because then you'll wonder, why am I getting really bad results? Well, it's trying to read so much context that it can't focus, same as us, right? If I'm adding to my repo, I'm not going to look at every file in the codebase in VS Code; that would be really overwhelming. I'm just going to look at the things I need, right?
[00:07:41]
The prior art, the examples in front of me that I actually need, and the files I need. So please try to keep that in mind. That's the big point about context windows: when you're writing code especially, try to give the minimal amount of context needed to still get a good output.
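One practical habit is to get a feel for how many tokens each file costs before you paste it into a chat. Here's a rough sketch using tiktoken's cl100k_base encoding as a stand-in for whatever tokenizer your model actually uses; the file paths are hypothetical:

```python
# pip install tiktoken
from pathlib import Path

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # stand-in; your model's tokenizer may differ

# Only the files you actually need for the change, not @codebase.
files = ["src/components/Checkout.test.tsx", "src/components/Checkout.tsx"]  # hypothetical paths

total = 0
for path in files:
    tokens = len(enc.encode(Path(path).read_text()))
    total += tokens
    print(f"{path}: ~{tokens} tokens")

print(f"Total: ~{total} tokens of your model's context window")
```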
[00:08:15]
OK, let's move on to the system message, which is the invisible personality behind every AI interaction. It's usually set by the provider: OpenAI, Anthropic, GitHub, Cursor, etc. And it takes up part of our context window, yikes, right? We can't see the system message; we didn't even know it was there until now. Now we know it's there, but we still can't see it. So how do we know how much context it's taking up? We don't. So if we're using a small model, especially one that can only take in, say, 4,000 tokens of context, what if our system message is 500 of those tokens?
[00:08:48]
The system message is always there in the background, no matter what, which means our context will fall off before the system message does, because the system message never falls off. It's what shapes how the model interacts and reacts to you. So maybe if I'm in Claude chat (and system messages have been leaked before, so I'm sure you can find an old GPT or Claude system message), maybe it says something like, "You are a friendly, general-purpose assistant; do X, Y, and Z, but obviously nothing hateful or dangerous," all those things.
[00:09:34]
But maybe if I'm in Copilot or Cursor, it's "You are a helpful coding assistant," right? And those very small changes to the system message make these AI models behave differently. This is also why the same model will act differently depending on the tool you use. If I use Claude 3.5 Sonnet in chat and I use Claude 3.5 Sonnet in GitHub Copilot or in Cursor, it might behave really differently, and that's because of this underlying system message.
[00:10:03]
When we use the APIs to build our own AI applications, we can usually set the system message ourselves. Some APIs are now moving to a "developer" message instead of a system message exactly, but all you need to know is that there is a system message back there, it's taking up context, and it's making the LLM behave in a certain way towards us.
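If you're calling an API directly, the system message is just another field you control. Here's a minimal sketch using the OpenAI Python SDK; the model name and prompt wording are illustrative (and, as mentioned, some newer OpenAI models use a "developer" role for this, while Anthropic's SDK takes a separate system parameter):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        # The system message: invisible to end users, but it still spends context tokens
        # and strongly shapes how the model responds.
        {"role": "system", "content": "You are a helpful coding assistant. Answer concisely."},
        {"role": "user", "content": "Why does my React effect run twice in development?"},
    ],
)
print(response.choices[0].message.content)
```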
[00:10:36]
One last thing about system messages: they're only private in a loose sense. They're easily jailbroken and exposed, which is how we've seen them leaked before. So please never put anything confidential, private, or potentially problematic for your company or yourself in your system message. Obviously no API keys or anything like that should ever go into any of these models, but especially not your system message; there's a good chance it could be leaked. People have also found ways to jailbreak the system message and get these LLMs to behave in different ways, and depending on the provider and the model, that's more or less challenging. But just keep in mind, when you're building your own AI applications, that the system message is much stronger than the user message, the user message being what I send if I go back to Claude and type in a message, right?
[00:11:22]
That's a user message: I'm literally just telling Claude something. The system message, like I said, is behind the scenes controlling things, but that doesn't mean it's private, and it doesn't mean it's perfect. It will be safer than an input message from a user, though. If I told my LLM to treat everyone like a pirate today, it would be less likely to give up on being a pirate if that instruction is in the system message rather than in a user message. But there are still ways around that as well.