The "Memory & Token Limitations" Lesson is part of the full, Build an AI Agent from Scratch course featured in this preview video. Here's what you'd learn in this lesson:
Scott discusses the importance of memory in AI systems and how it enables follow-up capabilities and task continuity. He also explains the limitations of memory in language models and different strategies for managing memory constraints. Scott also mentions the concept of retrieval-augmented generation (RAG) and its potential to improve memory and context in AI systems.
Transcript from the "Memory & Token Limitations" Lesson
[00:00:00]
>> Scott Moss: I had talked about how memory works and why it matters. For example, if I said, what is the weather like, and it said it's 72 and sunny, and then I ask follow-ups. Follow-ups are so critical. They are so hard to do right, especially once you get into tool calling and RAG and all this stuff. As humans, we do stuff like this.
[00:00:25]
We keep context in our heads. And if I'm talking to you, I imagine that you have that context in your head. So I'm just gonna say, what about tomorrow? Imagine if some random person came up to you and said, what about tomorrow? You'd be like, what the hell are you talking about?
[00:00:39]
[LAUGH] Who are you? What are you talking about? It's because you didn't see the other conversation they had, right? You don't have the context. So it's very important for the AI to have the context, so if you ask follow-up questions like this, it can answer them. Because otherwise, depending on the reasoning capabilities of the model you're using, it's either gonna hallucinate and be like, I'm just gonna make something up.
[00:01:00]
Or it's gonna be like, I don't know what the hell you're talking about, right? So that is probably the number one use case for having memory: you wanna be able to offer follow-up capabilities, and you can't do that without memory.
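To make that concrete: chat models are stateless, so "memory" at the API level just means sending the earlier turns back with every request. Here's a minimal sketch using the openai SDK; the model name and the weather exchange are placeholder examples, not the lesson's actual code:

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// The model is stateless: "memory" is just the prior turns
// you choose to send back with each request.
const messages: OpenAI.Chat.ChatCompletionMessageParam[] = [
  { role: "user", content: "What is the weather like?" },
  { role: "assistant", content: "It's 72 and sunny." },
  // Without the two messages above, this follow-up is unanswerable.
  { role: "user", content: "What about tomorrow?" },
];

const res = await openai.chat.completions.create({
  model: "gpt-4o-mini", // placeholder model name
  messages,
});

console.log(res.choices[0].message.content);
```

Drop the first two messages and the model has nothing to ground "tomorrow" in; that's the hallucinate-or-shrug failure mode described above.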
[00:01:14]
So that's really good. Also task continuity; we'll talk about that when we get into loops and multi-turn interactions. Yes, agents can run in loops and things like that. We'll talk about that. But then, yeah, memory limitations, so let's talk about this for a little bit. LLMs have fixed context windows.
[00:01:31]
I've been talking about this for a while. So you have to think of different strategies for balancing: I need this thing to know all this detail so I can offer follow-ups and have a conversation. But damn, this person that's using it is talking a lot, and it's running out of memory.
[00:01:50]
What do I do? So you have to think about what that looks like. I've seen a million different ways to do it, right? You might see some apps that count every token you send, because they know how many tokens their model's context can hold.
[00:02:10]
Once you reach a certain threshold, they'll start evicting tokens. Like an LRU cache, they'll go back to the oldest thing and start removing older parts of the conversation. So now your assistant doesn't have long-term memory. It has great short-term memory. But if you asked it something 12 days ago and you use it every single day, it's not gonna remember that, even though you're scrolling and you see it; it's visually in your chat.
[00:02:36]
They're not feeding that to the AI because they just evicted it. So that could be a bad user experience. But sometimes that works for apps that don't have a lot of volume, where you notice that people don't send thousands of messages a day in your app. They use it for two messages a day or something like that.
[00:02:52]
So maybe you can get away with that, because by the time they run out of memory, that's a month later, and they don't really care about that anymore. So you can just evict those messages; there's a sketch of that below.
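A minimal sketch of that eviction strategy, assuming a rough four-characters-per-token estimate (a real app would count with an actual tokenizer like tiktoken) and a made-up token budget:

```typescript
type Message = { role: "system" | "user" | "assistant"; content: string };

// Rough heuristic: ~4 characters per token. A real app would use an
// actual tokenizer (e.g., tiktoken) to count exactly.
const estimateTokens = (m: Message) => Math.ceil(m.content.length / 4);

const TOKEN_BUDGET = 4_000; // made-up threshold below the model's context limit

// Evict the oldest messages first (like an LRU cache) until we're under budget.
// A real implementation would pin the system prompt rather than evicting it.
function evictOldest(history: Message[]): Message[] {
  const messages = [...history];
  let total = messages.reduce((sum, m) => sum + estimateTokens(m), 0);
  while (total > TOKEN_BUDGET && messages.length > 1) {
    const oldest = messages.shift()!; // drop the oldest turn
    total -= estimateTokens(oldest);
  }
  return messages;
}
```

The UI can still show the full chat; eviction only changes what gets sent to the model, which is exactly why the assistant forgets things you can still scroll back and see.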
[00:03:06]
The other thing you'll see: every X amount of tokens, or every X amount of messages, summarize the conversation so far and put that in a system prompt. That way I don't have to send every single message every time. Maybe I only send the last five messages; everything else is a summary of the conversation so far, with an emphasis on details, emphasis on personal details. Put that in the system prompt.
[00:03:23]
And then, yeah, you lost some things, but at least you have enough there. The way I think about it is that's probably the closest to how a human would work, right? I guess it depends on your memory constraints, but I would say most people probably remember things that are more recent, right?
[00:03:39]
And things further back are a lot harder to recall because the details are a little fuzzy. It's hard to remember something from, I don't know, 10 years ago that wasn't that special in your life, versus something that happened yesterday, even if it was very ordinary. It's because it just happened yesterday; it's more recent.
[00:03:53]
So that's what they're doing with AI: take all these important things and not-important things and summarize them, with an emphasis on details. Put that long-term memory in the system prompt, and then only send the last five or so messages, so it knows very accurately about the things that just happened. That works; people do that.
[00:04:11]
I mean, over time you have to make sure your summary is constrained to X amount of tokens, otherwise your summary keeps getting bigger. Say you had this chat running for two years and it kept generating a summary, because when you generate a summary, you gotta give it the previous summary too.
[00:04:28]
Otherwise it doesn't know what came before, so it's like, make a new summary off of these messages plus the previous summary. Eventually, it's gonna start to forget things that happened 10 summaries ago, right? So eventually you still run into problems, and there's really no perfect solution for this.
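Here's a minimal sketch of that rolling-summary approach, assuming the openai SDK, a placeholder model name, and a made-up cap on the summary length. The new summary is built from the previous summary plus the messages being dropped, and the result goes into the system prompt:

```typescript
import OpenAI from "openai";

const openai = new OpenAI();
const KEEP_LAST = 5; // only the most recent turns go to the model verbatim

type Message = { role: "user" | "assistant"; content: string };

// Fold the turns we're about to drop into a rolling summary.
// Passing the previous summary in is what keeps older context alive --
// and also why details from "10 summaries ago" eventually fade.
async function updateSummary(prevSummary: string, dropped: Message[]): Promise<string> {
  const res = await openai.chat.completions.create({
    model: "gpt-4o-mini", // placeholder model name
    messages: [
      {
        role: "system",
        // Made-up token cap so the summary can't grow forever.
        content:
          "Summarize the conversation in under 300 tokens. " +
          "Emphasize personal details. Merge with the previous summary.",
      },
      { role: "user", content: `Previous summary:\n${prevSummary}` },
      {
        role: "user",
        content: `New messages:\n${dropped.map(m => `${m.role}: ${m.content}`).join("\n")}`,
      },
    ],
  });
  return res.choices[0].message.content ?? prevSummary;
}

// Build the request context: summary in the system prompt,
// plus only the last five messages verbatim.
async function buildContext(summary: string, history: Message[]) {
  const recent = history.slice(-KEEP_LAST);
  const dropped = history.slice(0, -KEEP_LAST);
  const newSummary = dropped.length ? await updateSummary(summary, dropped) : summary;
  return {
    newSummary,
    messages: [
      { role: "system" as const, content: `Conversation so far:\n${newSummary}` },
      ...recent,
    ],
  };
}
```

In practice you'd re-summarize only when new messages actually fall out of the window, not on every turn; this is just the shape of the idea.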
[00:04:42]
It's really just about balance. That's why, in the early days of ChatGPT, they encouraged you to just make a new chat. It's like, can you just make a new one? Because us trying to figure this shit out and remember it for you is not gonna work.
[00:04:54]
We just can't do it. Just make a new chat, please. And so that's usually the solution: just make a new chat, yeah.
>> Student: But could things like RAG be used to help with this?
>> Scott Moss: Yes, RAG, yeah. RAG stands for retrieval-augmented generation. It's a fancy term; I won't get into too much detail.
[00:05:15]
But basically, it allows the AI to go search for the most relevant parts of a previous conversation according to the user message, and bring just those parts in. Yes, that can help if you build a good RAG pipeline, and that is a multi-billion-dollar company, just figuring out how to do RAG.
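Just to sketch the retrieval idea (nowhere near a production pipeline): embed the past messages, embed the incoming user message, and pull in only the most similar turns. The embedding model name and top-k value here are placeholder choices:

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Cosine similarity between two embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Retrieve the k past messages most relevant to the new user message.
// A real pipeline adds a vector database, chunking, reranking, etc.
async function retrieveRelevant(history: string[], userMessage: string, k = 3): Promise<string[]> {
  const res = await openai.embeddings.create({
    model: "text-embedding-3-small", // placeholder embedding model
    input: [userMessage, ...history],
  });
  const [query, ...docs] = res.data.map(d => d.embedding);
  return docs
    .map((emb, i) => ({ text: history[i], score: cosine(query, emb) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(d => d.text);
}
```

The retrieved turns then get injected into the prompt alongside the user's message, so only the relevant slice of history spends tokens.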
[00:05:37]
If there's one flashlight app of AI right now, it's trying to figure out how to do RAG well. And every day there's two white papers that come out saying, this is the best way to do RAG now. It's impossible to keep up. I literally sent out a tweet a week ago, or maybe two weeks ago, like, why is no one making RAG as a service?
[00:05:57]
This is so hard, and everyone needs it; this should just be a thing, and it just really isn't, because it really is very hard. I had to spend like a month on our app figuring out the garbage implementation we have today. So it's quite difficult, but when it works well, you're right.
[00:06:15]
It will do that, yeah.