Check out a free preview of the full AI Agent: From Prototype to Production course

The "RAG Overview" Lesson is part of the full, AI Agent: From Prototype to Production course featured in this preview video. Here's what you'd learn in this lesson:

Scott introduces RAG, or Retrieval Augmented Generation, which is a shift in how to approach LLM knowledge enhancement. Rather than relying on an LLM's built-in knowledge, RAG dynamically injects relevant information into the conversation by retrieving it from an external knowledge base.


Transcript from the "RAG Overview" Lesson

[00:00:00]
>> Scott Moss: Understanding and implementing RAG systems. RAG, Retrieval-Augmented Generation. It's kinda like when Angular 1 came out and they had transclusion, or what the hell was it called? I forgot. It's like, why are we calling it that? That's kinda what RAG is. It's like, I get it, but in reality it's just that the AI doesn't know something.

[00:00:20]
So you have to go get it. Well, let's take a step back before that. If you asked GPT-4o mini about the scores of last night's NBA game, would it know them? Why would it not know them?

[00:00:40]
>> Jamie: It wasn't in the training...
>> Scott Moss: It wasn't in the training data, exactly. So there are really two things it can do. It can lie, which, well, I guess not lie, because it doesn't know that it's lying. It can say something that, if you just read the sentence, grammatically makes sense, but isn't actually factual.

[00:01:00]
If, you know, the creators of GPT-4o mini put some guardrails in there, it would reply back and say, sorry, I don't know that. But those are the only two things you're gonna get. It's gonna say, I don't know, or it's gonna say something that makes complete sense in whatever language you ask it to respond in, but actually isn't factual.

[00:01:18]
So how do you get past that? Do you just have to, I don't know, is OpenAI updating their model every single day so you gotta go get the latest model? How does that work? No, it doesn't work that way, because training a model costs hundreds of millions of dollars.

[00:01:34]
So that is not actually what's happening. The other thing you might think of is, well, maybe I'm fine-tuning it, which is similar to training, but it's more like this: you can take an airplane somewhere, then when you get there, you can take an Uber, and then for the last mile, you can hop on a scooter.

[00:01:52]
That scooter is fine-tuning. It's just that last part, but everything before it was the training that cost hundreds of millions of dollars, right? So you still aren't gonna fine-tune every single day either. That would be annoying, so you wouldn't wanna do that. It costs a lot of money, needs a lot of data, and takes a lot of time.

[00:02:11]
So you wouldn't be doing that every single day. So how do you get the AI to know something that it doesn't know? Well, you can just tell it, right? The best way to tell the AI is in the system prompt. Whatever you put in the system prompt, the AI now has as a reference.

[00:02:30]
So if I say, here's everything you need to know about Jamie: here's how old he is, here's where he works, here's his favorite color, here's food that he likes, it now knows all of that, right? So, cool, if I want the AI to know about anything related to, say, a game last night, I can just put that information in the system prompt.
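To make that concrete, here's a minimal sketch of "just telling it" through the system prompt, using the OpenAI Node SDK. The model name, the facts about Jamie, and the setup are illustrative assumptions, not code from the lesson:

```ts
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Facts the model could never have picked up from its training data
const factsAboutJamie = `
Jamie is 34, works at Acme Co, his favorite color is green,
and his favorite food is ramen.
`;

const response = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [
    // Whatever goes in the system prompt, the model now has as a reference
    { role: "system", content: `Use these facts when answering:\n${factsAboutJamie}` },
    { role: "user", content: "What's Jamie's favorite food?" },
  ],
});

console.log(response.choices[0].message.content); // something like "Ramen."
```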

[00:02:49]
This is problem solved, great. But what if I wanted it to know about this entire book series that I read? Each book is a thousand pages and there are ten books, so that's 10,000 pages. How many tokens would that be, if roughly every four letters is a token?

[00:03:12]
Yeah, I don't think it would fit in the context window. I would eventually run out, and the AI would say, sorry, that's too many tokens, I can't do this for you. Even if you had something massive that could fit a million tokens, it would still cost you a lot of money.
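A back-of-the-envelope version of that math, with assumed page and character counts just for illustration:

```ts
// Rough token estimate for the ten-book series, using the
// ~4 characters per token rule of thumb from the lesson
const pages = 10_000;
const charsPerPage = 2_000; // assumption: roughly 400-500 words per page
const charsPerToken = 4;

const totalTokens = (pages * charsPerPage) / charsPerToken;
console.log(totalTokens.toLocaleString()); // 5,000,000 -- five times a 1M-token window
```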

[00:03:26]
Even if it could fit all the tokens, you would be charged for those tokens on every request that you send up. And because you have all those tokens, it's gonna be slower to get a response back, because it has to process all of that. And then you have this problem of, I forgot the actual name of it.

[00:03:44]
It's like, on the tip of my tongue. But basically, if you give the AI all this information and ask it a question, and the relevant information for that question is in the middle of it all, it's never gonna find it. It's only really good at finding information at the top and the end, not in the middle.

[00:04:01]
It kinda forgets. So the more information you have, the less likely it'll be able to give you an answer about something in the middle. So it's great that we have these new models coming out with 128K tokens, all these crazy token limits, but we still gotta pay a lot of money for that every single time.

[00:04:22]
It's still gonna be slow, and the AI, even though it has the information, is still gonna forget, because that's so much information for it to remember, right? So RAG tries to solve that. What RAG tries to do is find the exact information that you, or the AI, need to answer the question that you asked, and only that.

[00:04:46]
So it tries to find that information, grab it, put it in the prompt, the system prompt, wherever you want to put it, and then it answers. So you only have the bare minimum of what you need to answer the question. So one, it's faster when it gets to the inference level, and it's cheaper because you're sending up fewer tokens.
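Put together, the retrieve-then-augment loop might look roughly like this. The `searchKnowledgeBase` helper is hypothetical, a stand-in for a real embedding-plus-vector-store retrieval layer, and the model name is an assumption:

```ts
import OpenAI from "openai";

const openai = new OpenAI();

// Hypothetical retrieval layer (a stand-in, not a real library call):
// in practice this would embed the query and run a similarity search
// over a vector store built from your external knowledge base.
async function searchKnowledgeBase(query: string, topK: number): Promise<string[]> {
  const kb = [
    "The Lakers beat the Celtics 112-104 last night.",
    "Jamie's favorite color is green.",
    "Book 3, page 512: the heir's true name is revealed.",
  ];
  // Naive keyword match, just so the sketch runs end to end
  const firstWord = query.split(" ")[0].toLowerCase();
  return kb.filter((chunk) => chunk.toLowerCase().includes(firstWord)).slice(0, topK);
}

async function answerWithRag(question: string) {
  // 1. Retrieve: find only the chunks relevant to this question
  const chunks = await searchKnowledgeBase(question, 3);

  // 2. Augment: inject just those chunks, not the whole 10,000 pages
  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: `Answer using only this context:\n${chunks.join("\n---\n")}`,
      },
      { role: "user", content: question },
    ],
  });

  // 3. Generate: the model answers from the bare minimum of tokens
  return response.choices[0].message.content;
}

console.log(await answerWithRag("Lakers score last night?"));
```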

[00:05:07]
And because there's less information, fewer tokens in the prompt, it won't forget stuff, because it's not one big wall of text. So RAG tries to solve all of that. The problem is, RAG is really hard to do. I would say evals might be the number one problem right now with LLMs, and a very close second, maybe even a tie, is RAG.

[00:05:27]
Those are the two most challenging problems people are trying to solve today. You see a lot of companies trying to solve this, a lot of people, a lot of research scientists trying to figure out a new RAG approach, a new eval, a new RAG approach, a new eval.

[00:05:39]
I would say that's 90% of what's happening today, and then obviously you have the foundation model folks trying to make it better at reasoning. They're like, how do we make this thing reason better? How do we fit more tokens in the context window? But everyone else is trying to do these two things.
