AI Agent: From Prototype to Production

Advanced RAG & Fine-Tuning

Scott Moss
Superfilter AI

Scott shares resources for building a more robust evaluation pipeline and covers advanced RAG and fine-tuning techniques. He also discusses some advances in generative UIs.


Transcript from the "Advanced RAG & Fine-Tuning" Lesson

[00:00:00]
>> Scott Moss: The last part is just, where do we go from here, guys? Beyond this, at this point, it's just more of the same. Once you get past everything I showed you today, it's more evals, more metrics, more prompt engineering, better RAG, and then you have the framework level.

[00:00:26]
Right now we have one agent; you might split that off into many agents, into something called a swarm. Let me pull up OpenAI Swarm. It's an educational framework, it's written in Python, but it's actually pretty easy to understand. Essentially, you have one agent whose job is to orchestrate all the other agents, and each agent is assigned a set of functions that it alone is bound to. That's called a swarm.
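
To make the pattern concrete, here's a minimal TypeScript sketch of the swarm idea. It is not OpenAI Swarm's real API (that library is Python), and all names are illustrative: one orchestrator routes the conversation, and each sub-agent is bound to only its own functions.

```ts
// Illustrative sketch of the swarm pattern (not OpenAI Swarm's real API,
// which is Python). One orchestrator routes between agents; each agent is
// bound to only its own set of functions.
type Tool = { name: string; run: (args: string) => Promise<string> };

interface Agent {
  name: string;
  instructions: string;
  tools: Tool[]; // the only functions this agent may call
}

const billingAgent: Agent = {
  name: "billing",
  instructions: "You handle refunds and invoices.",
  tools: [{ name: "refund", run: async (id) => `refunded ${id}` }],
};

const supportAgent: Agent = {
  name: "support",
  instructions: "You answer product questions.",
  tools: [{ name: "searchDocs", run: async (q) => `docs about ${q}` }],
};

// In a real swarm, this routing decision is itself an LLM call made by the
// orchestrator agent; a regex stands in for it here.
function orchestrate(userMessage: string): Agent {
  return /refund|invoice/i.test(userMessage) ? billingAgent : supportAgent;
}
```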

[00:01:00]
This is really good, this is something you can use, and it's actually easy to follow if you look at it, but there are tons of things like this. Yeah, it's mostly just more of that. If there's only one thing you ever learn from this course, it's that evals matter.

[00:01:15]
You have to do them, and they shouldn't be treated like unit tests or like testing your other code. You actually can't build good products with AI without evals, you literally can't do it. In fact, Greg Brockman, one of the co-founders of OpenAI, tweeted this out on December 9, 2023: evals are surprisingly all you need.

[00:01:38]
[LAUGH] Surprisingly, really. If you just have some good evals, that's the difference between having a product that's great and a product that nobody can use. So I cannot emphasize enough how important evals are. Please spend time on evals if you're planning on ever having users.
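
As a rough illustration of how an eval differs from a unit test, here's a toy TypeScript harness (everything in it is hypothetical): it scores a whole dataset and reports a rate you track over time, rather than asserting one deterministic output.

```ts
// Toy eval harness (hypothetical): unlike a unit test, it runs a model over
// a graded dataset and reports a pass rate to track across releases,
// instead of asserting a single deterministic output.
type EvalCase = { input: string; check: (output: string) => boolean };

async function runEval(
  cases: EvalCase[],
  model: (input: string) => Promise<string>
): Promise<number> {
  let passed = 0;
  for (const c of cases) {
    const output = await model(c.input);
    if (c.check(output)) passed++;
  }
  return passed / cases.length; // e.g. 0.87 -- a score, not pass/fail
}
```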

[00:01:56]
It should not be an afterthought, and it's not like testing, it is part of the product. I talked about advanced RAG techniques a little bit earlier, like hybrid search, though we got rid of that. That was basically the filtering: being able to do metadata filtering plus vector searching. That improves search results so much.
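
Here's a minimal in-memory TypeScript sketch of that hybrid idea, with illustrative field names; real vector databases expose the same thing through filter options on the query.

```ts
// Minimal in-memory sketch of hybrid search: metadata filtering narrows the
// candidate set first, then vector similarity ranks what's left.
// Field names here are illustrative.
interface Doc {
  embedding: number[];
  meta: { type: string; year: number };
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] ** 2;
    nb += b[i] ** 2;
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function hybridSearch(docs: Doc[], query: number[], type: string, k = 5) {
  return docs
    .filter((d) => d.meta.type === type) // metadata filter first
    .map((d) => ({ doc: d, score: cosine(d.embedding, query) }))
    .sort((a, b) => b.score - a.score) // then rank by vector similarity
    .slice(0, k);
}
```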

[00:02:15]
But it also exponentially increases the complexity of how you need to test it. I haven't found a good way to test it yet, but when it works, it's great, it is really good. I think it's hard for a chat app; it's so much easier when the filtering can be selected through a GUI, where you have a search bar and a dropdown and you can do facet-based filtering.

[00:02:35]
Then you get the best of both worlds: you get to pick the filters and you get to do the search, and it's great every time. But trying to figure out filters from someone's chat message, it's usually either too strong or not strong enough. It's just not reliable enough to actually filter on, so it's really hard.

[00:02:52]
So I haven't found a good way to do that, but hybrid search is really great, highly recommend doing that. In fact, if you're gonna do RAG, you should always do hybrid to some degree, in my opinion, unless you're just chatting with a PDF. If you're chatting with multiple pieces of data, and not just one corpus of data that's chunked up,

[00:03:15]
but actual entities, for instance, songs, movies, emails, things like that, you probably wanna do metadata filtering, hybrid search. But if you're like, I just chunked up this one-gigabyte PDF and I just wanna chat with it, you probably don't need metadata filtering. What are you gonna filter on?
>> Student: It's already filtered for you.

[00:03:32]
>> Scott Moss: Right, exactly. So, context processing, I talked a little bit about this with the example from Anthropic. Contextual retrieval, different things like that, these are all just terms I put here that you can go Google. I figured I'd give you a bunch of terms you can type into Google and search yourselves, or ask the AI about.
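
As a sketch of the contextual retrieval idea, here's roughly what it looks like in TypeScript; `llm` is an assumed helper, not a real library API. Before embedding a chunk, a model writes a short blurb situating it in the full document, and you embed the blurb plus the chunk.

```ts
// Sketch of contextual retrieval: before embedding each chunk, an LLM writes
// a short blurb situating the chunk within the whole document, and you embed
// blurb + chunk together. `llm` is an assumed helper, not a real library API.
async function contextualizeChunk(
  document: string,
  chunk: string,
  llm: (prompt: string) => Promise<string>
): Promise<string> {
  const context = await llm(
    `Here is a document:\n${document}\n\n` +
      `In one or two sentences, situate this chunk within the document ` +
      `so it can be understood on its own:\n${chunk}`
  );
  return `${context}\n\n${chunk}`; // embed this instead of the raw chunk
}
```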

[00:03:53]
Because I looked at my notes from the last six months, and these are all the things I've been researching, implementing, and testing. So I put them here for people to go look up, they're all insane. Here are some of my favorite white papers that I read on RAG.

[00:04:13]
So: Self-RAG, context-faithful prompting, hypothetical document embeddings, RAG-Fusion, and there are so many more, you can go check them out. White papers, if you don't know what those are, are the scientific papers researchers release when they come up with a new approach to something. So check those out. Fine-tuning: there's just no way we're ever gonna do fine-tuning here.
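
For a taste, here's roughly what hypothetical document embeddings (HyDE) look like in TypeScript; `llm` and `embed` are assumed helper functions, not a specific library's API.

```ts
// Sketch of HyDE: instead of embedding the user's question, have the LLM
// write a plausible (possibly hallucinated) answer and embed that. Fake
// answers tend to sit closer to real answers in embedding space than
// questions do. `llm` and `embed` are assumed helpers.
async function hydeQueryEmbedding(
  question: string,
  llm: (prompt: string) => Promise<string>,
  embed: (text: string) => Promise<number[]>
): Promise<number[]> {
  const hypotheticalAnswer = await llm(
    `Write a short passage that would answer this question:\n${question}`
  );
  return embed(hypotheticalAnswer); // query the vector store with this
}
```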

[00:04:36]
Let's be real, you're gonna spend 20 bucks, we're gonna be here all day, and we don't have enough data. But again, fine-tuning is like: you got on a plane, you took an Uber, and then you hopped on a Lime scooter. That's fine-tuning. It's the last mile of tuning, and it's mostly good for changing the output of the LLM to make it sound

[00:05:00]
like something else, or look like something else, or, in the case of OpenAI, structured output, things like that. So maybe you wanna change the format of how something is outputted, not because you need it in a certain format, but because it might actually save you money. Let's say you're using GPT-4, and it's actually really good at what it's doing, but there are a lot of fluff words in there.

[00:05:20]
Those are a lot of tokens you're being charged for, so you can fine-tune it to not add those fluff words and still get good accuracy. And using the fine-tuned model, depending on which model you use, is usually cheaper than using the non-fine-tuned model.
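
A sketch of kicking that off with the OpenAI Node SDK; the training file path, the example data, and the base model name are placeholders.

```ts
// Sketch of starting a fine-tune for terser output with the OpenAI Node
// SDK. File path, training examples, and base model name are placeholders.
import fs from "node:fs";
import OpenAI from "openai";

const client = new OpenAI();

// train.jsonl -- one example per line pairing a prompt with the short
// answer you want, so the tuned model learns the terse format, e.g.:
// {"messages":[{"role":"user","content":"What port does HTTPS use?"},
//              {"role":"assistant","content":"443"}]}
const file = await client.files.create({
  file: fs.createReadStream("train.jsonl"),
  purpose: "fine-tune",
});

await client.fineTuning.jobs.create({
  training_file: file.id,
  model: "gpt-4o-mini-2024-07-18", // placeholder base model
});
```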

[00:05:34]
So what people eventually do, as they get off of a GPT model, is go to something like Llama, or something that's really big and open source, and they'll fine-tune it for their specific use case, and it's a fraction of the cost of what OpenAI is offering. Plus they can put it on their own infrastructure or their own cloud infrastructure.

[00:05:53]
And they have no limits other than whatever access to GPUs they have, so that's typically the strategy you'll see people use.
>> Student: So if you wanted to add domain knowledge to a model, you wouldn't do fine-tuning for that?
>> Scott Moss: No, fine-tuning is not for adding domain knowledge, the model is already tuned. You'd have to make a custom foundation model from scratch, and, I mean,

[00:06:16]
yeah, good luck, you'd need a couple hundred million, right? So yeah, not domain knowledge. It's just tuning the output; the layers that control the domain knowledge have already been tuned, and fine-tuning doesn't touch those layers. It's like touching the last few layers at the end of the neural net, not the hidden layers, which is what you would probably need for that.

[00:06:40]
So no, those weights and biases are already set. So yeah, it's mostly for output-related things, like how it looks, the shape of it, the tone. But not the knowledge. Knowledge is RAG, you've got to use RAG for that, unless you use some crazy model that has a huge token limit.

[00:07:00]
And then again, you have the needle-in-a-haystack problem, where the model can't find information in the middle of the context. You still spend a lot of money on tokens, since you're sending up your whole corpus of data, so RAG is the answer for that. Okay, let's talk about generative UI, schema-driven UI generation, we talked about this.

[00:07:16]
You can literally teach this thing how to do grids, flexboxes, whatever you want, and have it generated for you, so just think of the possibilities, go crazy. Here's an example using different interactions with different triggers, components, and layouts. You can feed this to the AI, and then the AI might have a system prompt of: build

[00:07:42]
the appropriate UI, depending on the answer you wanna give, given these things, and send back an array of that. And it'll do it, and then you just render it: you loop over the array and render each piece.
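
A minimal TypeScript sketch of that loop, with an illustrative two-component schema: the model's JSON array is parsed, then each node is mapped to markup.

```ts
// Sketch of schema-driven UI generation with a two-component schema
// (illustrative): the model returns an array of nodes matching the schema,
// and you loop over it and render each one.
type UINode =
  | { component: "text"; props: { value: string } }
  | { component: "grid"; props: { columns: number }; children: UINode[] };

function render(node: UINode): string {
  switch (node.component) {
    case "text":
      return `<p>${node.props.value}</p>`;
    case "grid":
      return (
        `<div style="display:grid;grid-template-columns:repeat(${node.props.columns},1fr)">` +
        node.children.map(render).join("") +
        `</div>`
      );
  }
}

// The AI's reply is just parsed and rendered:
const tree: UINode[] = JSON.parse(
  '[{"component":"text","props":{"value":"Hello"}}]'
);
console.log(tree.map(render).join(""));
```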
