Transcript from the "Scaling Semantic Search" Lesson
>> Okay, let's talk about, how do you deploy this? Okay, scaling semantic search, couple things. So let's talk about the volume of data. So as you start to get more data around what you need to index, handling that index just becomes incredibly difficult. So right now, we had an array of movies that we need to index.
[00:00:22] But what if the data which you need to index is not static, it changes? So you have to keep creating indexes and embeddings. And then how do you go back and update data that was already turned into a vector because it got updated in the original source? So keeping your vector database updated with the fresh data also with new data, that's complicated.
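One possible way to handle the "source got updated" problem described here is to store a hash of each document's content next to its vector and re-embed only what changed. This is a minimal sketch, not something shown in the lesson: `content_hash`, `plan_sync`, and the two dictionaries are hypothetical names, and a real setup would read the stored hashes from the vector database's metadata.

```python
import hashlib


def content_hash(text: str) -> str:
    """Fingerprint a document's text so changes can be detected cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(source_docs: dict, indexed_hashes: dict):
    """Compare the source of truth with what is already indexed.

    source_docs:    doc id -> current text in the original data source
    indexed_hashes: doc id -> hash stored when the doc was last embedded
    Returns (ids to embed and upsert, ids to delete from the index).
    """
    to_upsert = [
        doc_id
        for doc_id, text in source_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)  # new or changed
    ]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in source_docs]
    return to_upsert, to_delete
```

Only the ids in `to_upsert` need a fresh embedding call, which keeps cost and index time down when most of the data has not changed.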
[00:00:44] And then doing that at scale when things need to move fast, how do you ensure that people are querying against the latest data, cuz you have to account for index times and things like that? Also, because you might have more data to index, you run into things like rate limits on the embedding API call.
[00:01:02] You run into all different scenarios which you have to consider because it might not work the first time. You also run into, if you think about it like this, so I return back these results. These results don't have that much stuff in them. But as you add more, imagine this page content had, I don't know, 20,000 tokens in it, and you want to return 6 of these, that's a lot to have to deal with on a system.
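A common way to cope with rate limits on the embedding call is to send texts in batches and retry with exponential backoff. The sketch below assumes a few things that are not in the lesson: `embed_batch` is whatever function calls your embedding API, `RateLimitError` is a hypothetical stand-in for the error your client raises on a rate limit, and the batch size and delays are arbitrary defaults.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the error an embedding client raises when throttled."""


def embed_in_batches(texts, embed_batch, batch_size=100, max_retries=5,
                     base_delay=1.0, sleep=time.sleep):
    """Embed texts batch by batch, backing off when the API says to slow down.

    embed_batch: callable taking a list of strings, returning a list of vectors.
    sleep:       injectable so the wait can be faked in tests.
    """
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for attempt in range(max_retries + 1):
            try:
                vectors.extend(embed_batch(batch))
                break  # this batch succeeded, move on to the next one
            except RateLimitError:
                if attempt == max_retries:
                    raise  # give up after the final retry
                sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
    return vectors
```

Because a failed batch is retried as a whole, the returned vectors stay in the same order as the input texts.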
[00:01:34] So you have to think about that. And when we talk about document QA, you'll see that actually is a problem for the AI that it runs into. So got some problems there. The complexity of the queries. Sometimes the semantic search doesn't work the way that you would think because a person can type anything.
[00:01:51] So I might ask something that has nothing to do with what the index was built upon. So what do you get back? But you made it in a way where it always returns something, because it's just finding the closest thing. So let's say it does find the closest thing, but the closest thing is way over there because of the query that someone wrote. They're still gonna get back a response.
[00:02:11] They're gonna be like, what the hell is this? This is not what I was looking for. But that was the closest thing the index had, so that's what you got back. So it makes it feel like the search is broken. It's like this is literally, I asked for dogs, and it returns something about chocolate cakes.
[00:02:25] And you're like, how is that the closest thing? Well, I don't know. Somehow it was based off whatever query you wrote. You wrote, I don't know, dogs that like treats. So it returned chocolate cakes. I don't know, it could have done anything. So it could feel broken. Consistency is a big thing.
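One way to soften the "I asked for dogs and got chocolate cakes" problem is to drop results whose similarity score falls below a cutoff, so an unrelated query returns nothing instead of the nearest irrelevant thing. This is a rough sketch under assumptions: the store is a plain dictionary of vectors, and the `min_score` of 0.75 is an arbitrary value that would need tuning against your own embedding model and data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either one is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, docs, k=3, min_score=0.75):
    """Return up to k (doc_id, score) pairs, keeping only sufficiently close matches.

    docs: doc id -> embedding vector (assumed in-memory store for illustration).
    """
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in docs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(doc_id, score) for score, doc_id in scored[:k] if score >= min_score]
```

An empty list lets the UI say "no good matches" rather than show something that makes the search feel broken.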
[00:02:43] Computational cost. Yeah, having to embed, that costs money. And then all the problems around embedding, like I said. And then having to make those search queries, that costs money, all that stuff costs money. Storage overhead, I mean, you're basically duplicating data, right? You're saving stuff in a vector database.
[00:03:01] You're also saving it in its source in another database somewhere. You probably also have a cache somewhere, like a Redis. You have the same data in three or four different places. Good luck, [LAUGH] and then, yeah, trying to make retrieval faster. Right now, it was pretty fast, but it normally is not gonna be that fast, because this is more data.
[00:03:22] There's more things to go through, there's more embeddings in your database, so there's more vectors to consider. So it can go pretty slow, so you have to start thinking about, how do I batch these things? And then scaling is always a concern no matter what you're building. And then, yeah, so optimizing for speed and accuracy.
[00:03:43] This is what most people are trying to do right now. So they're trying to tune their things, so fine tuning it on semantic things. So there's so many ways to do this. One of the ways you might do it is a human-in-the-loop approach, which is a type of reinforcement learning.
[00:03:59] You've probably done this on ChatGPT, where it's like, if you get back a response, you can put a thumbs up or thumbs down. That's reinforcement learning, that's called human in the loop. There's a human involved that provides some indication on whether or not this was the right thing.
[00:04:13] That correction then goes back into the model, in which over time, it adjusts its weights. So therefore, it's being tuned in real time. So that might be something you have to do based off the results of your search. Like, type in something, get back a result. Hey, was this the thing that you wanted?
[00:04:28] Or maybe you can imply that if they click on the result, they type in something, you give them back a list of results. They click on the thing, I guess that's the thing that they wanted, cool, thumbs up. Tell the model that that was right. Or they didn't click on something, they type in something again, thumbs down, that didn't work.
[00:04:42] Or you manually ask them, thumbs up, thumbs down. So that's one way to do reinforcement learning. Yeah, there's a [LAUGH] lot of ways to try to figure that out, and then take those learnings and maybe even fine tune. If you realize that 90% of your searches are about people trying to find recipes, okay, fine tune your model on recipes and go from there, and don't worry about the other 10%.
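The implied feedback described above (a click counts as a thumbs up, searching again without clicking counts as a thumbs down) could be collected roughly like this. It is only a sketch of the labeling step: `SearchSession` and `label_sessions` are hypothetical names, and how the labels then feed into tuning is left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SearchSession:
    """One search: what was typed, what was shown, and what (if anything) was clicked."""
    query: str
    result_ids: List[str]
    clicked_id: Optional[str] = None


def label_sessions(sessions):
    """Turn search sessions into (query, result_id, label) rows.

    A clicked result is labeled 1 (implied thumbs up). If nothing was clicked,
    every shown result is labeled 0 (implied thumbs down).
    """
    rows = []
    for session in sessions:
        if session.clicked_id is not None:
            rows.append((session.query, session.clicked_id, 1))
        else:
            rows.extend((session.query, rid, 0) for rid in session.result_ids)
    return rows
```

Rows like these can be reviewed by hand or aggregated to spot which kinds of queries the search handles badly.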
[00:05:05] So you got caching. Here's the feedback loop, that's what I was talking about. Yeah, and then, of course, well, I guess you don't really have to worry about this if you're using OpenAI. But if you have an open source model, yeah, optimizing your bare metal GPUs and all the things around that, that's where all your money is gonna go, for sure.
[00:05:25] But almost no one needs access to GPUs right now unless you're building a foundational model or you're training something. Unless you're a Stability AI or Anthropic, there are some startups out there that need access to GPUs. But for the most part, people are just using APIs like we're doing today.
[00:05:43] Cool, any questions on that?
>> Are there things to think about in terms of if you have a broad data set, you wanna kind of filter that down to certain aspects and build separate vector data stores based on the context of the questions, or does that make sense?
>> It does, so a lot of vector stores have that concept. They all call them something different, but you can group indexes based off certain things. Maybe it's a multi-tenant system, so it's grouped based off customer IDs. Maybe it's grouped based off of whatever you want. So you can group those indexes like that, so that's one way.
[00:06:30] That way you're not querying the whole database to find something specific. They have that. And then I think they also have another level of grouping, but it really depends on the database. And I don't know why I can't think of the number one, let's find it. I'm just gonna type in vector database, cuz I want you to see this, Pinecone.
[00:06:48] Wow, why can I not think of that? Insane, so Pinecone has, let's go see if they talk about it here somewhere. Come on, show me the thing. Okay, do they have docs? Okay, here we go. They have the ability to do, let's go manage indexes, to separate it like that.
[00:07:17] So let me see. So yeah, you can create an index, you can give it a name. And then, Collections, they call them collections. So you can have collections within indexes, which you can separate even further. So an index might be for a customer, and then for that customer, you might have different collections for different types of data on what you need to search on, but every database is different in how they do this.
[00:07:41] So that's just one example of it, yeah.
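The grouping idea from this answer can be pictured with a toy in-memory store where every vector lives under a partition key such as a customer ID, and a query only scans that one partition. This is an illustration under assumptions, not any real database's API: `PartitionedStore` is a hypothetical class, it scores by plain dot product, and the name each vector database gives to its partitions differs.

```python
from collections import defaultdict


class PartitionedStore:
    """Toy vector store that keeps each partition (e.g. a customer ID) separate."""

    def __init__(self):
        # partition key -> {doc id -> vector}
        self._partitions = defaultdict(dict)

    def upsert(self, partition, doc_id, vector):
        """Insert or overwrite a vector inside one partition."""
        self._partitions[partition][doc_id] = vector

    def query(self, partition, query_vec, k=3):
        """Return the ids of the k best matches, scanning only the given partition."""
        candidates = self._partitions.get(partition, {})
        ranked = sorted(
            candidates.items(),
            key=lambda item: -sum(q * v for q, v in zip(query_vec, item[1])),
        )
        return [doc_id for doc_id, _ in ranked[:k]]
```

A query for one customer never touches another customer's vectors, which both narrows the search and keeps tenants' data apart.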