Transcript from the "Creating a Semantic Search" Lesson
>> We left off talking about semantic search, what it is, how embeddings and vectors work, gave some examples on that. Like I said, this is not something that you need to know, but it is definitely worth exploring a little bit if you have the time. It's pretty cool how this math works and seeing how it's basically the language behind systems like these AI models understand data.
[00:00:28] It's cool for you to be able to speak that language as well, so. But there's enough here for us to continue to move forward, and now I want to actually create a semantic search experience. So let's write some code. So what we're gonna make here is a movie recommendation semantic search.
[00:00:46] So basically, there's gonna be a list of movies in which this search engine is going to be indexed on. And you can type in the type of movie that you want to watch, and it will recommend a movie for you. So, a regular search engine might require you to type in a matching title or any other metadata associated with that movie, like a director, an author, a category, a year that it was released.
[00:01:14] So if you knew some of these things you could find the movie or may be they were smart enough to have the descriptions of the movies written out. So therefore if you type some word that is in that body of text description, it might find it. It almost feels semantic.
[00:01:29] But what if you wanted to describe in your own words the type of movie you wanted to watch, and none of the descriptions in the movies or index had those words exactly, but they had similar meaning? Okay, this is where semantics search will come in so it can help us find things that are similar in context and feeling verses just being lthe string matches the string, right?
[00:01:54] So, Yeah, I mean I have a paragraph here describing how this will work in a traditional search but here is an example. So, for instance, if we were looking for a movie that was kinda Inception, right? If you had a traditional search, you can type in Inception and it would just return all the titles of the movies that have inception in them, literally the word.
[00:02:15] But with a semantic search, we can type in inception. And it would understand the deeper themes and vibes of inception. So it would know that dream, manipulation, layered realities, a heist mind-bending plots, it would know all that, and it could recommend movies that have the same semantics as inception, and not just the title itself.
[00:02:38] Does that make sense? Okay, so you might get back some results if you typed in inception. You might get back The Matrix because it deals with alternate realities. You might get back Shutter Island. It's a mind-bending plot. You get it. So some of the benefit of this is that it's just more in line with expectations of what someone is thinking about when I type something in like that.
[00:03:06] It would suck to type in, I wanna see a really funny horror movie. And then you just get back really scary horror movies. [LAUGH] That's not what you were looking for. So this help you understand that. It's great for discoverability because now because everything is semantically related, it helps you discover things a lot better.
[00:03:25] So semantic search is actually great for a discover engine, in my opinion. It's one of the best use cases. Yeah, I just make sure user's happy. So let's get started with that. I'm gonna make a new file here, we'll call it Search like this. We're gonna import our OpenAI from the file that we created.
[00:03:47] If you didn't create the file, then you just have to make a new open, OpenAI instance and do the API key and all that stuff. So it should be good there. I don't have to do .EMV because it's already done here. So we're good there. And then we need to install something.
[00:04:04] So I talked a little bit about LangChain, we're actually gonna use it now. So we're gonna install this package called LangChain, which, again, is like an SDK. That's like a wrapper for LLM. If you wanna build LLM into your application, I would say today, LangChain is probably the most sophisticated library for that.
[00:04:24] And they handle a lot of different things. Now I think a lot of people are finding that they probably would do a lot of this stuff natively when it comes to like interacting with the LMSs themselves. Because like I was saying earlier on the break that I think l chain is helpful but then eventually you get to the point where you have to learn how L chain does things and that mechanism and how to learn it is.
[00:04:50] It's probably worth just learning how to do it without LangChain at that point. It's very specific. But there are things that are very obvious without a lot of documentation, and we're gonna use those obvious things. So, in this case, let's install LangChain. And they have a Python library.
[00:05:28] So, we've talked about two of these things already. So, for a document, let's see, that's from LangChain/, where is that? Document, I guess that makes sense. And then for the memory vectors, it's not gonna auto-complete it for me. I thought it was. That's gonna be from LangChain, the vector stores, and then, yeah.
[00:05:56] So for that, it's going to be LangChain/vectorstores/memory. I don't know why it's doing that, and then we can get the MemoryVectorStore. It's exactly what it sounds like. It's an in-memory vectorstore. Otherwise, we have to either connect to something hosted. It's a database. You either gotta go find a hosted one to connect to, download one, or use Docker and spin up one.
[00:06:17] We're not doing all that, we're just gonna do it in memory. It's way easier. We don't need to persist this across anything. But obviously, in production, you would not do this in memory, right? And then OpenAI embeddings. So this is the thing that's gonna create the embeddings for us using OpenAI.
[00:06:32] So up until this point, we've been using OpenAI just to do completions, which is like give me an answer to this question or chat. But OpenAI also has an embeddings endpoint where you can send it data and it will return the embeddings for it. So that's very convenient.
[00:06:51] It also costs money per token. So we will be doing the OpenAI embedding. And then from there you just wanna copy this movie data. This is the data that we'll be indexing. So I have it all here for you to copy it. Okay, it's just literally an array of movies.
[00:07:12] My mouse is, okay, calm down. There we go. There we go. So, we have our movies here. Like I said, it's just an array of objects. It's just the idea, title, description of some movies. Feel free to add some more movies here. Take away some movies. There's no right or wrong here.
[00:07:25] This is just some stuff saving you some time from writing things, but that's an array of movies. You get the point. And then basically what we wanna do is first we need to create a function that creates our vector store. We have to create it with that index of movies.
[00:07:45] And there's some things to think about to consider when doing that, and I'll talk about them a little bit. I'll get into more of it when I talk about scaling it, but let's do that. So let's make a function called createStore. And it's just going to return the memory vector store from documents.
[00:08:07] So we'll just say, MemoryVectorStore.fromDocuments, and there's different ways. There's from Text, fromExistingIndex, fromDocuments. We're gonna say fromDocuments, and then it takes an array of documents that you wanna create. So we have an array of movies here. What we need to do is we need to convert these movies over to these LangChain documents.
[00:08:27] A LangChain document is just an object in a specific format that all of LangChains vector stores are expecting. That's the sweet thing about LangChain. It does normalize how you can interact with vector. It doesn't matter what store you use as long as you use the LangChain version of it.
[00:08:43] They all take the same input, and that input as a document. So that's the beautiful thing about that, as you can change out your stores without having to really change your code as long as you always use documents. So we need to convert these objects into documents. So let's do that.
[00:08:58] So inside of here, I'm just going to just say, movies.map, map over the movie. I'm gonna get one movie here. And then what we wanna do is for each one movie, we wanna create a document. A document basically has like two properties on it. One is the page content.
[00:09:15] So you can think of this, this is the content that I wanna turn into a vector, basically. This is the stuff that I want to turn into a vector and put inside of a multi-dimensional space so people can do queries around it. So whatever about this movie, you want to be indexed.
[00:09:35] This is what you wanna put in page content, right? And then metadata is like all the stuff associated with this entity. That way when you show the search results, you can point to this metadata as the source. This is how I figured this out. This is helpful for like imagine if each one of these movie titles had a link that you can click on that will take you there.
[00:09:55] So I can put the link in here. So when I return the search results, I can show you the link to go to you can click on it, right things like that, or any other things associated with it. So it's like it's like the structured content of the non-structured content that will be indexed, right?
[00:10:09] So page content is usually non-structured. And metadata is usually the structured one that's not indexed, that's helpful after you find the results, okay? So let's do that. So we have one movie. Yes?
>> Can the metadata be used in your initial query? Say like, give me everything that matches this metadata and then do semantic beyond that or.
>> Yeah, so it depends on the memory or on the vector store. Some vector stores allow you to do what's called filtering. So you can filter the documents based off the metadata and then do the semantic search on the page content. So it's great for that. So you won't need the AI for the metadata filter.
[00:10:47] Yeah, but yeah, that depends on the store. Good question. Yeah, that's actually what people are starting to do now. I think, like Algolia AI. I think that's what they're doing now. Yeah, with their neural search, which is a combination of that. It's a combination of keywords and AI at the same time or something like that.
[00:11:09] I haven't tried it out. Okay cool. So we got that, we wanna return a new document. We got some page content here. And I mean, there's really no right or wrong way. It's just unstructured content. I want people to look at the title. So I'll say Title, and I'll put the movie title in here like that.
[00:11:29] And then I'll make a new line and I'll just throw the description in here. Like that. That's my page content. Leave whatever you want, just unstructured content that you want people to be able to semantically search on. And then metadata, you just literally store the whole movie in here if you want.
[00:11:51] So I'll just say I'll make one call like source so here's the idea of it. Like that. And then I mean, you could really just put the whole movie in here if you want it. There really is no wrong answer of how you wanna do that. Let me go back to what I had there.
[00:12:05] I think yeah, I just put like the title in here. That way I can quickly show you what that is. If I was making a UI, it might be helpful. Just be like, yeah, here's the title of it. Something like that. Okay, cool, any questions on that? And there are some implications of indexing this and we'll talk about it in the QA part, the document QA, but it's not always gonna be this simple as like I can just make documents and throw it in here.
[00:12:33] Cuz there's like context limits, and sometimes you can't index things that are big. Yes?
>> Just like we converted the objects into LangChain documents here, can we do the same with a CSV file with numerical data? For example, movie data, movie runtime, amount of views, and most time viewed.
>> 100%, not only can you do that because it doesn't care where you get the data from, langchain was smart enough to just make that for you. So, if you go to their docs and you go to Retrieval, you go to Document loaders, you go to Integrations, you go to File Loaders.
[00:13:08] There's a CSV file loader here that literally takes a path to a CSV file, loads it up, and creates a list of documents out of it for you. So yeah, you can do it manually, you can use their loader. This is why I said this is the part of LangChain that I like cuz it's very obvious and it has absolutely nothing to do with, it's not specific to anything.
[00:13:29] It just loads up a file and makes this array. That's great. So I'll use that. And I like using their interface with the vector stores because it works very well with this. Everything else it's a little difficult in my opinion. Cool, any other questions? Okay, and then the last thing we have to pass as a second argument from documents functions, we have to pass in the embeddings function that we want the vector store to use to convert all of these into embeddings for us.
[00:14:05] It's gonna do that for us, right? It's gonna make the API call to OpenAI for us and convert all these documents into embeddings. We just have to tell it. It's agnostic. It's not stuck on OpenAI's embedding API. It can use anything that follows a certain structure. So we're gonna use the one that they allow us to use, which is the OpenAI embedding.
[00:14:27] So we've seen new OpenAI embeddings like that. This is expecting you to have your API key and the environment variable. If you don't, you need to pass in your API key here. Like that. But I have my environment variables so we're good. So without this, it cannot create embeddings.
[00:14:47] Cool, any questions on that? There are also issues with this. There's rate limiting. You can only create so many bettings at one time. So there are a lot of issues associated with this that are very nuanced. So we'll talk about that later