Transcript from the "Bloom Filters" Lesson
>> Let's do the last exercise in this course before I release you and your brain from this vise of knowledge that I'm throwing at you, this curse of knowledge. Let's talk about Bloom filters, which is just like a fun thing to say, it sounds like it's a really happy data structure, actually think it's named after the guy that invented it.
[00:00:26] They normally are, I can't remember why it's actually called a Bloom filter, but here we are. Bloom filters are something that you actually do see quite a bit out in the wild today. There's actually a really great article here on medium about medium itself that I dropped in here if you wanna take a look at it.
[00:00:44] And I also put the Wikipedia article in here so you can see, here's a bunch of places where people are using Bloom filters. So what is a Bloom filter? The Bloom filter is a data structure that was designed so that we could ask it questions about extremely large data sets and it could give us an answer of no or maybe, right?
[00:01:07] So it's a way that you design a set. And so when I say a set I mean I can add things to a set and then I can later go back and ask the set do you have this in your set? And we actually did this already with graphs, so let's just hop back over to graphs so I can show you concretely what I'm talking about.
[00:01:32] Here I have this set called seen, right? And this is a set. And then later I ask scene hey, have you seen this connection before, and it says yes or no, right? This has been added to my set, or this has not been added to my set. This is 100% accurate, right?
[00:01:50] If you ask it have you seen number 32, it will give you the definitive answer of yes I have seen 32, or no I have not seen 32 before and it will be 100% correct always. A Bloom filter, well, so the downside of this particular set implementation though is that you must store everything in the set, right?
[00:02:09] Everything that I add to the set has to be stored somewhere, so if I add 100 million things to the set, it has to store 100 million things, right? This is where Bloom filters can be kind of interesting, storing 100 million things doesn't sound like fun, what if I could store way less things and get most of the same advantage?
[00:02:30] That's what a bloom filter is. A Bloom filter, you can add things to it and it does kind of a little bit of a trick to say, hey, I might have seen this before, or no, I have definitely never seen this before. So by trading off that accuracy, you can drastically reduce how much storage you're using, so let's talk about when that would be.
[00:02:54] Normally we don't want a margin of error with our data structures, right? That's not something that's desirable about a data structure, however, using a lot less space can be. So, in this medium article right here that I'd linked you to, they talk about their recommendation engine. In the recommendation engine, you scroll down to the bottom of your article and they recommend you 3 more articles that you might like to read.
[00:03:19] Medium doesn't want to ask you, hey, you should read this for something that you've already read before. However, it's impractical for medium to store for every user every single time how many things that they have read that are related to something. That would be an enormous data set given how many users medium has.
[00:03:37] So, what they have done instead is they use a Bloom filter to say, hey, has this user ever read this article before? And the answer is either no, they have never read this article before, or maybe, they might have read this article before. And the advantage here is that, if they don't recommend something that you haven't read before, that's okay, right?
[00:03:57] It's okay for them to just say, here's 3 things that we definitely know you have never read. And so they're okay with that trade off of, we're willing to not recommend articles that they haven't read before to trade off in favor of much smaller storage. And we can always show them things that they had never read before.
[00:04:17] Is that kind of make sense? It's just, they have this acceptable margin of error. And you might ask how big is the margin of error? And then, it depends on how much space you're willing to take. If you're willing to take a lot of space you can get that margin of error down to like 1% or 0.1% or something like that.
[00:04:37] If you're not willing to do that, it'll get up to 50% margin of error it depends on how big you make the data structure. So, let's say that we have a Bloom filter with 10 elements in it, right? The way that we're gonna model a Bloom filter is we're gonna have an array of 0s and 1s.
[00:05:00] A 1 means that it has been seen before, and a 0 means that it has not been seen before. What I'm gonna do is I'm gonna run whatever string that I'm trying to put in the array. So let's say I'm trying to put Brian in my set, right?
[00:05:15] In my Bloom filter. I'm gonna run it through 2 to 5 hashing functions, all right? Even 8, right? You can as many hashing functions as you wanna do. The more hashing functions you use, the more space you're gonna use, but also yeah, it just depends on kind of the margin of error that you wanna do there.
[00:05:41] So let's say I'm doing it through 3, I'm gonna run Brian through 3 different hashing functions. And a hashing function if you're not familiar with something, just takes a number or takes a string and converts it to a number, right? So in this particular case, I'll run it through 3 different hashing functions.
[00:05:55] Let's say one of them gives me 2, one gives me 5, and one gives me 8. I'll go back to my data structure here and I will flip the two elements to 1, the five elements to 1, and the eight element to 1, okay? Now if I go back later and say hey, have you seen Brian before?
[00:06:15] I will run it through the same hashing functions, which will definitely give me back 2, 5, and 8 again, right? And I'll look at 2, 5, and 8 here and it will say maybe, I might have seen Brian before, in which case I wouldn't use it, right? Now, if I go back and ask later have I seen Sarah before?
[00:06:35] Let's say that it gives me 2, 2, and 4, so I go to 0, 1, 2, this one has been seen before, right? So there's that, and it goes to 4 which is this one, which is a 0, because 0 hasn't been flipped before, it definitely has never seen Sarah before, for sure, right?
[00:06:53] So if it encounters any 0s here you can go ahead and say, I've never seen Sarah before, you can safely use that one Okay, so let's say that we add one more item to the array which is Simona, which gives me back back 0, 4, and 5, if we went through hashing functions.
[00:07:15] So now I erase this. Now if I go back and ask Sarah again, if you remember that was 2, 2, 4, so 2, 2, and 4. Despite the fact that it has never seen Sarah before, it's gonna say yeah, maybe I might have seen Sarah before and in this case it would be a false positive, right?
[00:07:31] It would be coming back to you and say, I might have seen this before when in reality it hadn't seen it before. That's the entire gist of Bloom filters right there. Instead I could be storing Sarah, Simona, and Brian in the set, and I could get a 100% accuracy.
[00:07:48] But with this, even if I add 100 elements, I still need to take 10 items in the array, if I had a 100 elements, then I'm definitely gonna get false positives every single time, right? But I would still only take 10 items in the array, right? I wouldn't have to store 100 things.
[00:08:09] Okay. So, the more items you add to a Bloom filter, the more you increase the false positive rate that should be pretty obvious. You can mitigate this by having a larger array. You can also have more or less hashing functions depending on how much memory you're willing to fill up versus how many false positive you wanna be, right?
[00:08:31] So let's say I added one more hashing function, and that returned 9 for Sarah, right? In that case, I wouldn't have gotten a false positive. So the more hashing functions you're willing to add as well means you're likely gonna get less false positives. Not totally true, because whenever you add things as well, yeah, so you kind of have to mess around with it to kind of really figure out what your false positives rate.
[00:08:58] You get really mathematical with the false positive rate, so yeah, definitely just check out the Wikipedia they explain it pretty well there.