Four Semesters of Computer Science in 5 Hours, Part 2 Bloom Filters Diagramed
This course has been updated! We now recommend you take the Complete Intro to Computer Science course.
Transcript from the "Bloom Filters Diagramed" Lesson
>> Brian Holt: So.
>> Brian Holt: Picture for me if you will, basically I have an array of ten, if this will work, there we go. Nope, I wanna draw.
>> Brian Holt: So I have an array here. And I gonna give it,
>> Brian Holt: Four, five, six, seven, eight, nine, ten. Ten 0s, right?
[00:00:37] You can think of the 0s as representing false, right? That nothing has been written to that particular space, right? And then I'm gonna make a Bloom filter out of this. And the way I'm going to do that is I'm gonna have two different hashing functions. That number of hashing functions is gonna vary but for the sake of simplicity, right now, we're just gonna do two.
[00:00:58] Typically, you would have more. I'm gonna run whatever think it's added to this particular Bloom filter through both hashing functions. And the key here is the hashing functions are different, right? They're not the same hashing function otherwise it would not be much point to this particular algorithm, or data structure rather.
[00:01:18] So through you from part one, there's parts about hashing in part one, so check it out if you want more explanation. But to give you the 30 second version, it takes some item. So you give it some sort of output, and then it gives you, some sort of numerical representation of whatever that is, right?
[00:01:38] So if I give it some string and I pass in Brian into this hashing function it's gonna give me some index between 0 and 9 right? So it's gonna tell me the string Brian is stored always in index 9, right? Something to that effect. So same thing with how hashing tables work but useful for lots of things.
[00:02:00] We'll just leave it at that.
>> Brian Holt: So let's say that I'm adding my name to this, so I'm gonna add right here. Gonna say add Brian, the string. And I'm gonna run this through two hashing functions. The hashing functions are going to give me two different indices of where to put this, right?
[00:02:26] So let's pretend one of them gives me h1 equals index 1 and h2 gives me index 4, right? So what I'm going to do is I'm going to go here, and I'm going to change this one to a one. And that's two, three, four. So I'm gonna change these two ones to be ones.
[00:02:50] You follow so far? Cool. So now if I go back right after this, after I've added Brian. That I'm gonna go check to say, is Brian in this particular set, right? Or in this particular Bloom filter rather.
>> Brian Holt: So I'm gonna run it through these two hashing functions again, and I'm gonna get the same answers, right?
[00:03:11] Because that's one of the big points of hashing functions is, given the same string it's always gonna give you back the same indices. So I'm gonna go check Index one and index four, are those both ones, right? They are so its going to say maybe, its maybe in the set.
[00:03:34] That is the answer you get back from it. Now if I go and run
>> Brian Holt: Let's say I run something like Sarah.
>> Brian Holt: Let's pretend this gives me back h1 is gonna be equal to 0 and h2 is equal to 4. So now I'm gonna go check at index 0 which is a 0, right?
[00:04:04] And I'm gonna see there that that has not been flipped. So I can say this is definitely not the set, right? For sure 100% 9 if that first index is not flipped or not flipped but rather it's not a 1, it's not true. The Sarah is definitely not in the set.
[00:04:20] Does that make sense? So, hopefully you are starting to get the picture here, I'm actually going to be able to store a huge amount of information in this. It's going to be very fast because hashing functions are fast. It's very compact in memory, I'm only taking up 10 different things here.
[00:04:44] Let's say that I end up adding Sarah to this afterwards, this one's one and this one's one. Let's say that I do that and I get. And I add, later I'm gonna go check to see if let's say Shirley.
>> Brian Holt: Sorry, my handwriting on this thing is not great.
[00:05:13] All right, that's a Shirley, I promise. Let's say that h1 of this is 0 and h2 of this is equal to 1. I'm gonna check here in my particular array and guess what? Both of those are flipped to 1s. They are ones in the Bloom filter, so that's why the answer is maybe.
[00:05:39] It might be in the set, right? Because we actually never did add Shirley to our Bloom filter, but we got a maybe back from her because two other ones basically collided with it, right? So going back to our medium example, this would be some article that they actually haven't read, right?
[00:06:00] But it collides with other articles that have been added to those Bloom filters. So what happens is well, we're not gonna show you that article we're gonna go try and find another one, right? So this is the tradeoff that you have to be willing to accept, that's the point of the Bloom filter.