Transcript from the "Bloom Filter" Lesson

[00:00:00]

>> Okay, so we do have time for our super bonus algorithm. I'm gonna do it. Okay, it only takes like five minutes. Okay, this is my favorite one. This is the one I'm most excited about. This is the one I really wanted to do. I have never actually tried to do this and I've only ever seen it.

[00:00:14] I've never actually used it and I always wanted to do it. It's called a Bloom filter. So if you don't know what a Bloom filter is, it's pretty exciting, all right? It's a very fast way to tell if something is in or not in a set, meaning that, I know, that's kind of a weird way to do it, because it's like, why not in?

[00:00:34] How it works, effectively, is that you have a series of hash functions. You have, say, three hash functions, and you run through the hash functions, marking into a set. We'll talk about that here in a second. And what a Bloom filter can do is it can say, yes, something could be in this set, maybe.

[00:00:54] We're not 100% sure. But it can say with 100% certainty if something's not in the set. And so Akamai uses this for what they call one-hit wonders. They use a more complicated version, but it's the same principle, which is hey, this one website got loaded. Do we need to store this into CDN?

[00:01:11] Probably not right now, but what happen if we're gonna see that site a bunch of times? Then we're gonna want to see it. And we're gonna want to put that into a special place where we can have it highly cached and it moves really, really quick. So it has the chance of false positives, but it has no chance of false negatives.

[00:01:25] If it's not in the Bloom filter, it's not in the Bloom filter at all. So let's go over what that means really, really quickly. It's pretty simple, let's pretend we have one byte of data. Right, so this is one byte. All right, I have eight bits, one byte, and we have a hash function that's pretty easy.

[00:01:44] If I provide a value, I'm gonna use the value itself. So that's one function is just to return the value modulo the length. We're gonna do another one where I square the value modulo the length. And another one where I double the value and then modulo the length.

[00:02:02] So, what is that gonna look like? I wish I could do some bigger math, but I can't put like 69 in there because I don't know how to square 69 in my head, and then modulo the length. Way too difficult, completely out of my powers, and so, and I'm also gonna just use 10 instead.

[00:02:15] Let's not, because 10 is just way easier to do modulo math on, because you just use the last number, and there you go, fantastic. All right, so let's pretend we're gonna store the value 10. All right, so what is 10 modulo length? 0, right? So I just assume this is 0, this over here is 10.

[00:02:36] So therefore I'm gonna put a 1 right here. All right, then we're going to do the next function, which is, say, that's 100 modulo the length. That's a 1 again. All right, then we're going to double it, which is 20, which is 1 again. Okay, we have a pretty crappy hashing algorithm, but let's keep on going.

[00:02:53] All right, let's do a 7. All right, 7 length, we're going to go over to 7, right? Not a 7. I wrote a 7. 7 modulo length is 7 right here. 7 squared is 49 modulo length is going to be 9, which is gonna be this answer right here.

[00:03:13] Wait, hold on, that means 7's right here? There we go. I'm really good at this. All right, and then the last one. Double, which is 14, which means we're gonna have the 14th one. Zero, one two, three, four. All right, there we go. So that is our Bloom filter for seven.

[00:03:27] So now let's say, does eight exist within this set? Let's find out. So what we'll do an 8. So, 8 modulo the length is 8 right here. Does it exist in the set? No, it does not. We can just fast bail right out of it. We don't even have to think about it.

[00:03:42] It does not equal this set. All right, let's do, how about 9? We'll do a 9, a quick 9 check. All right, 9 modulo length. Okay, maybe nine's in here. Let's do it again. What's 9 squared? 81, right here. It's not in here. So we know right away that 9 was not stored within this filter.

[00:04:03] So the thing about a Bloom filter is you don't remove values. Because you can't remove values because there's some level of overlapping at any one point. And so that's kind of an important thing. So if we were to put 5 in here, we would go over here and we'd set 5 to 1.

[00:04:17] We would then square it, again, it gets written right here. And then we double it, which is a 0. And then we'd go right here. And so you could imagine that if we try to remove 5, we would remove this as a 0 and this as a 0, which would also remove our value 10 out of here.

[00:04:34] So there you go. That's the basic principle of a Bloom filter. So here's where it falls apart. Let's say we do 20. All right, 20. What's the modulo of 20?

>> 0.

>> 0, okay, could be in here. What's the square of 20?

>> 40.

>> 400.

>> 400.

[00:04:51]

>> 0 again, okay. Times it by 2.

>> 40.

>> 40. Zero, okay. It could be in the set. Is it in the set? No, but it could be in the set. And so you use this if you have a really expensive operation, if this operation takes like seconds of time.

[00:05:09] But you could have a Bloom filter that says like a megabyte of ones and zeros, and you can just flip a bunch of modulos in there. And to do the hashing function takes like one millisecond. Of course, you're gonna wanna do this. Because you may only have to do this every now and then.

[00:05:22] And you just only want to do it if it could be in the set. If it's not in the set, just don't even do it. And so Bloom filters allow for super fast maybes and maybes are just way, way, way important. And so you can do a lot of super cool stuff.

[00:05:38] This is probably one of the most practical algorithms that we've shown today. Just because you can actually do something with this out in the real world pretty easily. You can think of many expensive operations in which the key, you don't need to do all of it. Again, this is how Chrome used to do stuff for malicious websites, is that they figured out malicious websites, but to check if that website's malicious, they'd have to go make a call somewhere else to some sort of service.

[00:06:00] So instead they had a basic Bloom filter that could take in any website and go, could you be malicious? Based on our hashing functions, you could be malicious. But at this point the internet's gotten so big and there's so much maliciousness, t hey now have a different algorithm.

[00:06:12] But at one point when times were a simpler and people only cheated the old-fashioned way, then it was easier to use something that could be checked really quickly and then verified against a slower service so you wouldn't have a bad internet experience. Because if you go to Google.com, you'd expect it not to intersect.

[00:06:27] You'd want it to go super fast so you don't have to wait for a network call before you can even look up the web page. I think we went pretty dang fast, question.

>> When you say maybes are very important, do you mean like maybe is in like the monad, maybe?

[00:06:41]

>> No, no. What I mean by that is it's maybe in the set. We don't know if it is in the set, but it looks like it's in the set. Just like how we had a 10 put into this set, but a 20 looks like it also belongs in the set.

[00:06:53] It may be a part of this set. It might not be, we don't know, but it definitely is potentially in the set. Whereas a 9 you look it up and you go, it's not in the set. It's just not even there. We know right away. Don't even worry about 9.

[00:07:07] And so the maybe question is really important. Not like the monad, okay, monad's mentioned, people are just loving it. People love monads.