This course has been updated! We now recommend you take the Complete Intro to Computer Science course.
Transcript from the "Bloom Filter Caveats" Lesson
[00:00:01]
>> Brian Holt: So let's talk about some other properties of Bloom filters.
>> Brian Holt: As you may have guessed, you can't remove anything from a Bloom filter, right? Because if I go and try and remove something from a Bloom filter something else might have flipped that one as well. So you might actually be removing multiple things from your Bloom filter which is probably not what you wanna do.
[00:00:26] So, that's a property of a Bloom filter is you can never remove anything that you've added to a Bloom filter. The other thing that you have to be careful of this is when you'RE allocating a Bloom filter 10 indices is not nearly enough for a Bloom filter, right?
[00:00:43] Usually it's hundreds or a thousand or something to that effect, right? Because you can't expand a Bloom filter, right? Without keeping track of everything that's already been added to it which kind of defeats the purpose, right? So make sure you're allocating enough memory upfront for however many things that you're going to do.
[00:01:06]
>> Brian Holt: Yeah, anyway. So that's another thing you want to be keeping of track of is, how many things that, or make sure it's big enough. More hashing functions, right? So as you might imagine if I run something through three or four hashing functions I'm gonna get less false negatives, right?
[00:01:27] Because it's gonna be spread out over multiple parts of the array, but the tradeoff in doing that means that you probably need a bigger array because you have to able to store more things. So you have to decide what your tolerance for false negatives is too. There's complicated math that will tell you the percentages and things like that.
[00:01:43] And it's like, I'm not going to teach you how to do that cuz that would imply that I know how to do that. [LAUGH] Suffice to say, there are plenty calculators online that do it.
>> Speaker 2: So, you're saying the words, false negatives, and on the slide I'm seeing the phrase, false positives.
[00:01:59]
>> Brian Holt: Yeah, that's probably me just mixing up my terminology. So, yes false positives. Cuz the answers are no and maybe so it would be a false positive, not a false negative. Thank you.
>> Brian Holt: This is another good thing to keep in mind. Usually when people hear the word hashing function their first thoughts go to MD5, right?
[00:02:22] That's at least for me, when I hear hashing function I think of MD5. Or the other one is SHA-1 or 256 or something like that, these are not the hashing functions you want to use because these are cryptographically secure. Or I guess that's a, I'm not gonna.
>> Speaker 3: Used to be?
[00:02:39]
>> Brian Holt: They were designed to be cryptographically secure.
>> Speaker 3: [LAUGH]
>> Brian Holt: Let's go with that. So if I say the point of Bloom filters is to be fast and efficient, these are designed to be slow. Because that's what cryptography and all those kind of things are designed to do.
[00:02:55] So you don't wanna use a slow hashing algorithm, you want something that's actually really, really fast. So that's why you don't wanna use SHA or MD5, you wanna use something that's designed to go fast. In the exercise I've just provided you, the hashing functions, suggest to use those.
[00:03:12] But that's something to keep in mind.
>> Brian Holt: Any questions about what these are, how to use them.
>> Speaker 4: Is there a known hashing function with the uniform distribution for strings?
>> Brian Holt: I have a general rule that I never write my own hashing functions. So there is a library called XXJS, I think, is what it's called.
[00:03:46] That does that. Yeah. Yep.