Introduction to Data Structures for Interviews Hash Tables
Transcript from the "Hash Tables" Lesson
>> Bianca: Next thing we're gonna talk about hash table. So a hash table is not ordered, so we don't sort hash tables or anything like that. I guess hypothetically you could, but if you were needing something with an order this is not the data structure that you would choose but It is really good for fast lookups and having keys that are flexible.
[00:00:23] So an array index is a numerically based, which can be a constraint, which might not be appropriate data structure for the problem that you are solving. With a hash table, depending on the underlying, depending on the interface, at the very least your keys can be strings. But with new ES6 data structures, we can also store functions and objects as keys if youuse like a set or a map data structure, so that's pretty cool.
[00:01:32] So we need to be able to translate our keys into some memory address, cuz we need to go directly there. So when we are looking for something we either have to know the physical location somewhere or whether it is on disk or in memory. You could think of this as some mathematical formula that says I can just point here and that's that value.
[00:02:02] I don't need to loop through all of these memory spaces to find a particular value. So when we think about a constant time look-up, which means we go directly to the place that it's stored, we need to somehow mathematically calculate that location, right? And obviously that calculation is really, really complex but we can abstract it away just to understand how it works and the core of it is a hashing function.
[00:02:31] So the hashing function takes the key, in this case, we're gonna look at strings. It'll take a string and it's gonna hash it to a numerical value that in our case is going to be an index of an array, but under the hood it might be a memory address, yeah.
[00:02:51] And so hashing function is just a function. For every same input, it always gives the same output, so for every Cristiano is always gonna give 2 every time. That's the property of a hashing function. Hashing functions are used in all kinds of like, you know, you're hashing for security and things like that, passwordw, etc.
[00:03:21] This is a very similar concept except that we are using it to hash to a numerical value in some range and the range is the size of our hash table. So this hash table is the size of 25, so our hashing function, no matter what we give it, it's always going to give us a value between 0 and 25.
>> Speaker 2: And we specify the range? Or it is always gonna be 25?
>> Bianca: So when you are interacting with a hash table you don't even have to think about it all but when you are implementing it, you have to specify it. And then once it becomes full, then you need to make it bigger, and there are strategies on how we do this.
[00:04:05] So typically what you do is once your hash table becomes 50% full, you want to double it. And that's been found like mathematically to be the most efficient, so once it's half full you double it and then you rehash those values to 0 to 50, so then it gets redistributed.
>> Speaker 2: Do you need a new hash function then to like, account for the bigger size?
>> Bianca: So the hash function should take the size. And so, it should be the same function except that you just passed like, instead of 25, then you would pass 50 and you would just rehash everything.
>> Speaker 3: What do you mean by full? Once it gets full.
>> Bianca: Yeah, so how you define full varies but typically, you just count how many items, you kinda have a length or size property under the hood. And if you get 13 items in your hash table of size 25, then you double it, and then you redistribute all of your items within 50 and then once you get to 26, you double it again.
[00:05:25] That's the general strategy.
>> Speaker 3: Why half?
>> Bianca: Why half, there's white papers and stuff on it. That's just typically what you do, but there are all kinds of different strategies, and different implementations, and optimizations for specific use cases. But that's kind of like the general on average the most efficient way of doing it and the reason, how they got to the half is based on how many collisions would happen if you're more than half full.
[00:06:08] I don't know the specific math but since our hashing function only gives us from 0 to 25 there's a very high chance, it happens a lot, that for different keys, you get the same hash, right, and we'll see Anita and Antonio both has to 0. And so we call this a collision, and when we have a collision, if we have many collisions, then that can affect our run time for a look up.
[00:06:42] And so when we think about hash tables, we think of this concept of the average case. In a lot of our time complexity analysis we think about the worst case. However, for things like hash tables, we think about the average case because the chance of the worst case happening are so slim that we don't really even worry about it.
[00:07:06] So that's why we say it's constant time, even though, for example, if we wanted to find Christiano, we would have to go to this index. You can imagine that this is an array, and the value of the array has, this is a linked list implementation but it could be another array.
[00:07:28] It could be a variety of different things. The linked list tends to be a common example of how to deal with equations. So then we need to find Christiano, we'd look at Charles, we look at Cortra and then we get to Christiano.
>> Speaker 4: So, I'm not fully clear on how do collisions happen?
>> Bianca: Yeah.
>> Speaker 4: Cuz there are inputs, get me right, they share the same first letter but then they are different characters. So why do they have the same hash value?
>> Bianca: Because whatever the implementation of the hash function is giving the same value, it has to be between 0 and 25.
>> Speaker 4: When someone decides they happen to have the same value because there wasn't enough space?
[00:08:30] So you could get the character code for each letter and add them up. That would give you a numerical value for that string. And then you wanna get it within a range, so you could mod it by 25, and get a number between 0 and 25, something like that.
[00:08:51] So that's one way, and you can imagine that a lot of strings would just happen to share the same hash value and the frequency in which that happens is based on the implementation of your hash function. There's a lot of hash functions that are like kind of the common ones, and it's also based on what you expect your data to look like.
[00:09:16] So, you know, if you expect your data to be really long strings or if they are gonna have a lot of the same letters versus not. And so if you're doing a lot of strings that have a lot of the same letters and a lot of the same lengths, then counting the character code and stuff like that might be bad.
[00:09:39] You might wanna use a different hashing function. But in general, no one is gonna expect you to implement a hashing function. It's very like, it's kind of like implementing like a random number generator. Like you should be able to reason about it and know what the challenges and difficulties are but you don't necessarily need to implement it from scratch because they're just very difficult problems to solve.
[00:10:09] So okay.
>> Bianca: So the things I wanted to talk about were the hashing function. How the hashing function works. Collisions, how collisions work, and then resizing, when to resize. You guys feel like you got those three main things? Cool, Jodie you have a question? You're having the question eyes.
>> Jodie: I'm not fully clear on the hash tables. I still need to adjust it a little bit but-
>> Bianca: Okay, can you turn that into a question?
>> Jodie: I don't think I could. I think I need the.
>> Bianca: What do you know about a hash tables?
>> Jodie: Well, my understanding was when you said you're adding up the UTF for that.
>> Bianca: The Unicode?
>> Jodie: Yeah.
>> Bianca: Character numbers.
>> Jodie: I don't really understand how that would work.
>> Bianca: Yeah, so the problem that we're trying to solve is we're trying to turn a string into a number within a certain range. And so the question is what strategies can we use to get there and so for each letter, lets say C is like 24, R is like 33, I is 30, something like that.
[00:11:33] You could loop through this string and get all of those numbers and add them up, let's say it's a hundred. And then if you wanted to be within a range of 0 to 25, you just mod it by 25 and then that would give you.
>> Bianca: Yeah, it would give you the range, so.
[00:11:53] So that's just like,
>> Bianca: But the point is not necessarily the implementation. But you should just know that for every
>> Bianca: For the same string, you wanna get the same value every time for your hashing function, so Anita will always return 0 when the range is 25, always.
>> Jodie: [COUGH] And so this could be framed in a interview based on you could have a question and it requires you to look up certain values, or whatever, and so, even though so far we covered the link table, or the link list and this. It would be like, we need something that's gonna return something consistent like each and every time.
[00:12:43] So, we're like, we can't use the link list because we gotta traverse going back and forth, but maybe the hash table because it's holding like an actual location, set location. In this case, 0 to 25. We can almost count on that being more reliable versus a link list and that's why we would wanna use the hash table.
>> Bianca: Exactly and we can use the keys to look it up so we don't even need to understand that it's 0, 25 because that is just under the hood. What we do is we interact with a hash table just like we would an object. So for saving a value to an object, we would say object.val equals 1, right?
[00:13:19] And that would be an insertion and that would be constant time. That val would go through the hashing function and let's say it returns 1. Then we say in the first index of our array, which is what we're building a hash table out of, we would put value 1.
[00:13:38] And then when we want to retrieve it, we would say object that value and that's how we look up a value on an object, and then it would go through the hashing function. It would put val through the hashing function. It would return 1 and then it will go to that index, it would find the value and then it would just return 1.
[00:14:05] So the interface is the same, it's just a matter of understanding how that happens so fast. Cuz usually if we're going to lock something up we don't have a specific address we were gonna have to do some sort of traversing linear time operation, or log in if it's sorted.
>> Bianca: That's kind of the takeaway here.