The Last Algorithms Course You'll Need Tries
Transcript from the "Tries" Lesson
>> So if it's not a priority queue, it's probably a trie, or really, it should be pronounced tree, because it actually is named after a retrieval tree. But somebody really think about that, a tree tree doesn't sound very good. So we have a tree tree called the trie tree, and sometimes they're called prefix trees or digital trees.
[00:00:18] I don't know why they're called digital trees, sounds kinda weird, sounds like a bunch of zeros and ones to me. So this happens a ton. I have been in at least one interview where a trie tree is the answer. And if you don't know what a trie tree is, the easiest way to think about it is an autocomplete.
[00:00:34] So if I type in A, what should you give me back? What kinda results should you give me back? That's just a trie, that's all it is, and it can do it in O(1) time given that the keys meet some sort of minimum length or some sort of length, that they fall within, like the English words, an English word can only be so long.
[00:00:53] There's not a thousand-letter long English word, so therefore, you can't have that, so constant time lookup. We're not gonna implement one just cuz it's a practice in pointers, and we just don't need to do yet another doubly linked list. But instead, we will go through and we will whiteboard one out, they're actually pretty easy.
[00:01:15] So let's go to right here. All right, so we're gonna do a trie. People call it trie, I still like to call it a tree cuz it really truly is a retrieval tree. All right, so how it effectively works is that you have a root. The root does not have any value itself, but what it does have is, by the way, I'll just kinda preface this, we're gonna do an English language tree, right?
[00:01:40] So that's what we're gonna be doing. So if that's the case, that means our tree could have 1 of 26 possible symbols as its children, unless of course, we're also considering underscores or capitalization is case sensitive. If we're considering those other things, then yes, there'll be some other stuff, but usually when you're doing spell checking, you don't necessarily have to consider those things.
[00:02:01] So let's just pretend we're building a spell checker, we'd only need 26 characters. And why I say that is let's say we want to add the word cat to our trie. Well, what we're gonna do is one of our 26 children, which is the c, will simply be a child.
[00:02:19] And then from there, we're gonna go from c and we're gonna add a child a, and from there we're gonna add a child t. And now we're gonna mark t in such a way, and there's two methods for marking t. One of them is that it has one last child called the asterisk child, and that denotes, hey, this path that you're on is a word.
[00:02:38] The other way you can do that is that the nodes themselves contain some sort of is word Boolean flag. I kinda like that approach better than having this star approach because then you have to go check the 27th child. And then if the 27th child exists, then you know for a fact I'd just rather have a more stronger signal than an implicit one.
[00:03:00] So there you go, so we have cat, it's been marked as a word. Then we want to add cats, so how do we do that? Well, we go, okay, we have a c link, we have an a link, we have a t link, well, let's have an s link.
[00:03:17] We now have cats, we've now marked that node also as a potential word. So then of course we add cattle. Well, that's gonna do the exact same thing again, except for at t, we're gonna do this, t, l, I didn't realize it ends right there, E. Well, good thing we're not doing the asterisk approach cuz I would have just got my asterisk in trouble.
[00:03:43] All right, it's a pretty good pun by the way. So there you go, is now marked these three words and they all start with c. So if we were to add, say another one say card, you could obviously see that we could do an r and then a D, right?
[00:03:58] And D would be marked, and then we can also just add other words in, such as MARC, right? We'd have an M, A, R, C, MARC, there we go Mark the name, we have it right there, that's your name. So there we go, we actually just added it to the tree, now we can traverse and find out.
[00:04:18] So if someone starts typing C, what do we have to do to give you some amount of autocomplete back? Well, we're gonna have to start traverse to see do we have an element that starts with, do we have the C branch? Yes, we do. Awesome, so I'm gonna start here.
[00:04:35] So often what this is done is usually the C branch in this case would represent index 2. There are some easy ways to be able to find this out, I figured I'd just throw this in here for fun. Here's the fun little thing for those that haven't done a lot of C programming.
[00:04:49] Here's a great way to know, so you can have zero = "a".charCodeAt, right? So zero is 97, and then if you had a C, you could create a function called index, which say you pass in a string, and this is simply going to return str.charCodeAt(0)- zero. And now you have the ability to go idx("c"), still, right?
[00:05:31] But if you have a proper language that can give you an integer value out, you can offset into it and do these kinda cool little tricks, right? All right, let's go back here. So that means if we were to go to C, someone types a C, and we want to display a list of words, how are we gonna do this?
[00:05:48] Well, what we're gonna have to do is we're gonna have to do a depth first search, right? We can technically also do a breadth first search, depends on how we want our words out. If we want them, say sorted naturally by, say alphabetical ordering, if we do a depth first search, we will always get ourselves the alphabetical ordering, which is actually pretty neat.
[00:06:08] That's all we have to do is do a pre-order traversal, as we see a word, we put it into our results. And there we go, we'd actually get cat, cats, cattle, card, and that'd be great. So we'd actually get them in alphabetical ordering. But if we were to do this, we have to simply traverse down, and add each one of these words to our list.
[00:06:27] We'd come back up, come here, come here, my goodness, where's my pen? Come to right here, and there we go, we found cattle, we found cat and cats. And lastly, we'd come back out here, then come back up here, and we'd find card as well, and now we have a nice list of words to display to a user.
[00:06:44] But another challenge is obviously there's usually a huge set of words. So how do we kinda find the right words? Well, another thing you can do with a trie is you can also add a score or a frequency, and these are things that can be stored really at runtime, all that.
[00:07:00] So if you're building a trie because you have these results, but you wanna make sure that the users' most selected items are getting more often at the top. Have you ever used autocomplete on your phone and you try to type in the word time and it autocompletes to tine, the little pokey part of a fork?
[00:07:14] And you're like, how did you get that when clearly time is the answer, why tine? It's because they have a poor scoring algorithm. That used to happen all the time on Android phones, when you'd swipe type time and you'd miss, it didn't correct correctly, it's just frustrating. But nonetheless, that is effectively how these things work, it is a simple depth first search.
[00:07:33] We've already done depth first searches, we should already know how to do this. We've already done pre-order traversals, this really isn't a hard thing to do, right? Now that we have the tools to do this, it is quite simple to be able to build up this tree. And so insertion is really the hard part, and deletion is really the hard part.
[00:07:51] Insertion is just a simple iterative loop where you go to the current node. So we can do this, we can kind of write this out in pseudocode, right? So insertion, we really need to just start at a node. Really, that just needs to be the head. So we don't even need to make this.
[00:08:05] We need to insert some sort of string, right? So we can have our current equal our head, and then we can simply walk through the character list. So we can do a little for c in str, right? And at this point, we simply walk to each point. If current has the character, then we can just simply set current to the character.
[00:08:34] Else, we're gonna have to create it, right? So we could create a new node, and I'll just have this thing called create N, right, create node. Very, very simple, right? Look at that. Then we could do current, this character equals node, I should have called it n if I was smart, which I'm not, and then current equals node.
[00:08:57] There we go, that's really all insertion is, is we're just walking to the end inserting if we need to insert. And then of course, current, well, that's in the for loop. Outside of the for loop, we need to say curr.isword = true. And there we go, we've just inserted into our trie.
[00:09:19] It's actually not all that hard to be able to write this data structure, it's immensely useful, and often you're gonna find that you get asked some variation of this question, probably in an interview. I really, really like, we've asked them at Netflix, they're fantastic. Deletion, of course is the exact opposite, easier to use recursion for deletion, because you want to get to the point of the node.
[00:09:42] And then in post operation, you want to delete your way back up. Remember how I said that there's three parts to recursion, pre, recurse, post? This would be a post operation or a post traversal that makes a lot of sense. Long as you have your parent and yourself and the expected node in which you, or the expected link in which you want to be able to delete.
[00:10:04] You get all the way to it, get the link you wanna delete, delete it from your parent as you walk back up. And so it makes sense, cuz if you delete before you go, you're only gonna make one deletion, and then the whole tree is gone. You don't wanna do that, that sounds terrible.
[00:10:17] And of course, you only delete if you're the only link left, right? Or you could just leave the node around, right? Depends on how you wanna use your memory or not. That makes deletion way faster, it makes it a constant time deletion versus a constant time deletion, but there's a constant of times two on that deletion cuz you have to delete back up.
[00:10:34] So it's a little different, just things to trade off, things to be able to talk about, things to be able to kind of explain to someone why you choose one way versus the other. If memory is not a problem, leave it, if it is a problem, remove it, you get the idea.
[00:10:47] We always had to delete it at Netflix cuz often we'd run on small devices, so we'd actually use a special data structure that is like a trie. That's how we did our caching, I explained Falkor yesterday. Falkor is effectively just a trie, that's all it is. It's a big trie that is represented as a tree and each one of the keys in, that's how you move around.
[00:11:07] So it was pretty fun, it was pretty neat, so I've used a trie in my real life as well. I just kinda wanted to walk through it and talk about it cuz they're super cool. They're not all that fun to implement, but they are super cool. So there we go, everyone good to go, do we have any questions?
>> Someone did mention this comes up in interviews, and I'm interested, what does the interview question look like on a test?
>> Interview question is often like, I have a list of words and I wanna build an autocomplete, when you hear the word autocomplete, be like, gotcha, right?
[00:11:38] Or I wanna build a caching mechanism, often caching mechanism if you can see a path that they want, and the path can vary in various points kinda like the Falkor example, right? So if our path to our data looks a little something like videos, right, Id, title, and we want to cache this in some sort of graph, Id is the thing that can vary, right?
[00:12:02] We can have one, two, three, we can have four, five, six, we can have seven, eight, nine, so obviously we're gonna want something that has like say, videos, does videos even exists? It does exist, does this Id exist? This Id does exist. Does title exist on that Id?
[00:12:14] It does, here you go, here's your cache. That Id doesn't exist, sir, we don't have it, we don't have title, we don't have it. So it's usually like a caching slash autocomplete problem that you're often gonna see this in.
>> What was written in the first line after the else?
>> The first line of the else, yeah, yeah, that was some super pseudocode right there, right? I simply just said, hey, let's create a node, right? What is a node? A node of course is gonna look something along the lines of isword, which is a Boolean value, plus a children, which could be a 26th-sized array of nodes, if we're doing an English autocomplete.
[00:12:53] So it's kinda like a hard coded, this is the size of what we're gonna be doing. If it was a language in which did that of course, these would be pointers. You could just keep on creating that or else you'd explode your memory until you die. So there you go, that would be creating out.
[00:13:08] We got another question?
>> Yeah, sorry, you mentioned this is constant time.
>> So it is constant time cuz you have to determine what is N? N in this case is not the string we're inputting into the algorithm, it's the amount of nodes. As we add more and more nodes, as we add hundreds upon hundreds of words, does our lookup time change?
[00:13:31] It doesn't change, it's based on some finite predetermined amount of length. If the longest English word is 12, then you're right, it's 12 node checks and that is it. So that's a constant time is that's how you can think of it. Often you'll run to these algorithms where N may not be what you think it is, you can kinda swap it around in a unique way.
>> So it's technically the height, but the height is bound by the English dictionary?
>> Yes, so Graham just said it's technically O(H), H being the height, but height is usually bound by some maximum sequence, yes. I'm sure if there's a way in which the height is not bound by maximum sequence, then it would now then be O(H), right, if you had an ever running longer one.
[00:14:16] But I'd probably question that you're probably not using the right data structure if you have some ever longer running string that keeps on getting bigger. Probably not the right way, cuz now you're just gonna have a huge amount of memory being used.