Practical Problem Solving with Algorithms

Implementing a DAWG Data Structure

Kyle Simpson

You Don't Know JS


The "Implementing a DAWG Data Structure" Lesson is part of the full, Practical Problem Solving with Algorithms course featured in this preview video. Here's what you'd learn in this lesson:

Kyle implements the directed acyclic word graph. A third-party library is added to the project and the loadWords function is refactored. The option-4 branch can be used as a starting point for this lesson.


Transcript from the "Implementing a DAWG Data Structure" Lesson

[00:00:00]
>> The first thing that we need to do is load up this library. You can copy this from the option-4 branch in the same way that we copied in the object pool if you'd like to. This one unfortunately doesn't get distributed as an ES module, so we're just gonna load it in the old-school script tag way.
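For context, the old-school load is just a script tag ahead of the app code. A minimal sketch, where the src paths are placeholders and not the library's actual file names:

```html
<!-- sketch only: the src paths are placeholders; the point is that the
     library must be loaded before any app code that uses MinimalWordGraph -->
<script src="js/external/min-word-graph.js"></script>
<script src="js/app.js"></script>
```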

[00:00:26]
But you do need to make sure that it is loaded. In our wordie app, we're going to take out the isWord and we're gonna take out the pool. And our dictionary is going to be a minimal word graph, so we're gonna construct an instance of MinimalWordGraph. In our loadWords function, we're gonna be able to take out all of this code, because the new data structure is gonna take care of all the work of constructing the dictionary for us.

[00:01:15]
So really, all we need to do is double check to make sure we don't need to empty it out. So we're gonna say, if dict.size() > 0, then recreate an instance of our MinimalWordGraph. And then we're gonna loop with for (let word of wordList) to go through each of the words.

[00:01:39]
And this is the nice part is that we're just gonna call their method dict.add and pass in the word. So that data structure and their API is taking care of all that complexity under the covers for us. There's one final step that we wanna do, and this is basically the step where it does all that compaction and memory efficiency stuff.

[00:02:04]
It gets rid of all the intermediary steps. They call it makeImmutable, which is actually one of my favorite API names I've ever seen in a library. It's just very straightforward about what it's doing: it's making the word graph immutable. What we should take from that, importantly, is that once we have constructed this word graph and done all of the compression to get the nice, efficient memory usage and all that, we cannot add any more words to the dictionary.
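Put together, the refactored loadWords sketched in this part of the lesson comes out to roughly the following. The dict and wordList names are assumptions; the MinimalWordGraph calls (add, makeImmutable, size) are the ones named in the lesson:

```js
// Sketch of the refactored loadWords(); variable names are assumptions,
// while the MinimalWordGraph API calls are the ones described in the lesson.
var dict = new MinimalWordGraph();

function loadWords(wordList) {
    // if a graph was already built (and compacted), start over fresh,
    // since an immutable graph can't accept any more words
    if (dict.size() > 0) {
        dict = new MinimalWordGraph();
    }

    // let the data structure do the work of building the dictionary
    for (let word of wordList) {
        dict.add(word);
    }

    // compact the graph for memory efficiency; this is a one-way step
    dict.makeImmutable();

    return dict.size();
}
```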

[00:02:34]
In our case, that's no big deal because we pre-loaded the whole dictionary. There might be cases where you need to be able to add words to the dictionary as things go along, in which case this is not the right data structure to use. Cuz this data structure really gets its benefit from the ability to have compacted it, but that's a one-way operation.

[00:02:55]
Once you've done all that compaction, you can no longer add stuff to the graph, cuz it just breaks things. So for us, that's good. We go ahead and make it immutable and we return the dict.size(). We are going to... yes, question?
>> How does it reduce the memory usage while the same nodes still exist?
>> How does it reduce the memory usage while the same nodes still exist?

[00:03:35]
>> It does not create all of the same nodes, it removes any unnecessary nodes. And the way that it removes the nodes is not to actually combine nodes, but to create additional links. That's what makes it a graph versus a trie tree. But it does actually literally have far fewer nodes in memory.

[00:03:57]
We will see, once we run our working solution, that the number of entries in the data structure is drastically lower than in the trie tree. Yes?
>> As an aside, this is where I work, and the typical graph in production has millions of nodes and billions of edges.
>> As an aside, this is where I work, and the typical graph in production has millions of nodes and billions of edges.

[00:04:16]
So you are going to have a lot more links than entities in general because it's a much more efficient strategy.
>> And references are a pretty small value for JavaScript engines to keep track of. Is it creating more memory to keep all those references? Yes, but it's way less than creating a big old node of string value in it or something like that.

[00:04:39]
References are pretty small. We're gonna call the API that they provide, and they actually have several of these you can read through. Fair warning, the documentation for this particular library is not as great as I wish it would be, and it doesn't seem to be a super well maintained library, so your mileage may vary with using it.

[00:04:58]
But I was able to get this to work. They don't say this in the documentation, but you absolutely have to pass an array. The documentation implies that if you pass a string, everything's fine, but I can tell you, you have to pass an array. [LAUGH] That I know for sure.

[00:05:16]
So here I'm just making sure that we are in fact passing in an array. If we've received a string, then I'm just turning it into an array with split. So containsOnly is the method name on that data structure that basically does the search that we're talking about.
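As a rough sketch of what that ends up looking like, something along these lines. The wrapper function name is a hypothetical, but the array check and the containsOnly call are what the lesson describes:

```js
// Sketch only: the wrapper name is hypothetical, but containsOnly()
// needs an array of letters rather than a string.
function lookupCandidates(dict, input) {
    let letters = Array.isArray(input) ? input : input.split("");
    return dict.containsOnly(letters);
}
```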

[00:05:40]
That's our findAllWords functionality, basically. So we're gonna be able to take away findAllWords. We still have the same problem, though. Their data structure still has the same problem, which is that if we passed in duplicate input letters, they might end up returning duplicate output words.

[00:06:12]
So in the code repo, you'll see this line of code. Somebody asked earlier, could we just deduplicate the input letters before passing them in? In our previous algorithm, that would not work. But something that's interesting about the way that containsOnly works is that it does not care, it will find all of the matches even if you didn't provide enough of those letters.

[00:06:43]
So technically, we don't actually need line 35; this is the way it is in the repo. Technically, we could have solved line 35 by deduplicating what's in the input. So you could do it that way: pass a deduplicated set of input into containsOnly, and then you won't get those repeats out, and then line 36 wouldn't be needed.
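If you went the deduplicate-the-input route, a minimal sketch of that alternative (not the repo's code, and the variable names are placeholders) could be:

```js
// Sketch: strip duplicate input letters before the containsOnly() call,
// so the results don't come back with duplicate words in them.
let uniqueLetters = [ ...new Set(letters) ];
let candidateWords = dict.containsOnly(uniqueLetters);
```

As noted a little later in the lesson, you'd want to do that deduplication after counting the input letters, so the counts still see the duplicates.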

[00:07:09]
Whether you deduplicate the input or you deduplicate the output, there is, however, another annoyance that we have, which is that they were matching words that we said at the beginning we don't want to allow. In other words, they would have matched appear, because they allow that duplication of those letters.

[00:07:30]
So it's both a good and a bad thing that their search algorithm works that way. So now we are going to be returning words that our previous algorithm would not have returned, because we would allow a single p as an input to match the word appear even though it needs two p's.

[00:07:47]
So, to fix that up so that this code works the same as the code before: you could make a decision at this point that I'm using a data structure that allows the repeats, and maybe that's how I want my app to work. But just so that we are comparing apples to apples, we're gonna show (we kind of hinted at this) a way of ensuring that we don't end up with any of those false positives.

[00:08:12]
And that is that we're gonna implement a utility called countLetters, okay? And this is obviously a very down and dirty way of doing it, but we wanna go through any string and count how many occurrences of each letter there are in the string. In fact, this is a classic interview question.

[00:08:32]
And I'm only showing one very down-and-dirty implementation that you could do for counting the number of letters in a string. So we're gonna go through each character in the string, and we're going to create a reference in this counts object for that character. So we'll say counts[str[i]], and then we're gonna set it equal to... I'm just trying to write it the way I have it in the repo so that the code compares here.

[00:09:13]
This is just a way of code golfing, writing fewer lines of code. What I'm basically saying is, if it's there, use it. Otherwise, default to 0, and then add 1 to the count. So I'm incrementing the count, but I'm defaulting it to 0 if it hasn't already been set in there.

[00:09:33]
And then return counts. So for the word appear, it would return an object that had a: 2, p: 2, e: 1, r: 1. Everybody follow that? That's what this function does. We're going to need those counts for our input, so I'm gonna come back here and get those at the beginning.
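Written out, that down-and-dirty countLetters utility comes out to roughly the following; the exact expression in the repo may differ slightly, but the default-to-0-then-add-1 logic is what was just described:

```js
// Count how many occurrences of each letter there are in a string.
// countLetters("appear") returns { a: 2, p: 2, e: 1, r: 1 }
function countLetters(str) {
    var counts = {};
    for (let i = 0; i < str.length; i++) {
        // if we've already seen this letter, use its count; otherwise
        // default to 0, then add 1
        counts[str[i]] = (counts[str[i]] ?? 0) + 1;
    }
    return counts;
}
```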

[00:10:02]
I'm gonna say inputCounts = countLetters(input). If you did the deduplicating of the input rather than the set approach, whichever way, you would wanna do that after you counted the letters, right? So do that deduplication here on line 38. Okay, so to fix up our result set so that it filters out the words that we should not have allowed in there, the words their search should not have allowed, we're just gonna do a plain old filter function.

[00:10:40]
Again, just trying to kind of quickly sketch this kind of logic out. For each word in the return set, we're gonna get its counts. And then we're gonna go through all of those letters and compare the count to what's returned to the counts of our input, and if they don't match, then we're gonna reject it.

[00:11:13]
So, for the letter and the count of Object.entries(wordLetterCounts). Oops, I misspelled that. So we're saying: this object that got the counts for us from this particular word in our candidate results, we're gonna go through each one of those entries. So we're gonna go through the a, the p, the e, and the r entries in that object, for example.

[00:11:52]
And we're going to compare the count that was returned from this word, so p: 2, to the count on our inputCounts object. So we're gonna say, if there wasn't that letter at all, which should never happen, that's just sort of a sanity check.

[00:12:15]
But it shouldn't ever happen that it gives us back a word with a letter that wasn't in the input set, or that the count is greater than what it was in our inputCounts. If either of those things is true (I got one too many parentheses there), we can reject this word, because it should not be in our list, so that's return false.

[00:12:42]
And if we make it all the way through all the letter counts and everything was fine, we return true to keep it. Thumbs up, thumbs down, how do we feel? Does that sound all right? It's a little bit regrettable that their API doesn't allow a flag that says, don't allow duplicate letters or something like that.

[00:13:09]
It'd be nice if their API did that. In this particular case, we just have to clean up the results a little bit. So it's kind of a paper cut, but it's on a much smaller set than the input set. So algorithm-wise, we're not too worried about the performance hit there.
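Pulled together, that cleanup filter looks roughly like this sketch. The candidateWords and input names are assumptions; the compare-the-counts logic is the one described above:

```js
// Sketch of the cleanup filter; variable names are assumptions.
// countLetters() indexes by position, so it works on a string or an
// array of letters for the input.
let inputCounts = countLetters(input);

candidateWords = candidateWords.filter(function keepWord(word) {
    let wordLetterCounts = countLetters(word);

    // compare this word's letter counts against the available input letters
    for (let [ letter, count ] of Object.entries(wordLetterCounts)) {
        // reject if the word needs a letter we don't have at all (sanity
        // check), or needs more copies of it than the input provides
        if (!(letter in inputCounts) || count > inputCounts[letter]) {
            return false;
        }
    }

    // all of this word's letter counts fit within the input, so keep it
    return true;
});
```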

[00:13:30]
If I did not make mistakes, if I was able to successfully type that in, we should be able to go back to our page. Oop, I did make a mistake on line 30. Aha, this colon should have been a semicolon. Another mistake, line 40. That's a question mark, sorry.

[00:14:06]
I didn't even finish that line, I got ahead of myself. Array.isArray, we just return input, and then colon. Okay, right off the bat, notice that our 2,400-word dictionary is now being represented by a directed acyclic word graph that only needs 1,562 entries. So we see it having fewer entries than there are words in there, and that's because it's able to do all of that sharing, okay?

[00:14:40]
If we go up to the 370,000-word dictionary, you'll notice that it did take 810 milliseconds to create that instance and make it immutable, do all the compaction on it. This is a pretty large dictionary as dictionaries go, but that is a noticeable impact on startup performance, something that you would have to think about.

[00:15:03]
Can I do this in a background thread, or hide it somewhere, or whatever, okay? So just be aware of it. It's not 20 seconds or something crazy that's completely impractical, but it is something to be aware of. Notice here, though, instead of having the 1 million entries that we did in our trie tree, we now have 160,000.

[00:15:22]
That's less than half the number of words in our dictionary. So we went from three or four times as many entries as words to less than half, a pretty significant memory savings. So again, trade-offs: better memory, a little bit harsher penalty on the startup time. Let's go back to the 2,400-word dictionary.

[00:15:49]
Let's try our examples again, like the word America: six milliseconds. We were seeing around one millisecond before, so slightly slower. We wanna make sure that we don't see growth, though, so let's try it with nine. Still pretty quick. Still very quick. Even something much longer, right? So we see that it's still not going into runaway growth, still performing pretty much linearly.

[00:16:32]
And that's what we know we get from those algorithms, like the traversal of a trie tree or, in this case, a slightly more complex data structure. But under the covers, that containsOnly is fundamentally doing a similar algorithm to ours.
