Transcript from the "Wordy Unscrambler Overview" Lesson
>> Okay, progressing right along, we're gonna turn our attention to a somewhat different kind of a problem. I alluded to this previously, that we're gonna turn our attention to words. And we're gonna talk about algorithmically solving problems around words. So, this I call Wordy Unscrambler. I don't know if any of you have ever played games like Wordscapes or these other games where you have to mix and match letters, or reorder letters to spell different words.
[00:00:33] I used to play Wordscapes a whole bunch, and I think I got to like level 2000 or something in Wordscapes and I was like it's the same game. It's never getting harder or different or more interesting, so I gave up. Anyways, I used to play a lot of Wordscapes.
[00:00:48] And I also used to cheat at Wordscapes. There are websites out there where you can give it a set of letters that can make a word and it'll tell you all the different words that you could make for that. And so occasionally, I would get stuck on a puzzle and I would go and use one of those cheating websites for that.
[00:01:06] And that's what inspired me to make a problem, because I wondered what would be the algorithmic approach to trying to do this unscrambling of letters into real dictionary words. So that's what we're gonna be tackling in Wordy Unscrambler. So let's think about this, again, we're gonna take this in little pieces.
[00:01:29] We're gonna try to get a working solution, and then we're gonna try to optimize it just like we've done before. This one is almost less about the algorithm and almost more about picking the right data structure. Because the choice of data structure is gonna end up being the whole game here.
[00:01:47] So just as a forewarning in some of the other ones, it was really more about what algorithm is the right algorithm to pick here. It's the right data structure. In general, both of those are really critical choices in any solution that you write. The data structure you pick and the algorithm that you use, they have to be well-matched together, and they both have to be well-aligned to the problem.
[00:02:09] And we've learned that lesson somewhat painfully over the last several exercises. So the same will be true here but it's more heavily weighted on choosing the right data structure. And that is a difficult task because if you haven't seen a data structure, it's hard for you to even conceive of what data structure to look for.
[00:02:26] It's hard to know what to Google for. So really, at least part of this is just exposing you to a few other kinds of data structures that you maybe haven't worked with very much, so that you have more words to use for your Google searches. All right, so let's say that we had the input letters, P, C, E, R, and A.
[00:02:48] And yes, that set of letters is actually my little pun, because that's how old I am, I grew up in the PC era, back before we had all these laptops and stuff like that. I grew up where we built desktop computers and put hard drives in them. And I was working in MS-DOS 3.0 like so.
[00:03:11] I go way, way back, so yes, there's a little bit of a pun there. But if we just started with those letters, P, C, E, R, and A, and we tried to come up with a list of the possible matching words, we would need a dictionary, a list of valid words to be able to match against.
[00:03:29] So if we started out with this dictionary, and we said what words of this dictionary could be made by unscrambling, reordering some or all of these letters? So, it's pretty straightforward that we could run through and see that the word can is not able to be made by those letters, but cap, cape, car, care can.
[00:03:54] Sorry, my check marks are misaligned here, but you can visually shift those check marks up. That we've got cap, cape, car, care, and ear. Those are able to be made, but the other ones are not able to be made because there's a missing letter. There's a missing N in our input, that's why can, there's a missing L, and there's a missing H.
[00:04:21] That's why those words can't be made. So, if we started to try to formulate an algorithm for this, what is one of the first most important questions that we wanna tackle? What occurs to you as a question that we should ask about this problem?
>> First one that occurs to me is how am I checking against words?
[00:04:45] Do I have a big word bank that I can go into and-
>> What is my word list, what is my dictionary, and where am I gonna get a dictionary? That's a very important question, right? Similar to, I don't know the periodic table elements by heart like my kids do, so how do I get a list of periodic table?
[00:05:01] Fortunately, that was simply a quick Google away and I found a JSON file with the periodic table. I thought, surely I'll be able to Google good English words dot JSON and out would spit a nice, well-formatted, great-to-use dictionary. That turns out to be completely not true. It turns out that it's much, much harder than I would have liked to get a dictionary.
[00:05:30] Because it turns out that that's not actually a very well-defined concept. What is a dictionary word? Let me give you some examples. The word laser, L-A-S-E-R. Should that be a dictionary word or should that not be a dictionary word? That's actually a subjective question that not everybody agrees on.
[00:05:54] Why? You might not know this, but laser is actually an acronym. And some people say that acronyms don't belong in a dictionary. And other people say, well, it's a commonly used word, so it belongs in the dictionary. There's a ton of subjectivity here. What about contractions like don't and can't?
[00:06:12] Should those show up in a dictionary? What about plurals and other conjugations of words like with ed or ing? Are those words that should be separately in a dictionary, or should they just be applied or assumed from the conjugation of a word? There's tons of subjectivity here on what should or should not be a valid word?
[00:06:33] There's also offensive words, and how we even define what is offensive is hugely subjective. And I don't even wanna wade into the waters of what makes something an offensive word or not an offensive word from all different angles, religious, and cultural, and so forth. So it turns out that it's not very easy to just Google and say, give me the list of dictionary words and then I can start running my algorithm against.
[00:06:58] You'll notice in the project that I've provided for you that I provided you three different dictionaries. One is a very tiny dictionary that only has 80 words in it. One of them is somewhat longer dictionary of about 2400 somewhat common English words that I actually hand picked all 2400 of those words.
[00:07:18] And then there is a really giant dictionary of 300,000 something words, and that is one that I just found somewhere, and it has a ton of junk in it. It has all kinds of acronyms and abbreviations. That's another one, abbreviations. Should abbreviations show up in a dictionary? Another subjective question.
[00:07:38] So I provided you those three different JSON files because it's going to be useful for us in terms of our performance benchmarking to check against different sizes of dictionary. So in our little application, we'll have a selector where you'll pick the tiny dictionary, the regular dictionary, or the very long dictionary, and we'll see different timings as a result.
[00:07:59] And importantly, the reason why that is in our app, why we're doing that, is because there are different performances that we actually need to concern ourselves with. That's not a monolithic idea, like how fast things run because there's the startup performance, and then there's the run-time performance. And we haven't really talked about that much, but there is a tradeoff sometimes between doing more work up front, things take a little bit longer to start up, but then it runs faster as it goes.
[00:08:30] That's yet another tradeoff axis that we have to be worried about. So when we load up a data structure with our dictionary words, it might be a really fast data structure to build, but it might be kinda slow to walk through. Or it might be kind of a slower data structure to build, but really, really fast to walk through.
[00:08:49] So you can see the trade-offs, and we'll have timings in our app for the loading of the dictionary data structure and for the searching of the dictionary data structure. Before we go any further, there's actually one more critical question that we need to ask to clarify the behavior here.
[00:09:10] And I wonder if it occurs to anyone what that question should be.
>> Can we use letters twice?
>> That is the question. They got it. That's an important question here. What if I have a single P in my input characters, am I allowed to do words that have two P's in it, like the word appear?
[00:09:29] Would we want to allow appear to match here because it's reusing the P and it's reusing the A? That's a question, and it turns out the answer is no, we don't wanna allow that. [LAUGH] We do not want to allow the duplication of letters. Now, that's a choice that we're making.
[00:09:48] You might think it seems like it would be easier to allow the duplication. It's much harder to do this algorithm and to come up with a data structure that allows the repetition. So we're going to choose not to allow repetition of the inputs. If you want to have two P's, then you have to put that in the input.
[00:10:09] But again, going back to the very beginning of the workshop, asking these kinds of clarifying questions, radically changes the kinds of data structures and algorithms that we use to solve the problem. You don't wanna ask that question after three or four weeks of implementation work. You wanna ask that question during requirements gathering.
[00:10:28] I cannot emphasize enough, that is not the product manager's job, that's the algorithmist's job, right? You have to make sure you understand the entire constraints of the problem. You won't understand it, but you have to understand as much of it as you can and ask as many of those questions early.
[00:10:45] So developing the habit of asking those questions is a key skill.