Formatter Exercise

zed.dev

Lesson Description

The "Formatter Exercise" Lesson is part of the full, Write a Compiler That Understands Types course featured in this preview video. Here's what you'd learn in this lesson:

Richard walks through the formatter exercise comments with instructions to make changes to resolve failing tests. The exercises involve formatting array literals, call expressions, and binary expressions by iterating over parse tree nodes. Richard also discusses potential edge cases and considerations in compiler development, such as stack safety and handling different function call scenarios.

Join Now or Start a Free 5-Day Trial

Preview

Transcript from the "Formatter Exercise" Lesson

[00:00:00]
>> Richard Feldman: We have some exercises here. So if you've checked out the repo, you can open up exercises/1/formatter.js. There's a couple of different files in that folder. I'm gonna go through what they all do in a sec. But the format is the one where we're gonna make our changes.

[00:00:13]
So basically, there's comments that have a little arrow finger pointing to them, and what we're gonna do is do what those comments say. So let's hop on over here to my trusty Zed editor. I work at Zed, so shameless plug. And also, I love it. [LAUGH] But basically all of the exercises have the same format.

[00:00:31]
So you've got index.js, and index.js is essentially just a bunch of tests. So it's like, hey, does the const declaration format the way we expect it to? Great, looks good. What if we have one of the type annotation we're not actually gonna implement type annotations in this, course, but we are gonna talk about them, and so the lexer and the parser do support them, but we don't go as far as to actually implement them.

[00:00:53]
String literals like hello world, they format, okay, etc., etc. So there's a whole bunch of tests in here but unfortunately, the tests don't work right now. Well, they don't all pass right now. So if I do node, sorry, I need a cd into their firstm cd exercises/1, node index.js.

[00:01:14]
We actually get some test errors, quite a few of them, actually. Now I really wanted to make this a zero dependency course, you don't have to install anything. So this is our little, extremely fancy test runner. It's test.js, it's one directory up. But basically, it's just like, okay, some of these are not formatting things correctly.

[00:01:31]
Why are they not formatting correctly? Because there's a giant to do here in the output, instead of what should actually be output. So let's go take a look at that. I'm just going to show you these files. We're not actually gonna make any changes them, but just so they're there.

[00:01:42]
This is the tokenization thing for what we just talked about. Basically, this is an actual implementation of this. Like I said, this is not a course about parsing and tokenization. So if you wanna take a look at this code, you can feel free to do that, but you don't really need to for what we're talking about, same thing with parse.js.

[00:01:57]
So every single one of these examples that we're gonna go through in the exercises has a tokenizer and a parse just cuz the compiler needs it. But again, we're not really gonna spend a lot of time looking through it. It's just like, man, if you're curious, it's cool.

[00:02:07]
Here's all the stuff. We got constant, we got return, it's what makes the tokens and then later the parse tree. Now we are gonna definitely spend time with parse tree notes for sure. And starting with right here in the formatter. So here we have our formatter, and if you look at the top, we've got our very first little finger pointing arrow icon that basically says okay, Rod, run node index.js from the parent directory to see which tests are failing.

[00:02:32]
We already did that, but if you wanna do that locally, there's the command and the instructions for how to do that. Then basically it's like, fix the failing tests by resolving the comments for this down in the file. So if you scroll down, you can see there's a bunch of stuff that is actually correct and working.

[00:02:44]
And there's a bunch of logic around, I mentioned big switch statement, right? It's like doing LOL. We got a block, we got a string literal, numeric literal, you don't need to know how all this code works. Mostly, the point of these exercise is we actually do want it to work.

[00:02:58]
So you can actually fix it and watch the test pass. So again, if you wanna sorta poke around and see what this is doing, I would not claim this is like production grade compiler code [LAUGH] but it is at least functioning well enough that we can see what it's doing.

[00:03:11]
Okay, so once we get down to the array literals, this is where we've got the problems. So basically, what we wanna do is we want to resolve all of these little finger emoji icon. So there's one here for format array literal. This is right below a format return statement.

[00:03:28]
Then there's another one for format call expression. So this is when you're calling a function. And then finally, we have a format binary expression. That was last one, right? Yeah, format binary expression. So binary expression is essentially where you have, in our case, it's just + or *.

[00:03:43]
Most languages have more than just plus or times, but that's all we've got. But the important part here is that in each of these examples, I've given you sort of the structure of this node argument. So you can just see, okay, node, the this argument you've got right here is going to have an array of elements, and each of those elements is a parse tree node.

[00:04:01]
So what you wanna do in the exercise is essentially end up with, instead of returning this TODO string, you wanna return a string that looks like this. Based on iterating over these elements that you've got in there, each of which is one of these parse nodes. Same thing kinda here, this same basic idea except instead of an array, now you've got a call expression.

[00:04:20]
So this has actually two different fields on its structure, so one is callee. Callee basically refers to the function that you are calling. Note by the way that, I mean, typically we think about this as like, foo(arg1, arg2). But one of the many edge cases that comes up in compilers, which we're actually our compiler is not gonna bother implementing, but we are gonna format it in case someone writes it that way.

[00:04:40]
Well, actually, let me put it to you, can anyone think of another way that you can call a function that's not involving putting a name here?
>> Student 1: Lambda functions?
>> Richard Feldman: Yeah, exactly, lambda functions. So you can do a is whatever, and then immediately call it, right? So that's why callee is a parse tree node and not just a name, cuz it could be.

[00:05:00]
This whole thing right here could be this gigantic expression, and that has to be supported. So this is something that definitely comes up in production grade compilers. So the callee is a parse tree node. It's not just a name, but it might be a name. But we have to treat it as though it might be anything.

[00:05:15]
And then of course, you have the arguments that are coming in. And so basically, unlike the array example from earlier, we actually need to do a traversal of the arguments, but then also deal with the callee separately. And then finally, we have the binary expression, which is basically just operator, like plus or times.

[00:05:30]
And then there's the left, which is basically the thing that comes before the operator, and that's gonna be a parse tree node, and then right, which is another parse tree node that's on the other side of the operator. And then there's the operator name itself, which is just gonna be a string, like cluster or time.

[00:05:40]
Something like that.
>> Richard Feldman: Cool, let's go through the solutions. All right, so first we had our format array literal. By the way, there is a solutions directory. If you are totally stuck and just wanna take a peek, you can. But it's definitely better for learning if you don't.

[00:05:58]
[LAUGH] So I'll use it to go through the solutions, but only use if absolutely necessary, or later on if you wanna go see how it worked. All right, so we're gonna skip over this sort of edge case here first and come back to that in a sec. The main sort of bread and butter here is, you got, okay, I want to take my node node.elements, you may recall from this thing, this right here, it's an array of parse tree nodes.

[00:06:21]
And just map over it, and then for each element, we wanna call formatNode on element. So formatNode, if you poked around in the source code file, this is sort of the generic I've got a node, and then I want to turn it into the source code format, an actual string.

[00:06:36]
And one of the things that formatNode does is in its gigantic switch statement, is actually calls our array literal here. So we are actually doing something recursive here potentially. If we format an array literal inside array literal, we go in here. In the middle elements.map we call formatNode, formatNode comes back down here, and then how many times we have nested arrays that'll just keep running.

[00:06:58]
After you got done with that, then we're just doing a join with a comma. Did anyone try to get fancy and make it so that it did the multi-line thing that we talked about earlier? I wouldn't either for ten minute exercise, but you could. If you wanted to get fancy with it, this is exactly where you would do that.

[00:07:14]
And then do the thing where you record in the parse tree node, and then you have to escape the formatter, and go back into the parser to actually write that down. But then at that point, you could just look right here, sorry, at this node, this one right here, and just say if node.isMultiline or something, has trailing comma, then you could just branch right here.

[00:07:34]
And instead of doing this, you would do an indentation and join on new line, and stuff like that. Cool, and then, of course, we have the edge case, which is that if the length is zero, then you don't want to actually put spaces in the middle of in the case of an object or something like that.

[00:07:49]
Now, in this particular case, if you didn't do this, nothing bad would happen. But there are quite a lot of scenarios where that does matter, records or objects being the main example. So usually, the way that formatters will do those is something more like this. But of course, you don't want your empty object to look like that.

[00:08:06]
So that's where that edge case comes up. In this case, it's more of a just habit of putting the edge case in slash, maybe a performance optimization, although in this case, not really. You're just saving yourself a string interpolation. But yeah, the main bread and butter is just this.

[00:08:21]
So for our purposes, we can just copy this over and demonstrate that'll just work. Okay, so now we should be down one of these arrows. So here we had all these tests failed, and now a slightly shorter list of tests failed. Hooray, we're making progress. Any questions about that one before we move on to the other two?

[00:08:41]
All right, so next one is the call expression. So we have the callee, which is a parse tree node as previously mentioned, not an identifier. Then we have the actual arguments, which is what we're gonna need to iterate over, just like we did with the array. So here's what that can look like.

[00:08:53]
So, first thing we do is we call formatNode once again on the callee, and that'll get us back a formatted string that we can just later on, using string interpretation, so this interpolation. So this will be either foo, if it's the name of the function, if it's an identifier node, or it could be something really fancy, with parentheses around it.

[00:09:10]
Either way, this is just gonna work. Worth noting that if this were a production compiler, this is the part where you would probably need some logic to try to answer the question, do I need to insert parentheses here? Because this parse tree node might not, depending on how you're representing things, might not record whether or not it had parentheses around it.

[00:09:29]
And if you've discarded those parentheses, you might need to strategically reinsert them. So you wouldn't wanna put these if it's an identifier, like foo, then it would be really weird if you had. I mean, it could work, but nobody wants their code formatted like this. But if it's something like, it's (x => x + 1) or something like that, then you need the parentheses.

[00:09:48]
Because otherwise, if I don't have the parentheses there, then it's gonna break. Now I'm trying to call 1 as a function. So this is another good example of where these are the types of edge cases that we're not gonna bother handling in our compiler, but I am gonna call some of them out along the way, just so you can be aware of them, yeah.

[00:10:06]
>> Student 2: As a design decision here, are we guarding upstream for null inputs? So if I just call the function without inputs, will it give me something if I click Python where it gives you the object, or do we throw an error-
>> Richard Feldman: I see, so basically, question is like, so if I don't pass in any arguments, what happens in this compiler?

[00:10:27]
This compiler accepts that. We actually have first class support for zero argument, like function calling. So this can just be an empty array when it comes in but yeah, different languages like handle those scenarios differently. But yeah, in general, with the one exception, arguably, of generics, we are gonna go through generics, aka parametric polymorphism, in our type checker.

[00:10:48]
But that's basically mainly because if you're doing Hindley motor type inference, it's either free or necessary anyway, depending on how you look at it. So it's not that fancy if you're doing a Hindley motor type checker. But other than that, in general, if you're wondering, are we gonna do this fancy thing?

[00:11:02]
Probably the answer is no. [LAUGH] It's gonna be a very, very un fancy subset of JavaScript we're compiling to here. All right, so in this case, there is no optional initial edge case that we might or might not do. This is just all necessary. This part is basically identical to what we did in the previous one, so with the one exception of the extra formatNode there.

[00:11:21]
So let's run that again and, hey, even fewer tests failed and I think one fewer, great. And finally, we have the format a binary expression. So here we have three fields that we're dealing with. So left operator and right. Now, I'll show you an example of how we could implement this incorrectly, but it's pretty obvious, pretty much right away how this is incorrect if you do it this way.

[00:11:44]
So format the thing that's to the left of the operator, format the thing that's right to the operator. If I were to just say left and right, well, I'm missing something pretty important for an operator, which is the actual operator. So if I run this, I did not make any tests pass because although we don't need to format the operator.

[00:12:05]
So if I try to do this, I will get a different error. Operator = formatNode(node.operator);. Now, the operator is not a node. This is actually just a string. So if I try to do this, yeah, at times still have a failure. I'm not gonna scroll up and find what the exact error was.

[00:12:21]
But the point is, this one, we want to not formatNode. This one just is a string. So what we actually want is just that and hooray, now all of our tests pass. For extra credit, you can turn it into a transpiler, just kidding. But if you want to, you could, actually, after the workshop [LAUGH] go do the thing we talked about, add some piece of syntax sugar, but that requires changing the parser and stuff, so.

[00:12:41]
Are you afraid of blowing the stack with recursion? [LAUGH] Are you afraid of blowing the stack with recursion? Yeah, right. So this is where formatNode is potentially calling from an array literal, etc. I mentioned this is recursive. It's a very good question. So blowing the stack would basically mean what if this recurses so many times that you run out of stack space and then your compiler crashes?

[00:13:01]
That can't happen. That's a real thing. I have definitely used production grade, very fancy, millions of dollars have gone into them compilers where I managed to, in some cases, on purpose cuz someone didn't believe me that I could induce one. But also, in one case accidentally, gotten a stack overflow from the compiler.

[00:13:17]
And so if you want to, it is possible to avoid this happening. The trade off is that it makes your code a lot more complicated. So basically, if you want to make your compiler stack safe, where you're not worrying about blowing the stack, what you do is you essentially simulate the stack where you basically make an array, for example.

[00:13:33]
And instead of calling the function recursively, you have it in a big loop, and you just are pushing things onto the stack. and then when you finish your operation, you pop them off the stack just like functions do, just like much worse ergonomically, and then the trade off being that although the ergonomics are worse now, you don't have to worry about blowing the stack.

[00:13:49]
So, am I worried about it? Not for this example, certainly. [LAUGH] But I am actually worried about it on production grade rock compiler. And that is actually something I do try to use the less ergonomic, but stack save option. But I think I'm kind of in the minority on that.

[00:14:04]
I think most compiler authors would just be like, just for courage. Nobody's gonna write something that has that much depth, anyway. If they write a gazillion nested array literals, they get what's coming to them. But I don't know, so I guess I am unusually worried about that, but most compiler authors aren't.

[00:14:21]
Great question. Any other questions before we dive into name resolution? I'm curious, do you know if any languages that implement that kind of protection against the stack blowing up, so to speak? Go comes to mind. So the question was, do you know of any languages that protect against that sort of stack blowing up?

[00:14:41]
So there's trade offs here too. So Go has this thing, I know it was originally called segmented stacks, and I think I read about them moving to a different approach, but I'm not sure if it's still called segmented stacks or not, or even if they did move to a different approach.

[00:14:53]
But basically, without going out a huge rabbit hole about [LAUGH] stacks. The basic idea is that if you want, you can make it so that every single function that when you're in the code gen phase, when you code gen your functions, you can insert a little bit of code at the beginning and a little bit of code at the end that basically says, okay, I know exactly how much stack space I'm going to use.

[00:15:14]
I know how many variables I have, what's the maximum amount of stack space I can use? And before I start running my actual function body, I'm just gonna check and see if we got enough. And if we don't have enough, and you track that with these are all your uses of stack space.

[00:15:26]
And if we don't have enough, I'm gonna do this magical actually allocate some new memory on the heap, and I'm gonna move all my stuff over to that and start running over there. Now this does mean that you can effectively guard against your stack ever running out of stack space.

[00:15:40]
It also means that if you recurs indefinitely, you're just gonna run out of heat memory instead, which is less likely to happen but even scarier. [LAUGH] Because you run out of stack memory, the consequences are much less severe than running out of heat memory. Plus it's probably gonna take a long time to run out of heat memory, so it's gonna be the symptom will be, your program hangs for a long time and then super blows up, whereas a stack overflow happens, it gets resolved much faster.

[00:16:05]
So, trade-offs. [LAUGH]

Learn Straight from the Experts Who Shape the Modern Web

In-depth Courses
Industry Leading Experts
Learning Paths
Live Interactive Workshops

Start a 5-Day Free Trial