Building Your Own Programming Language

Lexing White Space & Parenthesis

Steve Kinney

The "Lexing White Space & Parenthesis" Lesson is part of the full, Building Your Own Programming Language course featured in this preview video. Here's what you'd learn in this lesson:

Steve gives a preview of the first exercise and the repository, and dives into creating a tokenizer that breaks a string into an array of tokens.


Transcript from the "Lexing White Space & Parenthesis" Lesson

>> Steve Kinney: So we'll have something along these lines here. It will be fun if you mess up, because what will happen is either your test suite will take longer than you think it normally should, or you can really make Node unhappy and get some nice dumps of the stack, right?

It's not pretty, right? And so, that's a thing. Hence, we're gonna have some unit tests, so if things change under us, we kind of know. And then some of this is, for instance, okay, if we see whitespace, we don't care about whitespace. Right, well, we will write this code together in a second.

This is kind of the preview. But if we see a number, well, yeah, we do need that, right? We want the number. We put that into our token bag, and we move along. So we're gonna switch over. I'm going to jump into the repo. I'm going to write a little bit.

And then I'm going to pass it off to you to write a little bit, and then we'll do it together. And it'll be a grand old time. So I'll give you kind of a tour of the repo as well. Most of the code is in the source directory, right?

I've done the require statements, but it's a bunch of no-op functions in here currently, and a series of skipped tests, right? And so we'll start turning those on as we go through the day, starting to get a few things working. So right now I'm in this tokenize, which is what we're going to call our lexical analysis.

We have some helpers here, and they're pretty straightforward. I mean, they do effectively what they say on the tin. But let's go ahead and take a look. I've got a tiny little bit of regex to do some of the very simple stuff, like figuring out if something is a letter or whitespace. This will take care of tabs as well as spaces and everything along those lines.

Numbers, we are actually gonna use the operators. These are just some helpers to make our code a little more clear. Seeing this regex all over the place with the test on it makes an already complicated piece of code look far worse, right? Giving yourself good abstractions sometimes is a sign that you love yourself.

It's the same thing: this will be easier for future me to read than the other side of this, so on and so forth, right? So these are the helpers that we're gonna use for the lesson, and we're gonna have some other helpers through the course of the day.
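The helpers described above might look something like the following sketch. The names `isLetter`, `isWhitespace`, and `isNumber`, and the exact patterns, are assumptions for illustration; the course repository may define them differently.

```javascript
// Hypothetical helper names and patterns; the course repo may differ.
const LETTER = /[a-zA-Z]/;
const WHITESPACE = /\s/; // \s covers spaces, tabs, and newlines
const NUMBER = /^[0-9]+$/;

// Wrapping each regex in a named function keeps `.test(...)` calls
// out of the tokenizer itself.
const isLetter = (character) => LETTER.test(character);
const isWhitespace = (character) => WHITESPACE.test(character);
const isNumber = (character) => NUMBER.test(character);

console.log(isWhitespace('\t')); // true
console.log(isNumber('7')); // true
console.log(isLetter('7')); // false
```

Reading `isWhitespace(character)` in a conditional says what is being checked without making the reader parse a regular expression.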

But I greatly encourage you to make helpers early and often, right? Sometimes we'll use a little less than I would normally use, cuz I wanna make stuff clear. Right, and it's like, what is this magical helper, right? But I think about a lot of, like, even when we create the kind of objects that are the data structures for the information about a given token.

Or piece of, I'm sorry, syntax tree, right? I might normally use a function to actually create that, so I know I'm consistently getting the right properties. I don't have to worry about mistyping a given constant name or something along those lines, right? And so we'll do a little less of that than I would normally do, so it's not so abstract that it seems like black magic, but we will use it where it doesn't take away from the clarity.
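As a rough illustration of the factory-function idea, here is a minimal sketch. The names `TokenType` and `createToken` and the type strings are hypothetical, not taken from the course repository.

```javascript
// Hypothetical constants and factory, not taken from the course repository.
// Referring to TokenType.PARENTHESIS means a typo surfaces as `undefined`
// in one obvious place instead of a silently misspelled string.
const TokenType = Object.freeze({
  PARENTHESIS: 'Parenthesis',
  NUMBER: 'Number',
});

// Every token is built the same way, so it always has the same properties.
const createToken = (type, value) => ({ type, value });

const open = createToken(TokenType.PARENTHESIS, '(');
console.log(open); // { type: 'Parenthesis', value: '(' }
```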

But I encourage you to probably go even farther than I will go, cuz I scaled it back a little bit from what I would normally do. So we have those. But tokenize is where we're going to get started, and we saw a little bit of this before: tokenize is going to take that input string. And also, I'm going to use JavaScript today; I would probably use something like TypeScript normally, right?

But I didn't want to add the, hey, we're gonna do a bunch of recursion and nested while loops, let's also add strong types in there, right? But just basically trying to keep your sanity and stuff along those lines is super helpful usually. So we know that we need this set of tokens that we're gonna keep track of.

And we know that we're going to start a cursor, right? You could slice things off of the string and do a whole bunch of stuff like that. I've done that as well. I have found that, as I've done some of the real-world ones, I get the least amount of headache by just keeping an index of where I am.

Cuz that's easy. There are cases where you might wanna move back sometimes, right? Our Scheme language is pretty straightforward and you won't need to do too much of that. But there absolutely are cases where you might want to look ahead or look behind. We'll also have some utilities that just make it a little more clear, right?

Some things instead of, like, because I don't know about you, but I can never remember what array shift and array unshift do. I just know they do the opposite of what I think they do, and then I get all confused. So I made some helpers that we'll use for that kind of stuff too.
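For reference, here is what `shift` and `unshift` actually do, followed by a sketch of the kind of intention-revealing wrappers mentioned above. The helper names `peek` and `takeFirst` are hypothetical; the course's own utilities may be named and shaped differently.

```javascript
const items = ['a', 'b', 'c'];

// shift removes the FIRST element and returns it.
const removed = items.shift();
console.log(removed, items); // a [ 'b', 'c' ]

// unshift adds to the FRONT and returns the new length.
const newLength = items.unshift('z');
console.log(newLength, items); // 3 [ 'z', 'b', 'c' ]

// Hypothetical wrappers whose names say what they do.
const peek = (array) => array[0]; // look at the first item without removing it
const takeFirst = (array) => array.shift(); // remove and return the first item

console.log(peek(items)); // z
console.log(takeFirst(items)); // z
```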

So then we'll do while cursor is less than the input length. So when we get to the end of the string, that's how we know we're done tokenizing. And obviously, you're in charge of making sure that you are iterating that cursor, because if you don't, it's gonna be an infinite loop.
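The skeleton described so far can be sketched as below. This is only the outer shape, with a placeholder body, assuming the variable names `tokens` and `cursor` used in the narration.

```javascript
const tokenize = (input) => {
  const tokens = [];
  let cursor = 0;

  while (cursor < input.length) {
    // Placeholder body: a real tokenizer inspects input[cursor] here
    // and decides whether to push a token.
    cursor++; // without this, the loop never ends
  }

  return tokens;
};

console.log(tokenize('(1 2)')); // []
```

Nothing is pushed yet, so the result is always an empty array; the point is only that the loop terminates once the cursor reaches the end of the string.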

[LAUGH] And there are some tricks. It wasn't appropriate for this language, but a trick that I'll use a lot of times is with a for loop: you don't have to give it every argument. So you could do something like for, cursor is less than input length.
>> Steve Kinney: And then.

>> Steve Kinney: Like that, and so you don't have that initial setup in there. And you can control it like that as well. I use this in the Handlebars syntax; it just didn't work for what we're doing today. So I have this cursor, and at the end of the day, when everything is said and done, we are gonna return that array of tokens.
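The for-loop trick mentioned above relies on the fact that all three clauses of a JavaScript `for` statement are optional. A minimal sketch, with `seen` as a hypothetical stand-in for whatever the loop body would do:

```javascript
const input = '()';
let cursor = 0; // set up outside the loop
const seen = [];

// No initializer and no update clause: only the condition is supplied,
// so the body controls when (and by how much) the cursor advances.
for (; cursor < input.length; ) {
  seen.push(input[cursor]);
  cursor++;
}

console.log(seen); // [ '(', ')' ]
```

This behaves like the equivalent `while` loop; it is a matter of taste which form reads better.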

So now what we really need to do is, kind of, all the happiness lives in here at this point. So I'm gonna start us with the simplest one. Let me take a look at the tests, and let's get some of those up and running. So, for instance, this is kind of the does-my-test-suite-work kind of test, right?

Is it giving me an array at all? These are those tests, like, I am guilty of writing expect true to be true to make sure I've even configured Jest right. These are the tests that I would likely delete after I had more tests, right? Cuz they're not adding a ton of value, but they are pretty good for now.

So let's get that going. And we'll start with our let's make sure that works.
>> Steve Kinney: And so I'm going to do npm test. And then one trick you can do with Jest is, if you just type in any characters that are in the name of the test file, it will basically do a regex match.

Because I don't need to go evaluate a bunch of files; I've skipped testing them right now. Right, so we know that this is the tokenize test, so it's npm test tokenize, right? And it will look and check it out. Cool, sweet. So I've got a problem, which is I don't know how to type.

So we'll solve that; there's no brackets here.
>> Steve Kinney: Sweet, and so we've got our first passing test. Let's go get ourselves a failing test. So the first one is: should be able to tokenize a pair of parentheses. I mean, if we're writing a Lisp, we should be able to tokenize a pair of parentheses, which is, if we give it the input string of, spoiler alert, a pair of parentheses, we ought to be able to get those out, right?

Pretty simple. Nothing particularly special. So how would that look? One of the things that I will likely do is, I could call this each time, input, and use the square brackets with cursor, like this.
>> Steve Kinney: But you can see how I'm already messing up as it is, right?

I don't want to do that. So what I'm gonna do is just store that character in a constant. Because I'm going to be checking for parentheses, I'm going to be checking for numbers, I'm going to be checking for letters. I don't want to type this over again. So we'll do that.

And then we've got that isParenthesis helper, which is basically able to check for opening and closing parentheses, so we'll do that: if isParenthesis. I went back and forth on whether I was gonna pluralize or not. All right, if that's the case, we go into our tokens.

We push it on with some metadata, right? So the type is gonna be a parenthesis.
>> Steve Kinney: All right, and then the value will be whatever character it was, right? Cuz we're gonna need to know later whether it's an opening or closing parenthesis. That is data that we need.

Other stuff, we might not necessarily care, right, but in this case, we do need to know which way the smiley face is pointing. Once we've done this, we can go ahead and just say cursor plus plus, because if we do not iterate that cursor, you will be waiting a very long time.

And then what I'm going to do is continue. You end up using old-school JavaScript stuff that you forgot was in the language. Continue allows you to just break out of this pass of the loop and go to the next turn of the loop; we've iterated cursor ourselves. This will kick it off because, like, what will happen is we're basically gonna let it.

If this isn't the case, we're gonna have another conditional checking for things, so on and so forth. Awesome, and we'll go check it out.
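Put together, the parenthesis branch described above might look like this sketch. The `isParenthesis` implementation and the `'Parenthesis'` type string are assumptions, and the `throw` at the bottom is a safeguard added only for this example so that unhandled characters cannot cause an infinite loop.

```javascript
// Assumed helper: true for either an opening or a closing parenthesis.
const isParenthesis = (character) => character === '(' || character === ')';

const tokenize = (input) => {
  const tokens = [];
  let cursor = 0;

  while (cursor < input.length) {
    // Read the current character once instead of repeating input[cursor].
    const character = input[cursor];

    if (isParenthesis(character)) {
      // Keep the value: later stages need to know opening vs. closing.
      tokens.push({ type: 'Parenthesis', value: character });
      cursor++;
      continue; // skip to the next turn of the loop
    }

    // Sketch-only safeguard, not part of the lesson's code.
    throw new Error(`Unhandled character: ${character}`);
  }

  return tokens;
};

console.log(tokenize('()'));
// [ { type: 'Parenthesis', value: '(' }, { type: 'Parenthesis', value: ')' } ]
```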
>> Steve Kinney: All right, we have two passing tests. Another one I will show you, just to show you the differences: we don't care about whitespace.

Whitespace is there to show us that we can move on to a different token. So in that case we need to iterate the cursor, but we don't need to store it anywhere. So we can say if isWhitespace, and you can already see how this would be a lot gnarlier if I didn't give myself the helpers, isWhitespace of character, really just iterate that cursor.

>> Steve Kinney: And continue along, right? So go ahead, turn that one on. We'll verify that it works. So this one has got whitespace, which is kind of hard to see because it's invisible. We should get back just an empty array of tokens if all it has is whitespace. So all of that whitespace should be completely ignored.
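Adding the whitespace branch to the loop gives something like the following sketch. As before, the helper implementations and the `'Parenthesis'` type string are assumptions, and the final `throw` is a safeguard added only for this example.

```javascript
// Assumed helpers; the course repo may define them differently.
const isWhitespace = (character) => /\s/.test(character);
const isParenthesis = (character) => character === '(' || character === ')';

const tokenize = (input) => {
  const tokens = [];
  let cursor = 0;

  while (cursor < input.length) {
    const character = input[cursor];

    if (isParenthesis(character)) {
      tokens.push({ type: 'Parenthesis', value: character });
      cursor++;
      continue;
    }

    if (isWhitespace(character)) {
      // Whitespace only separates tokens: advance, store nothing.
      cursor++;
      continue;
    }

    // Sketch-only safeguard, not part of the lesson's code.
    throw new Error(`Unhandled character: ${character}`);
  }

  return tokens;
};

console.log(tokenize('   ')); // []
console.log(tokenize('( )').length); // 2
```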

>> Steve Kinney: All right, so here we've got exercise one. We've got digits. Now, an implementation detail that we decided upon, and by we, I mean I, was that we are actually gonna store them as JavaScript numbers. Right, so they're in a string right now. So it will be the string two, the string three.

So you're on the hook for either parseInt-ing them, or putting a little plus in front if you want to be fancy and just use the fact that JavaScript's weird to make your life easier. Either one of those is fine with me. So we want, if we get the input one, two, to go ahead and get some numbers.
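The two conversion options mentioned above behave the same on clean digit strings. The sketch below shows both, plus one case where they differ; the `'Number'` type string in the sample token is an assumption.

```javascript
const digits = '2';

// Option 1: parseInt with an explicit radix.
const viaParseInt = parseInt(digits, 10);
console.log(viaParseInt, typeof viaParseInt); // 2 number

// Option 2: unary plus coerces the string to a number.
const viaUnaryPlus = +digits;
console.log(viaUnaryPlus, typeof viaUnaryPlus); // 2 number

// They disagree on trailing non-digits: parseInt stops at the first
// invalid character, while unary plus rejects the whole string.
console.log(parseInt('12px', 10)); // 12
console.log(+'12px'); // NaN

// A number token would then carry a real JavaScript number.
const token = { type: 'Number', value: viaParseInt };
console.log(token); // { type: 'Number', value: 2 }
```

Since the tokenizer only ever collects characters that passed a digit check, either option should be safe for this exercise.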

We also want to be able to collect some letters as well.
