Building Your Own Programming Language

Lexing Characters & Strings

Steve Kinney

Steve Kinney

Building Your Own Programming Language

Check out a free preview of the full Building Your Own Programming Language course

The "Lexing Characters & Strings" Lesson is part of the full, Building Your Own Programming Language course featured in this preview video. Here's what you'd learn in this lesson:

Steve moves on to the second exercise, and lives code a test that looks for characters and strings and tockenizes them.


Transcript from the "Lexing Characters & Strings" Lesson

>> Steve Kinney: So let's let's finish up this lecture. What we need to do is, we need to get our quotes working, or I'm sorry, I was like getting the multi characters names working. And then, we'll go ahead and get strings as well, we should get that white space one for free.

So in this case, it's pretty much the same. We can just say, we'll call it a symbol in this case, it's not a number. Is that first one that we came across since we know that that one's fair game. And then, we'll go ahead and say, while and we're actually when we get to the strings, I don't have a like a really great way of they have a great way of peeking but not of pulling the next one off.

So I'm just using the kind of like regular syntax, but when we get to arrays, we'll actually use a helper for this as well. So if letter, I will look at the input, well, as well as cursor, right? And then we'll just say, symbol +=,
>> Steve Kinney: Where the cursor is now.

>> Steve Kinney: Cool, all right, and then once we have that done, we just push it on the symbol instead of the character.
>> Steve Kinney: And that last one where the while broke should get us onto the white space.
>> Steve Kinney: All right, sweet. And strings are just a little bit different in a very subtle way which is that they start, well, as soon as we hit a double quote, and they keep going indiscriminately until we hit the other double quote.

It's almost the opposite, all right? Are you a letter, right? For the previous one. Cool, join me or your number for that multi digit number, join me. For string, it's actually, hey man, anything can come onto this string until we get to the closing quote and then we're done here, all right?

So it's just basically the opposite. So we can do, I have a isQuote helper cuz our language only supports double quotes, [LAUGH] that's problematic for you. So if it is a quote, cool, we know it's time to start a string, I'll say lead string. Now, we don't want that first quote, right?

We do like, for the letters like hey, you're the first letter, you're the first number, I need you, join me. This one are like, we don't, like our language doesn't care about quotes, it doesn't care about white space. It's throwing these things in the garbage can as part of the tokenization.

It needs them to figure out where the token starts, right? But notice how we're getting into the type. There's definitely trade offs that we're gonna make throughout the day, right? Like if we wanted to purely the interpreter route, like just keeping primitives the whole way through and not making these objects that have some metadata, it's super easy.

Especially because of our list, which is a set of lists. And like you just put the values in an array, it's super easy if we're going to get the interpreter only route. But we also wanna get comfortable with ASTs, and compilers, and stuff like that. So we're kind of in a way,-because we wanna have the best of both worlds, we have to make some trade offs.

So yeah, we kinda get this metadata route. And so we'll say, while, now, this one's slightly different, not isQuote, right? So while the next thing is not a quote, because our strings can have effectively anything. And we're not good like, yeah, like if we're doing this, if again, we were spending all week writing Alexa all day even like we would like let back slashes, escape, another double quote, or something along those lines, we're not going to do that.

Cursor are like you should, but we're not going to.
>> Steve Kinney: +- input[cursor] maybe I spell input, right? Cool, and then we go ahead and,
>> Steve Kinney: Push on, this one's gonna be a type of string,
>> Steve Kinney: Value, of the string that we just collected. Now, this one again, slightly different, before, we knew that we were at the end of the multi digit number or the end of the identifier slash name because we hit white space.

This point, we actually hit the closing quote, right? Which we wanna throw in the garbage can, right? So in this, case we'll go ahead and just move onto one more character as well.
>> Steve Kinney: And this is why it's useful to have the cursor. I think that's to answer your question earlier, Jason, like the fact that we have complete control over the index of it.

And like we don't have to play with like, if we're doing this and a reduced, it would be a little trickier.
>> Steve Kinney: And I say that as somebody who started with like I wanna use a reduced. [LAUGH] I say that from from having tried and failing miserably at it.

All right, so that should be everything. Let's unskip those tests and see if I made any terrible mistakes while I was talking.
>> Steve Kinney: And we can pretty much unskip them all at this point.
>> Steve Kinney: All right, we have.

Learn Straight from the Experts Who Shape the Modern Web

  • In-depth Courses
  • Industry Leading Experts
  • Learning Paths
  • Live Interactive Workshops
Get Unlimited Access Now