Type Unification

zed.dev

Lesson Description

The "Type Unification" Lesson is part of the full, Write a Compiler That Understands Types course featured in this preview video. Here's what you'd learn in this lesson:

Richard covers type unification by demonstrating how operators like plus and times require their operands to unify with specific types, such as numbers. He also explains how the unify function resolves type variables by traversing type links, imposing multiple constraints simultaneously, and detecting type mismatches that get reported at compile time.

Join Now or Start a Free 5-Day Trial

Preview

Transcript from the "Type Unification" Lesson

[00:00:00]
>> Richard Feldman: Support for as previously, mentioned that we're gonna talk about type unification. So first, we're gonna do some unifying of operators, operators in the sense of like plus and times binary operators. We're talking about type mismatches. So far we have not done anything that can possibly cause a slight mismatch, but now we will.

[00:00:16]
This is aired like support like a part-time, and then finally, the actual unify function itself. The function is gonna drive a lot of what we're doing throughout the rest of this process. It's a very, very important function, unify. Okay, so type unification basically is a way to say these two types should be identical.

[00:00:34]
I'm choosing to use the word identical here because that's a common one in the literature, but honestly, there's no one word that I found that is like, really, really a good one-to-one correspondence of what unification is, the best answer to like, what is type unification say? It says that these two types are unified according to the type unification algorithm [LAUGH].

[00:00:54]
So we're gonna see what the actual code looks like, you can kinda just walk through the conditionals and sort of understand what it's doing that way. But conceptually kinda fuzzy if you sort of squint and it's a little bit hazy, it's kinda like these two types should be identical, but it does mean something more specific than that, we'll get into what that means.

[00:01:09]
And a very important thing, compared to our previous section, where we were only dealing with constants, is that this is a way to impose multiple constraints on types. So previously, we just had a equals hi, b equals a, c equals b. At no point were any of those types in the database impacted by multiple constraints at once, it was always just like handing off to someone else.

[00:01:28]
So now we're gonna start talking about scenarios, and there's gonna be a lot of them in a type checker where you have types that are being constrained by multiple things, and the type inference is not just like, I'm just handing off to this thing, but it's like, well, this thing is saying this, and this thing is saying this, and sometimes that there's a happy agreement there.

[00:01:45]
We're like, good, everybody agrees, we're all set, and then, of course, sometimes they disagree, and then that's when we have a type mismatch. And a type mismatch error is basically the bread and butter of the type of reporting that our type checker or type inference system is going to do as far as build time.

[00:02:00]
Okay, so start off by unifying some operators here. So we'll start off with our classic const a = hi, const b = a, but this time, instead of doing a const c equals like we did last time, now we're gonna actually run an operator, we're gonna do a times b.

[00:02:14]
Now operators can be implemented in different ways, in different languages. The way that we are choosing to implement them in this language is that it's basically like a hard coded built into the language thing where we know that the rule is, anytime you see one of these stars, we're saying both things on either side of it have to be a number, and also the whole thing is gonna evaluate to a number.

[00:02:33]
There's other languages where you have custom operators and stuff like that, where this might not be the case, but again, for our simplified subset of JavaScript, that's the rule. It's just like we know right now at compile time, this has to be a number, this has to be a number, and the whole thing is gonna return or evaluate to a number.

[00:02:46]
Given that, what we want is we want our type checker to infer two type mismatches here. The first type mismatch is because a is set to be a string, and we said this is supposed to be a number, but it's not a number, it's a string. And the second thing is that b is supposed to be a, so again, using type inference, we should know that b is also a string, and therefore b should get its own type mismatch, because b is also not a number, and both of these are supposed to be numbers.

[00:03:08]
So that's what we're trying to aim for here, end up with two type mismatches. So we're gonna start by doing exactly the same thing we did last time. It's gonna be a recurring theme throughout the course, is just go through and assign type IDs to all these things.

[00:03:19]
So a gets 0, hi gets 1, just like last time b gets 2, a gets 0. And then, once again, we're gonna do this sort of look-up thing like we did right here, where we say, okay, well, once we come down here, what's in scope? I see a is 0, and then b is, once again, gonna be 2.

[00:03:36]
So initially, when we just have 0, 1, and 2 in here in our type's database, everything's null, just like last time. We're gonna add a little three later on that is gonna be the type of the entire multiplication operation, like what it actually returns. But for now, we're just gonna start off by just dealing with the types of 1 and 2, so that's a, b, and hi, so 0, 1, and 2, a, b, and hi.

[00:03:59]
So a once again is 0, b is 2, and then this is a 3. Okay, so at the beginning, again, we have exactly the same thing as last time. We first go through and we say, okay, a has a relationship with this thing where it's similar to it, and basically whatever type this ends up having, that is exactly the type that id 0 is gonna have.

[00:04:19]
And then we proceed on to 1, and we say, it's a string literal, therefore, it's type of string. So far, so good, and then we also continue on with, okay, this thing is whatever that type is. And then the next thing that we know is that for sure this is a number.

[00:04:31]
And this, much like with a string literal, is just hard-coded. So in exactly the same way that as soon as we see a string literal parse tree node, we're like, that its type of string. Anytime we see a multiplication operator, we're like, it evaluates to is a number.

[00:04:45]
So we can just write that down without having to do any relationship stuff. Okay, so that's the basics. So far, at this point, we've basically just done the same kinda stuff that we did in the previous section, just with a slightly different example. Now let's make it more interesting.

[00:04:59]
So what we're gonna do next is we're gonna do what's called unifying both the operators arguments that are going into the star with the number type. So unifying essentially means we're going to call this unify function on each of them, passing them as one of the arguments and then something else as one of the other arguments.

[00:05:14]
In this case, it's gonna be zero and number and then two and number. So what does that look like, and what does that actually do? So conceptually, I'm gonna start off with just sort of like a high-level, kinda fuzzy description of the basics of unification. Then we're gonna get into code, which is gonna be much more detailed and give you a much sharper picture of what this is doing.

[00:05:33]
But at a basic level, if we're zoomed out, if I give unify two nulls, that is to say, two entries in the database that have not been initialized at all. They're just kinda like still in that very early, we just went around assigning things, and nobody really knows what these are yet state.

[00:05:47]
If you give it two nulls, that says, okay, well, one of these nulls is gonna turn into a symlink. I'm not saying exactly which one it is, this arrow, by the way, is not returned. This is like, what it's gonna do, if that makes sense. So it's gonna symlink one of them to the other one.

[00:06:01]
The order is not necessarily that critical, the critical thing is the relationship. It's saying this thing, anytime you ask it what it is, it's that other one, and now we have linked the two together. Actually, it's pretty important that you don't symlink them to each other because then you're like, hey, what are your values?

[00:06:14]
You have to go ask that one. Hey, what's your value? You have to go ask that one, and then you get an infinite loop. So always only ever wanna have one symlinked to the other, not symlinked to each other. The other common scenario that you'll see is basically you give unified two concrete types that are exactly the same.

[00:06:28]
So basically this one's like, hey, I have two strings. Here's a string, and then here's another string. This is okay, well, great, clearly, both of you are strings, so just stay being strings, and we're not gonna make any changes. Same thing if you're both numbers, just again, no change necessary.

[00:06:41]
Obviously, the answer is these two things unify to be a number, so no big deal. And then of course, we have the case where one of them is a concrete type and the other one is a different concrete type, and that's gonna be a type mismatch where we actually reporting the error to the user saying, hey, according to the unification algorithm, you said that this thing was a string and it's also a number, those are very much not identical.

[00:07:01]
If we're thinking about this as something that's making these types identical, therefore we're gonna report a type mismatch and that's gonna be the game over for those types. Okay, so let's put this into practice. So at this point, we're just gonna focus on these two. So these other ones, they still have those types associated with them.

[00:07:17]
The type ids associated with them, but they're just kinda hanging out in the database. And the main thing that we're gonna look at is unifying these two things with the number type. So essentially, from a function calling perspective, again, in the actual code, as you'll see, this will actually be like a JavaScript object with more stuff going on it.

[00:07:33]
But if we just think about this conceptually as we're calling the unify function, passing a number, and then one of these things being a string, that's gonna be a type mismatch. Same thing with the other one, it's gonna be a type mismatch. So the whole problem here is that if I look up what is the type of a, a is id 0.

[00:07:52]
Id 0 says, go ask 1, 1 says I'm a string, b says I'm an id 2, id 2 also points to 0, which points to 1, which points to string. And so when unify operates on its two values, it is going with what is at the end of the line here, what is the ultimate value?

[00:08:08]
Ultimate in the sense, not the ultimate, in the sense of the last one in the chain. If you completely resolve these symlinks, you follow them all the way through, what is the actual type there? If you were hovering over this in your editor and you asked, what is the type here?

[00:08:22]
That's what unify is getting. So unify is not being given, well, okay, technically it is being given, like the zero and the two, but the first thing it does is just turn them into, just traverse all the symlinks and get me either a concrete type or a null, does something where I'm like there is no further hops to traverse.

[00:08:38]
I have traversed as far as possible, that's what I'm gonna be considering. And that's why, when it gets these, it's like, okay, I see a number and a string, I see a type mismatch, I see a number and string, I see a type mismatch. Even though, if you look at each of these, it doesn't say number right here or string right here.

[00:08:52]
If you follow the symlinks, it does end up saying string, and that's why it ends up unifying these two to a type mismatch. So for that perspective, great, our goal of inferring two type mismatches, great success. We have succeeded [LAUGH] in inferring two type mismatches as desired. So now let's go take a look at the actual code here, because this is the part that I think is the most important to understand.

[00:09:14]
That was the sort of high-level overview, conceptual. Here's how we get the type matches out of it. But the specifics of this algorithm are very important. So, to recap, we started off with turning things into tokens. We've got our source code right here, and we build up multiple different tokens here.

[00:09:29]
So this one is type const, this one's type identifier. Then we turn those into parse tree nodes, which is just a nested form of the tokens. Then we turn those into typed parse tree nodes where we added the typeId. And then we started building up our types database with all the nulls at first, and then putting in the relationships.

[00:09:44]
And then finally, we have the actual code that builds those up. So with all of that in mind, here is where we have ended up code-wise. We've done all that other stuff. We have completely populated our types database right up to the point of saying we have these two additional constraints, which is that the thing to the left of the star needs to be a number, and the thing to the right of the star also needs to be a number.

[00:10:07]
And I did mention that when we are thinking about these things, it's really important that we're not unifying based on what's actually in the row in the database, but rather after you traverse all the symlinks, what is that value? So the very first thing that unify does, cuz it does take two type Ids, is it says, look, just go, I'm gonna call this function, resolve symlinks and compress just so you know everything that it's doing [LAUGH].

[00:10:28]
[COUGH] And that's basically saying give me the typeId, I'm gonna turn that into an actual type, so a concrete type or a null. So this point is not an ID anymore. This resolves symlinks and compress function. Hopefully, you can guess what it does. It resolves all the symlinks and then does path compression.

[00:10:44]
This is the first thing that it does, is like, do that for both a and b, and this is exactly the reason that it is so important for performance that you have the path compression. So if you imagine, every single time we're going through and calling this unify function, which happens all the time.

[00:10:57]
If you don't have the path compression on, you're traversing those long hops over and over and over again. So it's a huge deal for performance that we're able to compress those down and say, yep, just one hop, no big deal. Okay, next thing we do is we start looking into different scenarios.

[00:11:11]
So the first scenario is basically like, okay, let's suppose that even after resolving those symlinks, a is still null, which means that, basically, we don't have any information about a. The term for this is an unbound variable. It's basically like, yeah, we put it in there, we wrote down, this is your ID, but in the database, it's still null.

[00:11:28]
We don't actually know anything about this thing at all. So when that's the case, the first thing we wanna do is say, okay, we're going to give a some sort of type. For sure, we're gonna write something in there. So this is db[aTypeId], and this is why it's important that we have the TypeId in addition to the actual type.

[00:11:46]
We look at the actual type when we're considering what to do with it. But then we need the TypeId so that we know where to store things in the database if we wanna make changes, which Unify very often does. So we're gonna store something new in the database and we're gonna do a little ternary here.

[00:11:59]
So first we're gonna say, okay, was b also null, a was null, but was b null as well? If b was also null, then we're in that first scenario I talked about a couple slides ago where it's like, okay, both of these are currently null, we don't know anything about them, or at least we didn't, but now we know whatever they end up being, they are linked.

[00:12:15]
So we're gonna make a, b a symlink of b, could also just as easily put b as a symlink of a here, but we're gonna be a-focused because a was at the top of the [LAUGH] null check by coincidence. But then there's also the possibility that b was not null.

[00:12:28]
So the next scenario we're about to look at is a was null, we don't know anything about a, but b already had something going on, and not just a symlink, cuz we already resolved all the symlinks. So this means that basically, b was a concrete type at this point.

[00:12:41]
So at this point, we know, by process elimination, it couldn't be a symlink cuz we already resolved all the symlinks. We already had a check to see if it was null, so if it's not null, it's not assembling, it's gotta be a concrete type. So here, at this point, we're like, okay, cool, that means that basically, if we know that b is a concrete type, we know that it has the concrete field.

[00:13:00]
That's the shape of the object, it's like one of these concrete things. And basically this is just saying just duplicate it. We're just gonna say, okay, whatever b was, I mean, a was null, we are now just saying, boom, just duplicate that. I mentioned previously that this is also a place where it's totally reasonable if you want to, to just always symlink it.

[00:13:17]
And just say, well, yeah, I don't wanna potentially copy that big concrete thing. What if this is a gigantic concrete type? Totally reasonable, basically, I wanted to just kind of show the other way of doing this. But if you want to, you can do exactly the same thing we did in path compression.

[00:13:31]
And just say, yeah, I'm not even gonna bother with this conditional, I'm just gonna just always symlink a to b. And then, wash my hands of it, I'm all done. So, whether or not you have this conditional is kind of up to you in this particular scenario, but I do wanna note, and I did wanna show the conditional that what's actually happening here is pretty important.

[00:13:50]
It's not like you're just sort of uncritically just saying, a is just gonna be symlink to b like no matter what. If you're not aware of those scenarios, you could run into an edge case, like what if b is a symlink? We're not really looking at that here, but I think I may have actually, in one of the solutions, I think I may have actually done that as one of the edge cases of like, okay, well, it is technically possible that.

[00:14:12]
No, it's technically possible because we resolve the symlink, I'm thinking of something else. So yeah, definitely at this point, either b is going to be null or it's going to be concrete because if it were symlink, it would have been resolved. So, putting all that together, it's basically this is another one of those scenarios where it's up to your choice as a compiler author which design you wanna go with.

[00:14:32]
Either you can copy the concrete thing, or you can do what we did with in path compression and just say, yep, they're just symlinked, and that's how they're related.

Learn Straight from the Experts Who Shape the Modern Web

In-depth Courses
Industry Leading Experts
Learning Paths
Live Interactive Workshops

Start a 5-Day Free Trial