Parsing HTTP Requests Exercise

zed.dev

Lesson Description

The "Parsing HTTP Requests Exercise" Lesson is part of the full, C Fundamentals course featured in this preview video. Here's what you'd learn in this lesson:

Students are instructed to analyze the C program and fix the bugs described by the tests. The bugs include removing initial slashes from the path and ensuring the correct file path is specified.

Join Now or Start a Free 5-Day Trial

Preview

Transcript from the "Parsing HTTP Requests Exercise" Lesson

[00:00:00]
>> Richard Feldman: Let's do an exercise. All right, so as usual, we're gonna start off by just building the thing here. So it says, "Should be blog/index.html": "/blog/index.html". This is the world's cheapest unit test suite [LAUGH] we've got here. So it's printing out what it expects and then it's printing out what it actually is.

[00:00:18]
This one says, Should be blog/index.html, but it says /blog/index.html. So both of these are actually slightly incorrect right now. There's a little bit of an off-by-one error. So there's this old joke, you may have heard of it. The old joke is, what are the two hardest things in programming?

[00:00:34]
And they are cache and validation, naming things on off-by-one errors. This is an example of an off-by-one error. So we've almost got it right, but we just had this, there's this little slash that's sticking around and we wanna get rid of that. So that's part of the exercise, is fixing that.

[00:00:50]
Here, we have index.html is what we expect, but again, there's this, there's that /index.html. And then finally, this last one we want to be null, but instead it's /blog/index.html. So in this exercise, we're basically gonna be doing some bug fixing. Okay, so this right here might look familiar from the slides, this is exactly the same stuff we're doing there.

[00:01:13]
We have char *start and end. We've got our to_path functions taking in the request address, right? The for loop, where we start off at the beginning of the request, we keep going until we hit a space, just start++ along the way. If we encounter a null byte, which we're using the syntax for that, then really return null rather than continuing to read past the end of the string.

[00:01:31]
We skip over the space right there in the GET request, and then we keep going to end begins at start, and then we just keep progressing best there. And then finally, we have this little bit of logic there that's basically saying, okay, if we end in a slash, like we got to get for a /blog/, that's fine, just move the sort of where we're ending back 1 so that we're now on the slash.

[00:01:56]
If we didn't have a slash there, then we're gonna write 1 into the original string. And then finally, we do the memcpy, where we're basically overwriting whatever was in the request string originally to get the index.html plus the null terminator. So two things we're gonna do here, one is a little refactor.

[00:02:15]
So this is not gonna be a bug fix, but it's basically, hey, try refactoring out this +1 by modifying the 'if/else' above. So there's another way you could express this so that the math works out to be the same thing. We're just basically making end be in terms of something else, and that's a totally reasonable way to do it.

[00:02:31]
You could do it this way or you could do it where the math that we do for end is such that we don't need this +1 here because we've sort of accounted for it earlier. So let's start off by trying that out, and then also try not copying in the null terminator and see what happens.

[00:02:44]
[LAUGH] If you make the off by 1 error where you forget the +1 here, what actually ends up happening to the running program? Then we have the actual bug fix, so it says, hey, these three are currently not trimming off the leading slash, so fix that bug. We wanna modify to_path to actually fix this so that when we print out should be this with no slash in front, we actually trim that off.

[00:03:07]
Okay, then [LAUGH] Before fixing this next one, let's try moving it up to the beginning of main and just see what happens. So basically, take all this code and just move it up to the top of main and run it, and just see what happens. This is gonna be another example of a fun behavior that you can get in C if you do things in the wrong order.

[00:03:25]
[LAUGH] And yeah, then finally, we'll fix this one. So this is the one where we wanted to print out null, but instead it's actually printing out "/blog/index.html", and we're gonna fix that.
>> Richard Feldman: All right, so initially, we have some bugs here where we've got these leading slashes that we don't want, and then also this last one is sort of a separate case.

[00:03:50]
But before we do that, we're gonna actually do a little bit of a refactoring. So this is essentially like, okay, we want to refactor out this +1 by modifying the 'if/else' above. So essentially what we're saying here is like, okay, let's suppose that end at this point is not gonna be +1.

[00:04:08]
Well, that means that by the end of this if, we want it to have been one higher than it was otherwise. So in this case, basically, we can say, okay, well for this branch, let's just delete that. So instead of doing minus, minus, when we have this scenario, we're just not gonna do anything, and that will have the effect of making end be 1 higher by the time we get down here than it would have otherwise been.

[00:04:29]
Similarly, on this branch, we can say, okay, well, end[0] is '/', we still wanna do that, but then afterwards we want end to be one further than it would have been. So we can just do end++ here. So now when we rerun this, we should get exactly the same output, because what we did was essentially a no op.

[00:04:48]
And now we can actually just get rid of that else, and we can make this instead of being an if/else, we can make it just a single if condition, and that will do exactly the same thing. This is an example of how a lot of times in C, even things like little refactorings just often involve doing some arithmetic, and once again, they're a potential source of off-by-one errors.

[00:05:09]
When I'm programming in other languages, or at least when I'm programming in other languages and/or not in this type of a style, I don't tend to be as concerned about off-by-one errors because I'm doing a lot more instead of a for loop with ++, I'm doing a forEach or something like that.

[00:05:25]
But when we're doing this type of programming, off-by-one errors are just a lot more prevalent and a lot easier to mess up. If I forgot this ++ right here, suddenly I rerun this program and, wait, blog/index.html, what's going on there? So it's very easy to end up with bugs as a consequence of not quite remembering to do this or just getting the math off just a little bit.

[00:05:46]
So something to be aware of. Okay, so we did that refactor. Now we're going to try removing the +1 here. So this is essentially just now we're gonna copy over the index.html, but we're not gonna copy over the null terminator. And now essentially what we're seeing is just the rest of the original HTTP request that we're stomping over now becomes visible [LAUGH] And actually just keeps showing up until we run into the null terminator that happened to be in the original one there.

[00:06:12]
So again, no guardrails, if we don't remember to copy over the null terminator, we're just gonna keep reading that memory until we encounter our actual null terminator there. Cool, okay, and then now we come to our actual bug. So these three don't currently trim off the leading slash.

[00:06:27]
So what we wanna do is modify to path to fix them. So right now we're saying, okay, GET /blog, HTTP, blah, blah, blah, okay? And what this should be is blog/index.html without the slash in front of it, but when we run the program, it's /blog/index.html. So essentially, again, we have an off-by-one error here.

[00:06:46]
So how do we fix this? Well, basically what we wanna do is, we have our sort of start and end here. So the end is where we end up with, I'm sorry, start and end are basically both slashes. So the start is the slash at the beginning here, and then the end is the slash at the end of this thing.

[00:07:04]
So if we don't wanna deal with this anymore, basically all we need to do is we just need to move start over by one. So let's go ahead and do that. Basically, right here when we're skipping over the space, what we could do is, yeah, okay, we'll show it this way.

[00:07:17]
So let's say I did a second start++. So this is gonna do two things. One is that it's basically skipping over the space here. So that's good, that's what we wanted here. And the other thing that it's doing is it's also advancing end with this. So this is sort of, it means that end begins on start, and that essentially means that end is skipping over one thing.

[00:07:41]
Now, because all we're doing here is we're just looping until we see a space, this actually isn't changing the semantics of our program at all. Because the only case where this would matter is if we specifically happen to be, immediately after the start was either an null terminator like this or a space, then this could matter.

[00:08:00]
But since it's not either of those, this isn't a problem. So the other thing that we could do here is we could combine these two consecutive ++ into a += 2. That would also work exactly the same way. And voila, the bug is now fixed. Cool, so fix that off-by-one error, introduce some others, fix them, great.

[00:08:17]
So now, before fixing this now one, we're gonna try moving it up to the beginning of main and see what happens.
>> Richard Feldman: Okay, so all I did was move it up there. Boom, abort, this is one of our new fun no stack trace errors. We've seen segfault and bus error, and now we're seeing abort.

[00:08:36]
So what's going on there? And why is it that if I put this code at the end of the file, or sorry, at the end of the function, it works fine. And by the way, if you look at this, this looks like it should be pretty self-contained, right?

[00:08:46]
I'm seeing rec4, I've got rec1, rec2, rec3 for the different requests up here, rec4[]=, and then I just call to_path on rec4. This doesn't reference any of the rec1, rec2, or rec3s. All it references is, we're calling this function, we're parsing in this local thing, and we're not using any globals.

[00:09:05]
I mean, we're using some string constants, those aren't a problem. So what could possibly be happening [LAUGH] That this thing works? Well, it gets the wrong answer, granted, we wanted to see null here. But it works fine, it runs fine. But then as soon as I move it up to the top of the function, it pulls up the entire program.

[00:09:23]
Does anyone know what's happening here?
>> Speaker 2: Are we busting our stack?
>> Richard Feldman: Yeah, [LAUGH] Basically, we're busting our stack. So essentially what's happening here is that I mentioned way back in the slides about warning icon, and I said, we're not currently checking to see what happens if we're reading off the end of the file here, or sorry, if we're writing off the end of the string here.

[00:09:46]
So we could be saying memcpy into here, and the string that we are memcopying into, the request string might not actually be long enough to accommodate that index.html followed by the null terminator. In other words, the length of this, end plus this, might actually be less than what's remaining in that request string.

[00:10:06]
And if that's the case, we're just gonna uncritically just blow over it and just write into whatever bytes are there, and then we're just gonna find out what happens. So basically, this is designed to simulate that, right? So I've intentionally omitted the header that would normally be there if this were coming from a browser where it would have the protocol as well as at least one header.

[00:10:29]
I omitted those on purpose in order to cause this error. And so basically, what's happening here is that this is allocating these bytes on the stack. So it's like, here's an array in memory, in the function. And then this thing is going to actually write into those bytes, and it's gonna happily just write right off the end of those because there isn't enough room here for index.html.

[00:10:47]
So whatever happens to be in memory there, that's what's gonna get stomped on and overwritten. And the reason it behaves differently, if I put it at the end of the function versus earlier, is that whatever happens to be there after those bytes changes depending on whether I'm doing this at the beginning of the function or the end.

[00:11:04]
And the reason the program's blowing up is that, up here, when we get to those bytes, the processor or the operating system, it tries to do something when those bytes are overridden that results in some sort of operating system error. So I don't actually know specifically what those bites are doing or what the exact thing is.

[00:11:26]
You'd have to look at the actual assembly to try to figure that out. But it's probably something along the lines of, remember I mentioned at the very beginning that there was some assembly that we were omitting for the boilerplate of beginning of functions. And the boilerplate is basically stuff like, looking at what the arguments are, getting those out of the global variables into the registers, or storing them, storing where the function's gonna return to and things like that.

[00:11:49]
And that type of stuff, if you mess around with that and you override those bytes, you can very easily end up with a function that when it returns, instead of just happily going back, it just jumps to some wildly other part of the program. And that's probably what's happening here, is this, when it's at the end here, is overwriting some memory that appears to be pretty harmless and you can't even notice that there's a bug.

[00:12:08]
And if you move it up here instead, now it's overwriting bytes that are not harmless at all and are instead blowing up the entire program. This is another example of memory unsafety in C, a new flavor that we haven't seen yet, where you can just take code that does one thing and works, as far as you can tell, totally normally, and then as soon as you just move it up somewhere else, all hell breaks loose.

[00:12:26]
So you can very easily in C have something that looks like a normal, harmless refactor where you're literally just moving code around and all of a sudden your whole program blows up with no stack trace. Hey, welcome to C [LAUGH]. So what this is an example of is this is an example of something where you don't know that there's a problem because there are no symptoms normally.

[00:12:47]
When you run this program, okay, I mean, we know that there's a bug here, in that I expect it to be something other than what it's printing, but it kinda does look like it did the thing. There's no sort of surface-level symptom to give us even a clue that we're making this memory unsafety bug.

[00:13:01]
And this is one of the things that makes memory safety bugs so pernicious, is that in a lot of cases, there's a lot of different factors that go into reproducing it. Another thing I haven't tried, but if you played around with using different compiler optimization settings to build this, compiler optimizations will often rearrange things inside the function in ways that it assumes are harmless, assuming your code is actually doing the right thing.

[00:13:23]
Another possibility is that these things could end up with the memory being in such a place that this code works fine, that as soon as you run the optimizer, it blows up. And you might blame the optimizer and say the optimizer must have a bug in it, but it might not.

[00:13:34]
The reason that it might be blowing up is just that the bug was always there. You just weren't happening to see any symptom because the memory it was stomping over happened to be harmless until you kicked in the optimizations, and then it happened to be overriding memory that was not harmless.

[00:13:47]
So there's this whole can of worms to memory safety bugs, that doesn't this make you appreciate it, having a garbage collector? Yeah.
>> Speaker 2: You are debug these kinda things with stack canaries and like.
>> Richard Feldman: Yes, if that's what the problem is, yeah. So a stack canary, [LAUGH] Basically, it's a trick where you attempt to detect when you are making the mistake of, so the stack is a part of memory that's reserved for local variables inside of functions.

[00:14:13]
And as you call more and more functions, that section of memory just gets bigger and bigger. And a stack canary is where you basically say, okay, I'm a little bit concerned that I might have a bug. Or maybe some malicious attacker is trying to create a bug where we are writing way past the end of the stack or reading way past the end of the stack.

[00:14:28]
And what you can do to guard against this is you do, it's making a guard page, where basically one of the things you can tell the operating system is you can say, hey, this chunk of memory, please mark that as read only. So it's not innately read only like it is if it's something that's part of the executable, but please say this whole range of addresses, if I ever write to or read from those, I want you to cause a segfault.

[00:14:51]
I want that to happen because nobody should ever be writing in there. So you intentionally put one of those at the end of the stack and you say, this is segfault zone. If anybody ever sets foot in here, reads or writes, I guess it's only writes to, tries to write in there, don't make the write happen, just blow up.

[00:15:07]
And so then if you do that, that can give you more of a clue. Well, first of all, it can be a safety thing where you don't want the attacker to do something even worse than that. You'd rather blow up and let them get access to whatever they're trying to get access to.

[00:15:19]
Yeah, there's this thing called a stack-smashing attack, which is a whole [LAUGH] Thing that we're not going to get into. But yeah, that is one example of a thing you can do to not prevent the bug, but at least give you a clue of what's actually happening there.

[00:15:33]
Okay, yeah, and then the last thing we're gonna do is we're gonna fix this by handling the case where rec is too short to have "index.html" memcop'd into it. Hint, strlen returns an integer whose type is not 'int', but rather 'size_t'. This is a hint, because as you're fixing this, we do actually need to do this, we're going to encounter this compiler error.

[00:15:49]
So basically, what's happening here is we are saying, okay, I want to detect whether or not we are going to write over this, and if we are going to write over this, then let's blow up instead. So basically we've got our length of the thing that we're copying in and we've got our starting point.

[00:16:07]
And what we wanna do is we wanna say, okay, well, the last byte that we're gonna copy is gonna be end + this, right? So this is the initial address, and then we're gonna copy all the way to the end of that. So the condition we wanna detect is if all of that is greater than strlen(req), then we wanna just return null, basically.

[00:16:27]
There are other ways we can handle this. We could say, well, we'll just partially write it. But partially writing it in this case is not a good solution here, because what's actually gonna happen here if we do that is, we're gonna end up with essentially a path that was partially written, and it's not going to correspond anything on the file system.

[00:16:43]
So we don't wanna do that, we'd rather just error out here. Yeah, it's complaining about this. This is just a warning, but it's kinda fine. So what we can do if we wanna avoid that warning is, I think this is pointer, yeah, so we can just say this is, was it size_t I think.

[00:16:59]
Is that right? Yeah, cool, there we go. Oops, no, now they're all null. I must have made a mistake. So end + strlen(DEFAULT_FILE) + 1 is greater than strlen(req). I'm sorry. Okay, so this is actually, I wanna check if we are off the end of that. So what I actually wanna do is req + that, and this will also need to be size_t.

[00:17:27]
Yeah, so this is the start of the request plus the end, that gives us the last byte. And we wanna see if the last byte that we're gonna write is off the end of that. There we go. So yeah, now all of these are unaffected because in these cases, they are not trying to write past the end of the request.

[00:17:42]
This one was though, because we had that intentionally truncated request that's missing headers, and so therefore it tripped up the null case, and now we're returning null. And thankfully, now, this bug is fixed, and we can safely move this thing up to the top of the function and nothing bad will happen, because it's just gonna trip that null case again.

[00:18:00]
And the only difference is that now it prints at the beginning rather than at the end cuz we moved it up.

Learn Straight from the Experts Who Shape the Modern Web

In-depth Courses
Industry Leading Experts
Learning Paths
Live Interactive Workshops

Start a 5-Day Free Trial