Lesson Description

The "Byte Arrays" Lesson is part of the full, C Fundamentals course featured in this preview video. Here's what you'd learn in this lesson:

Richard explores how byte arrays represent strings in C. Each integer value in memory corresponds to an ASCII character code. UTF-8 and ASCII values are compared. Pointers and memory addresses are also discussed.

Transcript from the "Byte Arrays" Lesson

[00:00:00]
>> Richard Feldman: So if we go back and apply this concept to Hello, World, one of the things that might surprise you is that when we're doing this write call, and we're passing in 1, that was actually a 32-bit integer. So that was 32 ones and zeros, which was to say, the bottom one that we had on the previous slide.

[00:00:17]
That basically means that if you want to, you can specify up to 2 to the 32 different numbers in that. And that gets you up to something like 4 billion, which is the highest number you can express in a 32-bit integer. If you wanna go higher than that, you would need to actually reserve more bits in memory.

[00:00:36]
And if they wanted, they could have said, well, we're gonna have a lower number of options. And rather than it being a 32-bit integer we'll save some memory, and say you only get a 16-bit integer here, and then you only get up to about 65,000 max. Or you could say it's only an 8-bit integer, a byte, and then you only get 256 possible values.

[00:00:52]
So it's really up to the designer of the particular function to decide how big the integers they wanna support are. So yeah, this is literally what's happening in memory, when we were seeing the assembly instructions that were moving the data into here it's literally this many ones and zeros in memory that's getting copied in there.
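
As a rough sketch of those size limits, here's a small C program (using the fixed-width types from stdint.h, which aren't part of the lesson's own code) that prints the largest value each width can hold:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* Each extra bit doubles how many distinct values you can represent. */
        printf("8-bit  max: %llu\n", (unsigned long long)UINT8_MAX);   /* 255 */
        printf("16-bit max: %llu\n", (unsigned long long)UINT16_MAX);  /* 65,535 */
        printf("32-bit max: %llu\n", (unsigned long long)UINT32_MAX);  /* about 4.29 billion */
        printf("64-bit max: %llu\n", (unsigned long long)UINT64_MAX);  /* about 18.4 quintillion */
        return 0;
    }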

[00:01:12]
Now this right here, if we want to just look at, in memory, what are the ones and zeros that correspond to Hello? These are the actual ones and zeros in memory that correspond to Hello. They are the bytes 72, which corresponds to H, 101 is e, 108 is l, which is repeated, and then 111 is o.
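
As a quick sanity check of that mapping, here's a tiny C program (not from the lesson) that prints each byte of "Hello" as its integer value:

    #include <stdio.h>

    int main(void) {
        const char *hello = "Hello";
        /* Print each byte of the string alongside its integer (ASCII) value. */
        for (int i = 0; hello[i] != '\0'; i++) {
            printf("'%c' is the byte %d\n", hello[i], hello[i]);
        }
        return 0;
    }

It prints 72, 101, 108, 108, 111, matching the values above.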

[00:01:32]
And this is basically a mapping that, again, we've just decided upon. We just made this up. We just decided that when you see these bit patterns, this is what should get printed on the screen. It's entirely a concept outside the hardware; the CPU has no idea what a string encoding is.

[00:01:46]
It's just, yeah, I know about ones and zeros, and we've just decided that, yeah, when you do this, what your teletype, I mean, your screen, should print out is these lines and drawings that we recognize as letters. So again, we decide how to interpret the memory.

[00:02:02]
This is all about how the hardware doesn't care. This is just us deciding to ascribe certain meaning to these ones and zeros. This particular encoding happens to be ASCII, which is the American Standard Code for, I actually don't remember [LAUGH] exactly what it stands for. But it's basically a character encoding that's just a way of saying, here's how we're going to map these ones and zeros to actual letters in that language.

[00:02:25]
Original ASCII did use, in fact, 1 byte per character. And actually, fortunately, it only used 7 of the 8 bits. Notice every single one of these starts with a 0, and that's not a coincidence. In ASCII, that's just always true. ASCII always has the very first bit set to 0 just because they only use 7 out of the 8 bits to represent all of the letters they wanted to use.

[00:02:49]
Now, as it turns out, American English is not all the languages in the world. There's actually others. And so yeah, this is where we got Unicode. And fortunately, there was this really, really clever encoding. And it's required that if you ever introduce someone to the concept of UTF-8, you have to explain how clever this is.

[00:03:05]
But they managed to make an encoding that was backwards compatible with ASCII, which is basically, if we were to look at Unicode characters here, this stuff that we have right here, this is both ASCII and valid UTF-8. Because UTF-8 cleverly said, for all of the first 128 characters that we support in UTF-8, we're gonna choose to have them be exactly the same as ASCII.

[00:03:28]
So UTF-8 is totally backwards compatible with ASCII. If you have some program that was using ASCII for its strings in the 1970s and you decide to upgrade to UTF-8, all of that code will still just completely work as is, because UTF-8 represents those strings exactly the same way in memory.

[00:03:42]
What they did was they said, we're gonna use the fact that that first bit was unused in ASCII and say, if we set that first bit to a 1, now we're in UTF-8 land. And that 1 indicates, okay, now we're gonna shift gears and get into, okay, now we're gonna do fancy Unicode stuff.

[00:03:58]
And all the old code wasn't using that anyway. None of the ASCII stuff ever actually had a 1 there. So it was totally fine to do that, because that was sort of unused. And so that first bit in ASCII is unused, but in Unicode, sorry, in UTF-8, it's a flag saying, are we doing fancy Unicode stuff or not?

[00:04:13]
And then the other 7 bits are interpreted differently depending on whether that first bit was set. And one of the things that UTF-8 allows, again, because we decide how to interpret memory, is that sometimes, depending on what the other 7 bits are, UTF-8 might say, you know what, this letter actually takes up multiple bytes.
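
A minimal sketch of that idea, assuming well-formed input, is a little function that looks at the leading bits of a byte to decide how many bytes the character occupies (the helper name is made up for illustration, and real decoders also validate the continuation bytes):

    #include <stdio.h>

    /* Classify a UTF-8 byte by its leading bits. */
    static int utf8_sequence_length(unsigned char b) {
        if ((b & 0x80) == 0x00) return 1;  /* 0xxxxxxx: plain ASCII, one byte */
        if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx: lead byte of a 2-byte character */
        if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx: lead byte of a 3-byte character */
        if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx: lead byte of a 4-byte character */
        return 0;                          /* 10xxxxxx: a continuation byte, not a lead byte */
    }

    int main(void) {
        printf("%d\n", utf8_sequence_length('H'));   /* 1: the high bit is 0, so it's ASCII */
        printf("%d\n", utf8_sequence_length(0xC3));  /* 2: first byte of a 2-byte character */
        return 0;
    }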

[00:04:30]
We're gonna take this one, and also we're gonna use this one, and possibly also this one, maybe even this one. So that's one of the things that they use to be able to expand how many different characters you can encode in your text. Okay, and then the final argument of write is a 64-bit integer, meaning 64 ones and zeros in memory.

[00:04:50]
So that means that you can have a string that is very long here, not just 32 bits, which we said goes up to about 4 billion. So if this were a 32-bit integer, you could only give write strings that were up to 4 gigabytes. And if it were a signed integer, only 2 gigabytes.
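
For reference, this is roughly what write's declaration looks like in unistd.h; on a typical 64-bit system, size_t is a 64-bit unsigned integer, which is why the length can go so much higher than 4 gigabytes:

    #include <stdio.h>
    #include <unistd.h>

    /* ssize_t write(int fd, const void *buf, size_t count); */

    int main(void) {
        /* On most 64-bit systems this prints 8, i.e. the length argument is 64 bits. */
        printf("sizeof(size_t) = %zu bytes\n", sizeof(size_t));
        write(1, "Hello, World!\n", 14);  /* fd, address of the first byte, byte count */
        return 0;
    }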

[00:05:07]
If you've ever used Java, Java came out in 1995 and they made the decision to use 32-bit integers for everything. And as a consequence of this, you can find Stack Overflow posts and stuff talking about, hey, I'm trying to ingest this CSV that's more than 2 gigabytes, or maybe more than 4 gigabytes, whatever, a 5 gigabyte CSV, and Java just won't let me do it.

[00:05:26]
And the reason for that is that Java's arrays are indexed with 32-bit integers. So they just literally cannot conceptualize something that's longer than that. You have to break it up into multiple arrays and kind of work around that. Whereas in C, you can have multiple different sizes, including 64-bit integers.

[00:05:41]
So 64-bit means you can have, I think it's 9 quintillion. I had to learn the word quintillion in order to learn how big the things a 64-bit integer can represent are. It's more than gigabytes, I'll tell you that. [LAUGH] So yeah. We actually don't currently have any piece of hardware in the world that can max out a 64-bit integer's worth of memory, nothing supports anything that big.

[00:06:04]
Maybe someday they will. But, yeah, it's gonna be quite a long time before we feel we really gotta bump it up to 128-bit integers to really maximally support these things. Yeah?
>> Speaker 2: Where in the translation or where does the translation for UTF-8 actually happen? Is that on the operating system level or where?

[00:06:28]
>> Richard Feldman: That's a great question. So the question was, where does the translation for UTF-8 actually happen? It actually depends on what's doing the displaying. So for example, when we're running at the command line, it's actually your terminal application, or more specifically, the shell that's interpreting that. So when you say, write to standard out, you're basically just saying, here's some bytes, have fun with that.

[00:06:46]
And then it's up to some combination of the shell and the terminal to read those bytes and say, I'm deciding to interpret these bytes that I'm getting as UTF-8, I'm gonna render them accordingly. I don't know if this is still supported, but certainly some older shells would give you a drop-down for how to interpret it.

[00:07:00]
So, for example, before Unicode sort of emerged as the standard, you would have different localization. So for example, if you were in Japan, you would have a terminal that was interpreting the same bytes differently. And you could select from a drop-down, this is the Japanese encoding of these bytes.

[00:07:17]
I have a friend who was into anime back in the 90s, and he would get these video or audio files that were encoded in Japanese. And literally, his software that tried to play them would crash because it was trying to display these Japanese encoded characters. They would get the bytes off of the disc, and it's, here's some bytes for you.

[00:07:38]
And his program, the music player or whatever, would try to display them and it would crash because it was trying to interpret them as ASCII or something else. And they just didn't fit what it was expecting and some edge case happened and it would just crash. So yeah, the short answer is it's not the operating system, but rather some program running on the operating system that's actually in charge of translating these bytes into pixels on the screen.

[00:08:00]
Whatever program's job it is to do that, somewhere in that thing, probably the shell, or it could just be your UI program, is doing that. Also worth noting that in Windows, it's actually super common not to use UTF-8, but rather to use UTF-16. And also in the browser, it's super common to use UTF-16.
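
As a small illustration of that division of labor, this hypothetical snippet writes two raw bytes to standard out; it's whatever is rendering your terminal that decides they spell a character:

    #include <unistd.h>

    int main(void) {
        /* 0xC3 0xA9 is the two-byte UTF-8 encoding of 'é'. write just hands the
           bytes over; a terminal interpreting them as UTF-8 draws 'é', while one
           set to Latin-1 would draw the same two bytes as "Ã©". */
        const unsigned char bytes[] = { 0xC3, 0xA9, '\n' };
        write(1, bytes, 3);
        return 0;
    }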

[00:08:17]
We're gonna use UTF-8 here because when you're doing HTTP, HTTP is UTF-8, that's what we're doing. And also UTF-8 has definitely kind of emerged as the most popular standard these days. These days, people tend to view UTF-16 as sort of a historical mistake. It's sort of being phased out.

[00:08:34]
But along the way, Java used UTF-16, JavaScript uses UTF-16 behind the scenes in the browser, and all the Microsoft Windows APIs are UTF-16, at least for now. Now you might be surprised to learn that the write function actually takes three integers. Now that might be strange because it looks like it takes an integer and then a string and then an integer, but for real, it really does take three integers.

[00:08:56]
And that's because what's actually being passed to write is an integer, namely the memory address of the first byte of the string in the actual binary. So back when we were talking about the assembly stuff, we talked about how there was that LC0 section that corresponds to some bytes in your final binary executable.

[00:09:14]
And one of the things the C compiler will do is everywhere you see this string, it's gonna substitute the memory address of that. So essentially, what's happening here is that these ones and zeros, these exist at some point in your binary. And this right here is the integer of this first one.

[00:09:30]
It's this 0 right here, this very beginning of this section of memory, I just made this number up, I don't know what it actually would be. But it's some whatever, wherever in your binary this thing starts, that's what the C compiler is gonna insert here. And when you get down to the actual assembly, when you call write, you're passing an integer for this, a 32-bit integer.

[00:09:49]
This is gonna be a 64-bit integer, and this is gonna be another 64-bit integer. That's what's actually getting passed to that function. Each of these you can think of as an index into memory, which is basically a giant array of bytes. So the address of this one is ending in a 2.

[00:10:04]
This one is the same thing ending in a 3. This one's ending in a 4. That's the address of each of these, essentially an index into that gigantic array that you can conceptually think of as the memory of the running process and so on and so forth. So basically, I think this is a really good way to think about memory, is that it is basically a gigantic array of bytes.
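
One way to see that, assuming a C compiler is handy, is to print the address of each byte of a string; the addresses come out consecutive, like indices into one giant array:

    #include <stdio.h>

    int main(void) {
        const char *hello = "Hello";
        /* Each byte of the string lives at the next address over. */
        for (int i = 0; i < 5; i++) {
            printf("%p holds %d ('%c')\n", (const void *)&hello[i], hello[i], hello[i]);
        }
        return 0;
    }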

[00:10:24]
And then we can choose to interpret that gigantic array of bytes in different ways, depending on whether we wanna think of this part of that gigantic array as, this part's the string right here, and this part's a 32-bit integer, and this part's a 64-bit integer. Something that we're gonna do a lot of in this course, and that is a major part of C programming, is working in terms of these memory addresses.

[00:10:44]
These are actual integers that correspond to an index into this giant array of memory. So essentially, the reason that we were able to read memory garbage in our example is that what write takes, in addition to where do you wanna write, is give me a start index into memory.

[00:11:02]
This is the address into that array, the index into this gigantic global array. And then tell me the length, tell me how many slots do you wanna read, how many bytes? So basically saying, yeah, start here and just print out whatever the next 13 bytes are. Now that's exactly why when I changed it to 200 or 2,000, it was start here and print out the next 2,000 elements in that array.
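
A sketch of both calls, with the length deliberately wrong in the second one (this is undefined behavior, so it might print leftover memory or just crash):

    #include <unistd.h>

    int main(void) {
        const char *msg = "Hello, World!\n";  /* 14 bytes long */
        write(1, msg, 14);    /* prints exactly the string */
        /* Asking for 2,000 bytes keeps reading whatever happens to come next
           in memory, which is how unrelated data can end up on the screen. */
        write(1, msg, 2000);
        return 0;
    }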

[00:11:22]
That's exactly how you're able to get [LAUGH] exposed secrets. It's yeah, if we keep going and there's some secrets later on in the array, they're getting printed cuz that's all this function does. So on the one hand, this is a very simple design. It's basically, where do you wanna write?

[00:11:36]
What's the start index of the bytes in memory and how many bytes of that gigantic array that we call memory do you wanna write? On the other hand, we can also see how this simplicity is both powerful and also very, very error-prone. If you get this wrong, all bets are off.

[00:11:53]
Whatever happens to be in that memory is being sent to whatever destination we give to write. Okay, now I wanna address the term pointer. For the purposes of this workshop, I'm gonna tend to use the term memory address because I think it's more intuitive, but you will encounter this term pointer.

[00:12:08]
And it's famously a term that a lot of people tend to associate with, it took me a long time to understand what pointers mean. That's why I'm gonna use the term memory address instead. But if you ever do encounter this term, pointer just means memory address. There's this kind of funny tweet.

[00:12:21]
It's, hey, why are people constantly complaining they don't understand pointers? Pointers are just an index into a global array of bytes that we call memory. What's the problem? And I think the formulation of this is supposed to be a take on the old joke. Cuz, of course, monads are also famously hard for people to understand.

[00:12:34]
And the old joke is, monads are just monoids in the category of endofunctors, what's the problem? I assume that's what he's referencing with what's the problem here? So the point of this is essentially that the term pointer, whenever you hear anyone talk about that, they literally just mean memory address.

[00:12:48]
They mean a number that's an index into the giant global array of bytes that we call memory. So I'm gonna use the term memory address for this workshop, but if you ever hear people talking about pointers, that's what they're talking about.
