Transcript from the "Git Blobs and Trees" Lesson
>> Nina Zakharenko: The way that git stores things is in git objects, and the most basic one is called a blob. So git stores compressed data in a blob, along with some metadata about that blob in a header. The header stores the word blob, meaning this is a blob, and then the size of the content.
[00:00:24] That's followed by a \0 delimiter, which is the null string terminator in C, and finally the content.
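The layout just described (the word blob, the size, a \0 byte, then the content) can be sketched in a few lines of Python. This is only an illustration of the byte layout, not git's own code, and the function name blob_bytes is made up for the example.

```python
def blob_bytes(content: bytes) -> bytes:
    """Return the uncompressed bytes of a blob: header, NUL byte, content."""
    # The header is the word "blob", a space, and the content size in bytes.
    header = b"blob " + str(len(content)).encode("ascii")
    return header + b"\x00" + content


# echo appends a trailing newline, so the content here is 14 bytes long.
store = blob_bytes(b"Hello, World!\n")
print(store)  # b'blob 14\x00Hello, World!\n'
```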
>> Nina Zakharenko: The way that git does this under the hood is with a plumbing command called git hash-object. So say we ask git for the SHA-1 of this string, 'Hello, World!'.
[00:00:59] We can do that by echoing the string, which just prints it out in our terminal, and using the pipe to pipe it into this command. Then we use the --stdin flag to say get this input from the terminal, from standard input. Otherwise, it expects a file. And we get this long string, starting with 8ab686.
>> Nina Zakharenko: Now, let's generate the SHA-1 of the contents with the metadata, as it's stored in git. There are many tools available for many systems to generate a SHA-1 hash. On my Mac, I use openssl sha1. So now when I generate the hash of that content, meaning the word blob, the size, the terminator, and Hello, World!, I get the same number.
[00:01:58] It's a match. If you run the hash function on the same content twice, you'll always get the same result. And that's really one of the core fundamental features of git. Because of this, blobs are generally unique in git. And the likelihood of a collision, meaning that two different pieces of content generate the same hash, is vanishingly small.
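The same check can be reproduced without git or openssl. The sketch below uses Python's standard hashlib module in place of openssl sha1, and the name hash_object is borrowed from the plumbing command purely for readability. It hashes the header plus the content and runs twice to show that the result is deterministic.

```python
import hashlib


def hash_object(content: bytes, obj_type: str = "blob") -> str:
    """SHA-1 of '<type> <size>', a NUL byte, then the content."""
    header = f"{obj_type} {len(content)}".encode("ascii")
    return hashlib.sha1(header + b"\x00" + content).hexdigest()


# Same content in, same hash out, every time.
first = hash_object(b"Hello, World!\n")
second = hash_object(b"Hello, World!\n")
print(first)            # starts with 8ab686, as in the demo
print(first == second)  # True
```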
>> Nina Zakharenko: Now where does git store these objects? It stores them in the .git directory. So when you run git init in a directory, git will tell you that it initialized an empty git repository, and it will give you the path to the .git directory. So that .git directory contains all of the data about our repository.
[00:02:56] If we delete the .git directory, we blow our repository away, but not our files. Now, where are the blobs stored? I'm gonna run the command that I did previously to ask git hash-object, except this time, I'm going to be passing in the -w flag before the --stdin flag.
[00:03:26] And that's asking git to go ahead and write that blob. For the purpose of this demo, I'm just going to delete the .git/hooks directory, so that you see a little bit of a cleaner output. We don't need that for now. Now, when I use the tree command to look in my .git directory,
[00:03:51] I'll see that I have a directory called objects. It has a subdirectory called 8a and then a file that starts with b686. So our blob is stored in that objects directory. The directory name is the first two characters of the hash, and the file name is the rest of the characters. But we need something else.
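As a rough sketch of what writing a blob involves, the code below builds the same path layout inside a throwaway temporary directory and stores the zlib-compressed bytes there. The helper name write_blob is hypothetical, and real git does more than this (for example, it writes through a temporary file first), so treat it as an illustration only.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def write_blob(git_dir: Path, content: bytes) -> Path:
    """Write a loose blob under git_dir/objects and return its path."""
    store = b"blob " + str(len(content)).encode("ascii") + b"\x00" + content
    sha = hashlib.sha1(store).hexdigest()
    # First two hex characters name the directory, the rest name the file.
    path = git_dir / "objects" / sha[:2] / sha[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(store))
    return path


with tempfile.TemporaryDirectory() as tmp:
    path = write_blob(Path(tmp) / ".git", b"Hello, World!\n")
    relative = path.relative_to(tmp).as_posix()
    restored = zlib.decompress(path.read_bytes())

print(relative)  # .git/objects/8a/b686...
print(restored)  # b'blob 14\x00Hello, World!\n'
```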
>> Nina Zakharenko: That blob is missing information. It's missing some critical things if we're working with files, and not just input from the terminal. We don't have filenames, and we don't have directory structures either. So if we save a file as a blob, how do we know what that file was called and what directory it was stored in?
[00:04:58] Git stores this information in something called a tree. So the tree,
>> Nina Zakharenko: It contains a few things. It contains pointers, using SHA1, to blobs. And it also contains pointers to other trees, and why is that? It's because subdirectories can be nested, right?
>> Nina Zakharenko: The tree also has some metadata.
[00:05:32] It stores the type of pointer: for a blob it's blob, for a tree it's tree. It stores the file name or the directory name of the thing that it's pointing to, and it also stores a mode: is the file executable, is it a symbolic link.
[00:05:52] Those are things that are important when you pull down the repo again. For example, if you are storing a script and that script is executable, the next person who goes and pulls down your repo will have an executable script. They don't have to change permissions or do anything special.
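To make the mode idea concrete: git records only a handful of mode values rather than full Unix permissions. The sketch below maps a path on disk to the value a tree entry would carry. It assumes a POSIX file system, the function name git_mode and the file name deploy.sh are invented for the example, and submodule entries are left out.

```python
import os
import stat
import tempfile


def git_mode(path: str) -> str:
    """Return the mode string a tree entry would use for this path."""
    st = os.lstat(path)  # lstat, so a symbolic link is not followed
    if stat.S_ISLNK(st.st_mode):
        return "120000"  # symbolic link
    if stat.S_ISDIR(st.st_mode):
        return "40000"   # directory, stored as a tree
    if st.st_mode & stat.S_IXUSR:
        return "100755"  # executable file
    return "100644"      # regular file


with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "deploy.sh")
    with open(script, "w") as handle:
        handle.write("#!/bin/sh\necho deployed\n")
    plain_mode = git_mode(script)
    os.chmod(script, 0o755)  # mark the script as executable
    executable_mode = git_mode(script)
    directory_mode = git_mode(tmp)

print(plain_mode, executable_mode, directory_mode)
```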
>> Nina Zakharenko: So here we have our tree.
[00:06:16] In the previous example, I calculated the size. It was just the number of characters. In this example, I have a placeholder. We have that \0 terminator, and we have a blob that points to hello.txt. I have this kind of simple directory structure: I have hello.txt, a folder called copies, and in that copies folder I have a file called hello-copy.txt.
[00:06:47] So the blob points to hello.txt and the tree points to the copies directory.
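The directory structure above can be turned into tree objects with a short sketch. It assumes hello.txt holds the 'Hello, World!' line from earlier, that both files are plain non-executable files, and the helper names hash_object and tree_body are illustrative. Inside a tree, each entry is the mode, a space, the name, a \0 byte, and then the 20 raw bytes of the SHA-1 it points to.

```python
import hashlib


def hash_object(obj_type: str, body: bytes) -> str:
    """SHA-1 of '<type> <size>', a NUL byte, then the body."""
    header = f"{obj_type} {len(body)}".encode("ascii")
    return hashlib.sha1(header + b"\x00" + body).hexdigest()


def tree_body(entries) -> bytes:
    """Serialize (mode, name, sha_hex) entries in git's sorted order."""
    def sort_key(entry):
        mode, name, _sha = entry
        # Git orders directory names as if they ended with a slash.
        return (name + "/" if mode == "40000" else name).encode("utf-8")

    body = b""
    for mode, name, sha_hex in sorted(entries, key=sort_key):
        body += f"{mode} {name}".encode("utf-8") + b"\x00" + bytes.fromhex(sha_hex)
    return body


blob_sha = hash_object("blob", b"Hello, World!\n")

# The copies directory: one tree holding one blob entry.
copies_body = tree_body([("100644", "hello-copy.txt", blob_sha)])
copies_sha = hash_object("tree", copies_body)

# The top-level tree: a blob entry for hello.txt and a tree entry for copies.
root_body = tree_body([
    ("100644", "hello.txt", blob_sha),
    ("40000", "copies", copies_sha),
])
root_sha = hash_object("tree", root_body)
print(root_sha)
```

Both hello.txt and hello-copy.txt end up as entries carrying the same blob hash, so only the names differ between them.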
>> Nina Zakharenko: So, just a note that the copies directory here, it has a file in it. All right, show of hands, has anyone tried to add an empty directory to git? Yeah, it doesn't work, right?
[00:07:15] You can't, so git doesn't store empty directories. The issue isn't with empty trees, those work just fine. It's a limitation in the staging area. Which we're gonna talk about in the next section. So the staging area, that one only keeps track of files and not directories, so we can't store empty directories.
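One way to see that empty trees are valid objects: a tree with no entries has a zero-length body, so its hash is just the hash of the header. A minimal sketch using Python's hashlib:

```python
import hashlib

# An empty tree has no entries, so only the header and the NUL byte are hashed.
empty_tree_sha = hashlib.sha1(b"tree 0\x00").hexdigest()
print(empty_tree_sha)  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```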
>> Nina Zakharenko: So, trees. They point to other trees and blobs. They're a directed graph.
>> Nina Zakharenko: And because of that feature of SHA-1, where if you run the algorithm on a piece of content it always returns the same SHA-1, identical content is only stored once in git. So we have our directory structure from the previous example.
[00:08:13] We have a tree here. The first blob points to hello.txt. We have another tree in this tree. Where does that tree point to? Which directory? Can anyone tell me?
>> Speaker 2: Copies?
>> Nina Zakharenko: That's right, it points to copies, and in copies I have hello-copy.txt, and it's an exact copy of my hello file.
[00:08:40] So that tree points to another tree, and that one has a blob that points to hello-copy.txt. Now that blob points to the same place. We can see that the SHA-1s here match.
>> Nina Zakharenko: And this is really, really important, one of the most critical ideas about git. This is how git saves tons of space on your hard drive when storing full repositories.
[00:09:16] It's one of the critical features that has driven a lot of design decisions. So as we commit, if that blob or that tree hasn't changed, if it stays the same, we're just gonna point to the same copy. That's why checking out branches in git is super fast.
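The "stored once" behavior falls out of using the hash as the key. Here is a toy in-memory model, assuming a plain dictionary stands in for the objects directory. The class name ObjectStore is invented for the example and is not part of git.

```python
import hashlib


class ObjectStore:
    """Toy content-addressable store: the SHA-1 is the key."""

    def __init__(self):
        self.objects = {}

    def put_blob(self, content: bytes) -> str:
        store = b"blob " + str(len(content)).encode("ascii") + b"\x00" + content
        sha = hashlib.sha1(store).hexdigest()
        # Identical content maps to the same key, so nothing new is stored.
        self.objects.setdefault(sha, store)
        return sha


store = ObjectStore()
hello_sha = store.put_blob(b"Hello, World!\n")  # hello.txt
copy_sha = store.put_blob(b"Hello, World!\n")   # copies/hello-copy.txt
print(hello_sha == copy_sha)  # True
print(len(store.objects))     # 1
```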
>> Speaker 2: There's some confusion around the difference between the blob and the SHA-1 hash. Like, what is the difference?
>> Nina Zakharenko: So the SHA-1 hash is that number. In this example, for hello.txt, the SHA-1 hash starts with 8ab68. When you run hash-object on a piece of content, that's the number that you'll get. It's the key for that piece of content.
>> Speaker 2: And the blob is the pointer to the hash? Or what are you saying when you say the blob?
>> Nina Zakharenko: The blob is, it stores that key, so that's how it knows which piece of content it's associated with.
>> Speaker 2: So that's like the data structure?
>> Nina Zakharenko: Yeah, can I go back a few slides?
>> Speaker 2: Yeah, sure.
>> Nina Zakharenko: So I'm gonna go back to this slide. So if you all have a terminal open, let's try this out.
>> Nina Zakharenko: Type echo space, and then, make sure you get the punctuation and the capitalization right.
>> Nina Zakharenko: If you run this on your terminal and you copy it exactly, you should get the same number.
[00:11:04] Did everyone get the same result? Awesome, all right. So that number is what happens when git runs that content through a cryptographic function that generates the hash. Did that answer the question, Mark?
>> Speaker 2: I will not know because there's like a delay in the stream.
>> Nina Zakharenko: Got it, okay.
>> Speaker 2: But yeah, I mean like the question is like, when you refer to a blob, that's something other than this number. The blob essentially is a pointer to this number.
>> Nina Zakharenko: Right.
>> Speaker 2: Okay.
>> Nina Zakharenko: And we'll get into that in a few slides.
>> Speaker 2: Okay.
>> Nina Zakharenko: So we were here talking about how identical content is only stored once.
[00:12:01] And how this allows git to be incredibly efficient. It looks like no one in this room has worked with an older version control system. I know Mark said he has. You have, you have, okay. Back in the day before git, when you wanted to change branches, do you remember what that was like? It would just take forever.
[00:12:23] It would go and fetch all the copies down from the server and it wasn't instantaneous, so this is really pretty cool. So git has some other optimizations too.
>> Nina Zakharenko: It's kind of because of the way we work. So as our files change, they stay mostly similar. We might add a method here or there or change a comment line.
[00:12:52] And the way that git deals with that: we already talked about how git objects are compressed. Git is going to optimize for files staying mostly similar by compressing these files together into a packfile. The packfile stores the object, and it stores something called a delta. And that delta is the difference between one version of the file and the next.
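To give a feel for why a delta is cheap when files stay mostly similar, here is a loose sketch built on Python's difflib. This is a swapped-in technique for illustration: git's real packfile delta encoding is a binary format and differs in detail, and the names make_delta and apply_delta are hypothetical. The idea it shares is that the new version is described as "copy this range from the old version" plus "insert this new data".

```python
import difflib


def make_delta(old_lines, new_lines):
    """Describe new_lines as copy-from-old and insert-new instructions."""
    ops = []
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            ops.append(("insert", new_lines[j1:j2]))
        # "delete" emits nothing: those old lines are simply never copied.
    return ops


def apply_delta(old_lines, ops):
    """Rebuild the new version from the old version plus the delta."""
    out = []
    for op in ops:
        if op[0] == "copy":
            out.extend(old_lines[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


v1 = ["def greet():\n", "    print('Hello')\n"]
v2 = v1 + ["\n", "def farewell():\n", "    print('Bye')\n"]

delta = make_delta(v1, v2)
rebuilt = apply_delta(v1, delta)
print(delta)
print(rebuilt == v2)  # True
```

Only the added method shows up as inserted data. The unchanged part is a single copy instruction that refers back to the first version.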
[00:13:16] Packfiles are generally generated when you have too many objects, during garbage control. Sorry, garbage collection. Git runs garbage collection under the hood, generally every few weeks, or when you ask it to. These packfiles are also generated when you push to a remote. Deep-diving into this section of internals is outside of the scope of this class.
[00:13:49] But now you can kind of understand what's going on under the hood when you see that compressing deltas message during a git push. You guys have all seen that. So you can easily poke around the .git folder with a newly created repository, but once your repository gets too big, or you do a lot of work in it,
[00:14:08] possibly after a push, those objects are gonna start getting compressed and you won't just be able to peek at them.