Transcript from the "Denormalization & Document Structure" Lesson
>> So that's the heavy part of all the query and now I'm gonna talk a bit more about what we haven't seen and that's joining. So in SQL tables, or SQL tables, we see that we always join by having a foreign key, so in this case, we have tbl users with the UID of David 123.
[00:00:18] And then that joins out to the UID in the expense of David 123, and we can see that David 123, there's just two expenses in that table. And so writing a query, I can get that user back, so I can select all the properties I want. And I can actually enter join and I can get back a whole column of the ID, the cost, the date, the user's UID and their first and last name.
[00:00:43] And it comes back as one big result set. So anytime I need to get data, since it's all denormalized I have to query and combine it all together. With no SQL, depending on your use case, this is a lot different. In the same way where you saw that I queried to get that user, it's very common to say, all right, when I have a user's expense.
[00:01:06] Or when I go to expenses, instead of just putting the UID for user, I'm gonna put the whole user object in there or things that I feel that I need. I've first written there twice, should be first and last. When you look at that, a lot of people who are used to very highly normalized structures are like, whoa, that's duplicated data.
[00:01:29] What happens when David needs to move from Washington DC to New York City, do I have to update all that? And the answer is, well it depends. And a lot of systems, how often is someone moving cities? So how often is someone updating their profile? So in cases like that where you know that data rarely changes, denormalizing it tends to be a really great option.
[00:01:52] Because then I don't have to worry about querying, I'm still prioritizing my reads over my writes. And if they do update something, I can trigger a system known as fan out where I say, okay, let me find all their expenses and let me update all that. You can do that when they update their profiles, or you could do that even on like a crown job style basis depending on what your system has.
[00:02:13] And this is something that's very common in no SQL systems, is just copying data, but being aware of where it is by, in this case, we have the UID. So at any point in time, if I need to fan out that update, I can do that. And then what this allows me to do, is I can still write a query like that where I can say user.uid = David 123 but I get back the whole record.
[00:02:37] And so instead of doing a join, it's all readily available for me. Also, another thing about denormalization is that a lot of people, when they think they come on to a no SQL database, they get kind of afraid. They're like, I'm gonna have to copy data everywhere, I'm gonna update data everywhere.
[00:02:55] That's kind of frustrating. I'm used to normalized systems. I don't like that. Well denormalization, tends to be most useful when a, data that's embedded is rarely updated, because then it's pretty easy to manage. And also when you're dealing with one to many relationships. Because you can do somewhat of a join with Firestore by saying, okay if I have one user, I can get many of their expenses.
[00:03:23] So, in this case, I'm querying for the active users expenses, and I'm getting the current user with something like Firebase off. And so, I can get that user information from the database, and then because I have their user information, I can create a realtime listener. And then now I can merge those two things together.
[00:03:45] So those are two separate calls, but I have authentication information from one and I can combine their expenses in as well. So this is another way, this works really well. If you have a one to many situation, this is what I tend to do most of the time.
[00:04:01] When it's say, we just get something. I have the active user, I have a way to get their data. If it's one user, many objects, that's pretty easy to put together. Denormalization tends to be more important when you have a many to many situation. If a expense could have many owners and then I would start looking more into embedding the dock.
[00:04:28] Because I would have to do a multiway read to get all of that user's information back because they're located in separate spots. But if I know that I'll have two users per expense I can just embed them in there and there's no join. So many to many relationships, I find that that tends to be a good place to denormalize and copy your data over.
[00:04:50] One to many you can get away a lot with just saying, hey, I'm gonna do something that's like a join. So it's important to take away here is that a lot of people think like when I join. Or when I start with a no SQL database, I have to let go of all these normalized standards that I have or all these things I'm used to.
[00:05:07] And to a degree, yes, you have a whole new world of flexibility where you don't need all the same columns. You don't need to follow a very strict data structure anymore. So use that to your advantage. However, it doesn't mean that all of a sudden, data just has to be copied everywhere, and everything needs to do that.
[00:05:23] It's more of a spectrum, where you say, okay, I'm going keep these separate and I'll do a query for these, but these ones I'm gonna embed on. So it's more of a spectrum rather than just saying, now that I'm using no SQL, I'm gonna go all the way with denormalizing everything.
[00:05:44] And one of the things that's gonna help out a lot with denormalization and just with querying in general is hierarchy. So, in Firestore, as we've seen, we have collections and documents, but documents can have sub-collections themselves. And so, this right here creates a very logical path for us to read data, and it makes it easier to read data without writing any queries whatsoever.
[00:06:13] So, let's say within the users collection, I decided, okay I don't want to store expenses out as its own collection. I'm going to embed expenses within each user's document. So I would have this new path where I'd say slash users, slash the user ID, but then slash expenses.
[00:06:35] And so now if I want to get us a user's expenses, I can do that without writing a single query because of the way I structured my data. And it's this kind of hierarchy is really the hallmark of no SQL databases. I'm able to find paths that kind of do my most common queries for me.
[00:06:53] And since Firestore does have limited queries, this is also one less field that I have to worry about restricting on. So I don't have to say where the UID is equal to this. I can just say hey, UID is already taken care of in the path for me, I now have two more fields that I get to do composite queries by.
[00:07:14] And this is really what the query looks like. it's just get me David 123's expenses, and queries in Firestore are shallow. So just because I'm getting back a user document doesn't mean I'm getting back all of their queries. If I get back a user document, I just get that document.
[00:07:31] If I want to get their queries, I will need to further nest down like I'm doing here. Yes, so yeah, you only get the document, you do not get it sub collection. So that allows you to follow the structure without worrying about just vacuuming up all the data below.
[00:07:47] The real time database and the other database we have does not work the same way. Queries are deep. So in the way that if you query one thing, you do get all of the data underneath it. So it has a bit of a different data structuring pattern where you wanna keep things flat as possible at the top.
[00:08:06] But now if we break everything out though, into if we embed use expenses as a sub collection of users. What if we still have a requirement in our document or within our app to say, well, I still need to see all the expenses across all users. Would I have to query across all, get all my user IDs and then create a bunch of separate queries and put them all together?
[00:08:29] Well, Firestore knows that this is a big issue, so if we're using this common expenses sub-collection. Where we have David 123 has expenses and Darla 999 has expenses, and Juan 217 has expenses. Well, couldn't I just query across all sub-collections called expenses? And that's exactly what you can do with a collection group query?
[00:08:56] So right here you can see we're using this function called collection group and it's just has one field or one path on it called expenses. And it will look across the database for any sub-collection named expenses, and it will perform a query on it. So by the fact that I'm structuring my data this way, I'm getting a lot of the best of both worlds.
[00:09:17] I can get all of my user data back if I want, or I can get all collections just for one user with a simple read by following the hierarchy path. Or I can get all expenses back for all users with collection group queries, so allows me to do either.
[00:09:36] And collection group queries, also, they work just like other queries. You can continue to restrict, and do other things. So in this case, I can say hey, get me the category, give me any expenses across all users where category contains food and the cost is between $200 and $210.
[00:09:57] So that allows me to do both of those things. So, I'm gonna show off how collection group queries work.