Lesson Description
The "Partitioning & Sharding" Lesson is part of the full, Backend System Design course featured in this preview video. Here's what you'd learn in this lesson:
Jem explains how relational databases scale using partitioning and sharding, outlining how each improves performance. He emphasizes understanding data distribution and choosing the right shard key to scale effectively while preserving transactions.
Transcript from the "Partitioning & Sharding" Lesson
[00:00:00]
>> Jem Young: Let's talk about this. This is going to be a very tricky topic, and I spent way too long figuring out how to explain this in a way that made sense to me, so that I can help it make sense to you all. And this is going to be on the concept of, how do you slice your data? So how do you scale, say a relational database? Because remember we said you can't just horizontally scale it. That doesn't mean it's not ACID compliant at that point.
[00:00:26]
So the only way you can really do that is you have to slice your data up, which seems easy, and there's, but it's a little tricky. There's two ways we do it. There's partitioning and there's sharding. And this is very confusing. That's why I spent a long time and I read a lot of blog posts and a lot of examples, and like a lot of them are wrong or saying the wrong thing and I was like, oh man, how do I make this make sense?
[00:00:51]
So we're going to get it today and you're all going to leave a little bit smarter on understanding, hopefully partitioning and sharding, this makes sense. So partitioning, sharding is a type of partitioning. Partitioning just means you're slicing your data on some field. And you're saying, well, that's also sharding. So this is going to get a little confusing. So say I have a giant relational database.
[00:01:20]
I can't just add another server there. How do I make it faster? I'm at the limit already. So what I can do is I can just pick an arbitrary table, say I have a really big table, a million rows or something like that. I pick a point and I just slice it. In this case, I'm using a user ID. I can slice there. In that way I know automatically when I'm making a query instead of having to query a million rows for something.
[00:01:45]
I can just, hey, I'm only going to use this partition because I know the ID is ID 10. So I'm automatically going to make a fast query. And that's partitioning. It's just slicing your data. But here, here's the secret part, it's on the same database. So it's not a second database, it's on the same database, and partitioning is just, I've got a large database, and we slice the table in half or some division, and I can speed up my queries substantially by doing that.
[00:02:25]
Okay, get ready. Sharding is also slicing your data on some arbitrary key called a shard key to increase performance. However, and sharding is a type of partitioning, but sharding implies you're doing it over different machines. And I know that you're like, oh yeah, obviously, but when you look this up on the internet, it's very confusing to a lot of people, because you're like, oh, sharding, partitioning, they're both slicing the data, what's the difference?
[00:02:54]
It really boils down to, is it on the same machine or not? That's really, that's it, no confusion there. And to do sharding, we pick a shard key. So it could be anything. I'm picking first name as something we're going to start on. So that means every time I write, go to the bouncer, it says, oh, and you can optimize this and say, oh, the first name is Jem, J for Jem. I know that's going to go in database two.
[00:03:23]
And that dramatically increases your performance. So all these things are about hitting the limit of your database and being able to scale up. Sharding is something you're automatically going to do. Actually, the first thing you do when you hit the limit is you're going to partition, because partitioning is always easy. Remember the pizza oven example the other day. The more servers you have, the more complex you have to manage.
[00:03:49]
So our default is always going to be the simplest thing first. You partition, very easy, because you're not adding another database, you're just slicing your data a different way to make it faster. Once you hit the limit of that, then you start sharding. Hey, what's an easy way of sharding across all my data? What's something that's very common across all of it? If it's a user table, maybe username, maybe email address, something like that.
[00:04:15]
And that's your shard key, and you can pick that arbitrarily. And now, this enables you to have a relational database, maintain that ACID compliance, but scale horizontally. And this is really the only way you can scale horizontally on a relational database and keep the transactional guarantees that you have. Now, if we had a NoSQL or non-relational database, sharding happens automatically. Mongo, it's built to do that.
[00:04:43]
It just horizontally scales out as increased loads. Pretty cool. So you really only have to think about this when you talk about non-relational databases or relational databases. So again, relational, non-relational, that's why non-relational are kind of cooler. It automatically horizontally scales where you don't think about it. And just hammer it home because I want to make sure I got it. Because again, this is really tricky to look up.
[00:05:15]
The example is, is it partitioning or sharding? Partitioning is, you're having a pizza party, as we do. And you have a bunch of different pizzas, but they're all at the same house, but you have one buffet table, it's pepperoni, cheese at a different table. Meatball, I don't know. Pineapple and yeah, yeah, you know, back in the corner at a different table, but it's all in the same house. Now, sharding is, you're having a pizza party, but it's a block party and every house has its own type of pizza.
[00:05:58]
So one house has pepperoni, one has pineapple at the end of the block. And that's between partitioning and sharding. Yes, you can combine the two, of course. So if you have, like, 10 servers and you have 1000 different partitions or something, you would spread those partitions, essentially sharding them across those servers, so that when you do want to scale further, you use them further. Expand the partitions that already exist, yeah.
[00:06:30]
You just, you keep slicing. So when you hit the limit of this particular sharding strategy, what would you do? You partition first, you hit the limit of that, you shard again, and that's how you keep replicating or you keep splitting and increasing performance of a relational database. This is tricky though, because if you get it wrong, say the shard key is incorrect, you're not sharding on the right thing, you can overindex on, well, not overindex.
[00:07:01]
You have to understand your data, because what's a problem inherently with sharding that you can think of? Or what assumption am I making here? That the load balancer and the network is very fast and working, you know where to go. Well yeah, I like what you're all thinking. Think more about the data. I was going to say the thing that I guess comes to mind for me is that we're assuming the data is going to be split up into equally sized groups when, like, if we say we're splitting by the first name and like, or like let's say we're splitting by the last name and we have a bunch of Smiths and like that might take up like 25% of the data if we choose to split things into like quarters alphabetically, we still have one section that is overloaded.
[00:07:47]
You nailed it, yeah. If you get your shard key incorrect, well, I'll say the one you're assuming that data is evenly distributed, and it's not. Data is never evenly distributed. But this shard key on first name, you're assuming everybody has a first name that's spread across the alphabet evenly, which is not true. I wish I had some cool facts about people's names and their letters, but we know that people probably with a Z, is probably not as common as people with like an M or, I don't know, Js, probably Js.
[00:08:21]
Js are pretty common. So this is why you really have to understand your data, and it's not just, oh, a simple performance thing, split my database up. You have to understand how your data is going to change over time, because the shard key can change as well. But if you get it wrong, you now have, hey, why is one cluster blowing up and the others are all, we're not being fully utilized. Maybe we chose the wrong shard key, maybe our data changed.
[00:08:52]
So I say all this to say this isn't easy. It seems really easy on paper, but when you actually do it in real life, you get weird, you have edge cases. So now, after, you know, after hours of research, you can all, I'll explain simply between partitioning and sharding. If you want a challenge, go, go look it up and see if like, see the explanation, people have long super windy explanations, when really it's not super complicated.
[00:09:21]
Oh yeah, so you even added more because I really want to hammer it home. Partitioning, your data fits on one server, and you still need transactional guarantees. You have queries that span multiple tables. So we can talk about that with sharding, but one thing you do lose is the ability to query easily across tables, which means your performance isn't going to be as fast. So again, you have to pick a good shard key.
[00:00:00]
If you're doing a lot of complex queries and it has to reach across multiple databases, it's automatically going to be slower than just one giant, giant machine.
Learn Straight from the Experts Who Shape the Modern Web
- 250+In-depth Courses
- Industry Leading Experts
- 24Learning Paths
- Live Interactive Workshops