Backend System Design

Data Storage Overview

Jem Young
Netflix
Lesson Description

The "Data Storage Overview" lesson is part of the full Backend System Design course. Here's what you'd learn in this lesson:

Jem introduces common system design components, stressing domain understanding over structure. He covers scoping problems (and stating what’s out of scope in interviews) and explains when to choose relational vs. non-relational databases, emphasizing simple solutions based on data and storage needs.

Transcript from the "Data Storage Overview" Lesson

[00:00:00]
>> Jem Young: You know, one of my reflections from what we're going to talk about is just the similarities between every application. Those components we laid out yesterday, those common system design components, that's every system in the world to some degree. We can build anything with those, and it's just a small set of components, which is kind of amazing. But the challenge is understanding the domain, and that's what's tricky when you get into an interview where you have to think differently. I noticed it yesterday when we did the banking app and then switched to the URL shortener app: totally different things we're thinking about, even though it's a system design problem.

[00:00:38]
So it's really easy to get confident, overconfident, and when you drop in, you're like, oh yeah, I've done one, I can do another one. And then you have to think: it's not the system, it's the domain, and the system comes around it. We're going to walk through some design problems today, just like that. It's going to be like, here's the problem, and we'll see. It's the same components, there's a server, load balancer, some sort of database, cache and all that, but it's the domain-specific problems that we have to think really critically about, and that's where answering questions comes in.

[00:01:12]
But also, we're going to talk about scoping even more. Something we didn't talk about yesterday was saying something is out of scope, and that's totally fine. I probably didn't mention that enough: you can say in an interview, this is out of scope, I'm not going to do that, and the interviewer is going to say yes or no. But we want to make it easy on ourselves and scope the problem down as much as possible, because when you make it so large, you're just going to be like, ah, what should I do first?

[00:01:36]
So you make it small, and if you need to, we can make it bigger. That's the approach we're going to take today. I think I share this quote to some degree in every workshop I've ever done. One of my favorite professors in college said, you know, at the end of the day, most of what you're doing is reading from and writing to a database. And I, you know, in my youth, I said, no, like, I'm a programmer, I'm doing complex calculations, making beautiful UIs.

[00:02:02]
I didn't know what programmers did in college. Now I know it's, I won't say depressing. No, no, no, it's exciting, but it's also mundane at the same time. But this is the truth, like what do computers do? Store and read data. Yeah, they manipulate data. Like that's it. Without the data, what's the computer doing? Nothing. There's nothing to compute. That's why data storage is important.

[00:02:34]
But at the same time, a lot of what I read online is people maybe overindexing on the database, when really you just need a few key patterns, and you'll be OK, unless you're getting to a really, really deep domain-specific problem that needs multiple types of databases. Usually just knowing the high-level stuff, the top three databases for a particular problem, you'll be OK. So don't sweat having to know everything, just follow your instincts.

[00:03:00]
We're going to talk about relational databases, non-relational databases, and really, four or five different kinds, and that's all you need to know, really. I know someone on the internet right now is going, well, actually. Yeah, the other thing I know about databases is they all kind of do the same-ish thing. They're all OK at it. It's only when you get to millions and millions and millions of records that the differences become clear.

[00:03:28]
So a real example is everybody says, oh, don't use like MySQL or something. That's old. MySQL is not cool. One of the most popular databases in the world, but it's not cool. I'm like, why? Oh, well, I can use MongoDB. It's a document store, it auto scales out. I'm like, yeah, but do you need to? Like, well, you know, maybe. It's like, well, use the simplest thing you need. So, generally, when it comes to databases, I always start with a relational database because it's the simplest, easiest thing to reason about, and it's well supported.

[00:04:03]
SQL is a really straightforward language. I say that not being able to do a left join still, I still can't explain it. But you can look at it and say like, OK, I understand what's going on. Select star from users. Cool, I know what's happening there. I don't even need to speak database languages. So that's the other thing to keep in mind with data storage. Use the simplest solution, you can always scale up later.
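As a rough illustration of the two queries mentioned here, the following sketch runs a `SELECT *` and a left join against an in-memory SQLite database using Python's standard `sqlite3` module. The `users` and `orders` tables, their columns, and the sample rows are assumptions made up for this example; they are not from the lesson.

```python
import sqlite3

# In-memory SQLite database. Table names, columns, and rows are
# illustrative assumptions, not part of the lesson.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        item TEXT NOT NULL
    );
    INSERT INTO users (id, name) VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders (user_id, item) VALUES (1, 'pizza');
""")

# "Select star from users": every column of every row in users.
all_users = conn.execute("SELECT * FROM users ORDER BY id").fetchall()

# A left join keeps every row from the left table (users), even when
# nothing matches on the right (orders); the missing side comes back as None.
user_orders = conn.execute("""
    SELECT users.name, orders.item
    FROM users
    LEFT JOIN orders ON orders.user_id = users.id
    ORDER BY users.id
""").fetchall()

print(all_users)
print(user_orders)
```

In this sketch Grace has no orders, so she still appears in the left join result, paired with `None` instead of an item.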

[00:04:23]
The thing you have to get right is your schema in the beginning with the relational database because that's really hard to fix. When it comes to thinking about the database, there's a couple of dimensions we want to think about. Again, you got to be in your domain. What's the domain doing? What type of data are we moving around? Is it structured data or is it unstructured data? So structured data would be, you know, I have a, think object-oriented programming.

[00:04:53]
Relational databases line up well with object-oriented programming. That's why they're very, very similar. There's a relation between things in the database, just like there's a relationship between the objects in object-oriented programming. That's structured data. Unstructured data is, hey, I've got a JSON blob I need to store. There might be some overlapping fields, but generally, you know, it's pretty loosey goosey.

[00:05:20]
That's unstructured data. So there's relational for structured, and then there's, you know, non-relational for unstructured or object storage. It really depends on the type of data you want. Then there's persistent versus ephemeral, and that's why we call it data storage. I say database, but it's data storage, because you might have ephemeral data, as in I only need to store it for a little bit of time. I just need to store it on the machine.

[00:05:48]
That's a way of storing data as well. I guarantee you all do this already. You store things in memory, in a variable in memory. You know, that's data storage. It's something you have to think about, especially at a large scale. Or you have persistent, which is, I'm writing to disk, it's going to be there when I need it. Remember the discussion we had yesterday on the to-do app: is it write-heavy or read-heavy?
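The ephemeral versus persistent distinction described above can be sketched in a few lines of Python. This is a minimal illustration, not a recommendation for real storage: the `session_cache` dictionary, the helper names, and the `users.json` file are all hypothetical, and a production system would also need to consider things like flushing to disk and concurrent access.

```python
import json
import os
import tempfile

# Ephemeral storage: a plain variable in memory. It disappears when the
# process exits. The key and value here are made-up examples.
session_cache = {"user:1": {"name": "Ada"}}


def save_persistent(path, data):
    """Write data to disk as JSON so it outlives the process."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f)


def load_persistent(path):
    """Read previously saved JSON data back from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Hypothetical file location inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "users.json")
save_persistent(path, session_cache)

# Clearing the dictionary stands in for a process restart:
# the in-memory copy is gone, the on-disk copy is still there.
session_cache.clear()
restored = load_persistent(path)
print(restored)
```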

[00:06:16]
What did we decide? Read, read, read-heavy, yeah. Most things are going to be more reads than writes. There are exceptions to that. What would one of those be? I was trying to think of that yesterday and I couldn't. Nothing came to me. The app that we build ended up being write-heavy, and this was the thing that got a lot of people, because everyone always assumed read-heavy. Our studio app.

[00:06:45]
No, no, it was what our company builds. It's a lot of user-entered content. Like the bulk of the app is customers uploading content and doing stuff with content, so it ends up being quite write-heavy, but there is a cycle where they do a lot of writes, and then the project is considered launched, and then it's read-heavy from there. So we have a lot of cycles, but we overindexed on thinking we were read-heavy when people were, we used to be SQL, so they were like, let's just use SQL, let's do this, and then for other reasons, we ended up switching to NoSQL with Mongo.

[00:07:21]
And when we were doing that architecture, people assumed we were read-heavy, so we pulled the metrics, we're like, no, actually, we're write-heavy. Interesting. Metrics collection would be write-heavy or like a clicker, yeah. Those, that's what I go to when it's write-heavy. So I also, I'm like, what is a write-heavy? But logging or like any sort of eventing, you said metric collection, yeah, those are going to be consistent.

[00:07:46]
You're always writing. Hopefully you don't have to read too much. Sometimes it means something went wrong and you're parsing your logs, but there are databases optimized for being read-heavy or write-heavy. Have you ever had to split your database based on concerns like where you have a read-heavy concern and a write-heavy concern and they're actually handled separately? We're going to talk about that.

[00:08:13]
Yeah, have I had to do it? No, I don't mess with databases until I have to. But my wife is a data scientist, so bread and butter, just. I don't know, like, I think she sees like tables in her head and she's like, oh, it's just, it's a massive query, and then there's a subquery and I'm like, what's a subquery? It's like, uh. I know enough to be dangerous, but not enough to actually have to maintain my own database.

[00:08:49]
What's an example of something, an application or service that's extremely read-heavy? Like it's, I don't know, not even 100 to 1, say like 1000 to 1. Data analysis, any sort of like data warehouse where you're doing huge queries on like thousands and thousands of records. Probably IoT scenarios. Reads, yeah, I was going to say something like npm where, like the vast majority of people are downloading stuff, but like hardly anybody's uploading.

[00:09:17]
That's a great example. I didn't even think of that one. Yeah, npm, the web in general. Yeah, yeah. And it's funny, it turns out most things are read-heavy. The vast majority of things. Now that ratio depends. You can have a 10 to 1, which is actually considered pretty write-heavy still, or you have something like 100 to 1 reads to writes, or 1000 to 1 like, like a social media app is going to be all reads.
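One way to check a read-heavy or write-heavy assumption against real traffic, as in the "we pulled the metrics" story earlier, is to count operations and compute the ratio. The sketch below assumes a hypothetical operation log where each entry is the string "read" or "write"; the function name and the sample numbers are made up for illustration.

```python
from collections import Counter


def read_write_ratio(operations):
    """Return reads per write for an iterable of "read"/"write" strings."""
    counts = Counter(operations)
    reads, writes = counts["read"], counts["write"]
    if writes == 0:
        # No writes at all: treat as infinitely read-heavy (or 0.0 if empty).
        return float("inf") if reads else 0.0
    return reads / writes


# Hypothetical traffic samples, not measurements from any real system.
package_registry = ["read"] * 1000 + ["write"] * 10   # mostly downloads
logging_pipeline = ["write"] * 50 + ["read"] * 5      # mostly appends

print(read_write_ratio(package_registry))
print(read_write_ratio(logging_pipeline))
```

A ratio well above 1 points toward a read-heavy workload and a ratio below 1 toward a write-heavy one; where exactly to draw the line depends on the domain.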

[00:09:51]
People are just scrolling. So, again, it's the domain. You have to think about that when you're going into the problem and think, what's going to be the bottleneck when I get to a million users? And then we have consistency versus availability. Anybody remind me, there's some sort of, what's it start with? The theorem. CAP theorem. CAP theorem, yeah. Remind me what CAP theorem is. Consistency, availability, partition tolerance. Trade-offs, trade-offs, yeah, can't have both.

[00:10:34]
We can't avoid partitions or network failures, they're going to happen. So we have to decide what's more important. And we talked about that. We did an exercise on consistency and availability, where we asked, hey, what's an example of something that needs to be consistent? Transactional. Yeah, financial data, health. Potentially health, yeah. Yeah. And then something that should be more available.

[00:11:03]
What's an example? Social media. Or, what's an example people use? Air traffic control. I was going to say, needs to be available. And, you know, these aren't black and white things. It's such a fine margin in the debate over whether something should be consistent or available, but, you know, just show your choice. With the medical to-do app, we definitely had a disagreement on whether it should be consistent or available, you know, and it's OK to have an opinion, but just know, what are we optimizing for?

[00:11:40]
And again, these aren't binary, it's a spectrum. It's always a spectrum of things. So when it comes to data storage, the two main concepts we think about are relational and non-relational. Relational, there's some sort of relationship between the objects. Let's say we want to expand our pizza shop, even bigger, super pizza. But we don't just want to sell pizza, we sell other food too. We'd still use a relational database there, because all the food has a relation to each other and probably the ingredients overlap at some point too.
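The pizza shop example could be modeled roughly as follows. This is a sketch under assumptions: the table names (`menu_items`, `ingredients`, `menu_item_ingredients`), the sample food, and the use of SQLite through Python's `sqlite3` module are all invented for illustration and are not from the lesson. A join table captures the idea that ingredients overlap between different menu items.

```python
import sqlite3

# Hypothetical schema for a shop that sells pizza and other food.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE menu_items (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE ingredients (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

    -- Join table: one row per (menu item, ingredient) pair, so an
    -- ingredient can belong to many items and an item to many ingredients.
    CREATE TABLE menu_item_ingredients (
        menu_item_id INTEGER NOT NULL REFERENCES menu_items(id),
        ingredient_id INTEGER NOT NULL REFERENCES ingredients(id),
        PRIMARY KEY (menu_item_id, ingredient_id)
    );

    INSERT INTO menu_items VALUES (1, 'Margherita pizza'), (2, 'Caprese salad');
    INSERT INTO ingredients VALUES (1, 'mozzarella'), (2, 'tomato'), (3, 'dough');
    INSERT INTO menu_item_ingredients VALUES (1, 1), (1, 2), (1, 3), (2, 1), (2, 2);
""")

# Ingredients used by more than one menu item, i.e. the overlap.
shared = conn.execute("""
    SELECT i.name
    FROM ingredients AS i
    JOIN menu_item_ingredients AS mi ON mi.ingredient_id = i.id
    GROUP BY i.id, i.name
    HAVING COUNT(*) > 1
    ORDER BY i.name
""").fetchall()

print(shared)
```

Here the pizza and the salad both use mozzarella and tomato, so those two rows come back, while dough (used only by the pizza) does not.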
