Backend System Design

System Quality

Jem Young
Netflix

Lesson Description

The "System Quality" Lesson is part of the full, Backend System Design course featured in this preview video. Here's what you'd learn in this lesson:

Jem introduces system quality, emphasizing reliability, observability, security, scalability, adaptability, and performance. He discusses trade-offs and the importance of considering user behavior, seasonality, and cost when meeting non-functional requirements.


Transcript from the "System Quality" Lesson

[00:00:00]
>> Jem Young: All right, let's talk about quality, because we're still on the non-functional side, and you're saying, Jem, we've barely diagrammed anything today. We will. We're getting there, we're building to it. But now we'll know why, and we'll know that it's correct when we write it out and that it handles all the edge cases, because we've thought about this critically. So, system quality. Reliability, we hit on that earlier, and that's a lot of what, you know, the CAP theorem was talking about, the reliability of it.

[00:00:28]
Is it consistent or is it available? So reliability is number one. But also, when you think of a quality system, is it observable? When something goes wrong, and it will, how quickly can you triage that? Someone in class, uh, Kayla, was talking earlier about tracking down an issue that the cache was causing. It took days to track it down. Uh, how much logging do you have? You only know you're missing logging when something goes wrong and you have no insight into what's happening and why.

[00:01:01]
That's a terrible feeling. So observability of the system is an important indicator of quality. Security. Who cares if you build this beautiful system that handles, you know, a million RPS, it's amazing, has great uptime, if it's not secure? If anyone can just break in, what's the point? If anybody can become an administrator, gain unauthorized access, and muck around in your database, what's the point?

[00:01:31]
So security is also critical. I want to rank these somehow, but to me, they're all very important. You can't have a good system unless it's hitting all of these points. Is there a trade-off? Yes, yes, there's a trade-off with all of these. You can't get to 100%. You can have a system that's extremely secure and observable, but what's the cost of doing that? It's a pain in the butt to get around and do anything.

[00:01:59]
You ever worked in a place like that, where it's like, oh, I need to get access to this? Well, you've got to submit a form to this person, and they don't know you, and they're going to say, no, you missed this field, and you've got to go back. There are companies like that, but they need to be secure. It's a lot of steps. But we don't want to build every system at maximum, 100% security. It's very expensive. So we have to make a trade-off somewhere.

[00:02:24]
How scalable is the system? If you build a system that only works for one use case, but it can't expand to others, what's the point? That's what we're trying to do. We're trying to build scalable systems, and with scalability comes adaptability. If I want to add another feature, a very well-designed system is going to enable me to do that without having to change a lot of infrastructure. I shared that example earlier of, you know, the early days of Netflix, when we made the assumption that there were only three account types, and it was very expensive to change down the road.

[00:02:57]
Not as adaptable there. We didn't think that far ahead. Nobody's fault. This is nobody's fault, uh, but it happens all the time, and that's why we have these migrations that take years. Anybody been part of a migration that takes a year or more? Why, why does it take so long? Because we made assumptions early on that turned out to be incorrect. And that happens, but a good system is adaptable. It means it can take on different requirements without changing that much.

[00:03:29]
And of course, who cares about all these other things if it's not performant, if it's slow? If the latency's high, if the throughput's low, who cares? Well, maybe you do, maybe you don't. But we have to consider performance too. These are all indicators of a system, of the quality of a system. You're saying, oh man, that's a lot to think about when I'm designing a system. Yeah. Now we see why it's not just a straight line between the prompt and some diagram.

[00:03:56]
What are the things you're trading off here? What are you ultimately trying to do? What's the most important thing? Observability, the ability to know what's happening in your system. What are the metrics? Do you have alerting set up? Or do you find out something's wrong because your customers are blowing up your phone or your call center? Yeah, it's pretty common too. Turns out we don't have any metrics in production.
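To make that concrete, here is a minimal sketch of the kind of instrumentation being described: every request emits a structured log line and a latency metric that alerting can watch. The names here (handleRequest, metrics, the metric strings) are hypothetical, not any particular library.

```typescript
// Minimal observability sketch (hypothetical names, not a specific library):
// structured logs for triage, counters and timings for dashboards and alerts.

type Metrics = {
  increment(name: string): void;
  timing(name: string, ms: number): void;
};

// Stand-in sink: in a real system these would go to your metrics backend.
const metrics: Metrics = {
  increment: (name) => console.log(JSON.stringify({ metric: name, count: 1 })),
  timing: (name, ms) => console.log(JSON.stringify({ metric: name, ms })),
};

async function handleRequest(requestId: string, work: () => Promise<void>): Promise<void> {
  const start = Date.now();
  try {
    await work();
    metrics.increment("requests.success");
  } catch (err) {
    // Structured, machine-parsable log line, so you can find it days later.
    console.error(JSON.stringify({ level: "error", requestId, message: String(err) }));
    metrics.increment("requests.error");
    throw err;
  } finally {
    // This timing feeds the dashboard and the alert (say, "p99 > 500ms for 5 minutes"),
    // so you hear about a problem before the call center does.
    metrics.timing("requests.latency_ms", Date.now() - start);
  }
}
```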

[00:04:23]
Yeah, that's, uh, you know, if you're an experienced engineer, the people who are nodding have been through it before, like, yeah, we probably should have tests, and then some metrics. And security. Again, say I'm building a bank. Uh, it doesn't matter how slick the app is, if it's not secure, it doesn't matter at all. This is a very famous, very famous issue, where people prioritize functionality over security, and later you find out you got hacked.

[00:04:58]
What was the point? What was the point of that? Uh, I'm going to call out crypto as famously insecure, mainly because, uh, these are totally my own thoughts, but it's OK, we've got time. The, you know, the challenge with, hey, we're going to throw away the old banking system and make a new one, a new monetary system. Great idea in principle, but we're forgetting this concept called Chesterton's fence.

[00:05:22]
You ever heard of that concept? No? I think I said it in a different workshop, but Chesterton's fence is this idea of, um, well, the story is, there's a person walking down the road, and they see a gate in the middle of the road. They think, this is dumb. Why is this gate here? And they open the gate. It turns out there's a really mean bull on the other side of the gate that they didn't see. They just assumed that the person before them didn't know what they were doing.

[00:05:45]
They're like, that was dumb. Uh, this is very common in security. People are like, oh, this is dumb, why did they do that? Why do banks have such strict regulatory compliance and logging and auditing and all that? Uh, and this is what a lot of the crypto industry has learned. There's a reason why things are slow, and a reason why it takes so many hops to get through things, and you learn that when you're losing hundreds of millions of dollars.

[00:06:14]
I'm not knocking crypto, it's just, you know, Chesterton's fence. Uh, assume the people who came before you were smart and knew what they were doing. It's a fairly safe assumption. Scalability. We think about scalability in terms of how do we scale up. We're always thinking up. Yeah. What's the problem with only thinking up? You'll hit a max capacity at a certain point on a particular system. And there is a scale-out option.

[00:06:45]
Yeah, we talked about, we touched on this earlier a little bit. Yeah, we haven't talked about cost at all, but that's the thing, it's not cheap to build these systems, you know, millions and millions of dollars. So, do we need a system that's running at max capacity, that can handle lots of traffic at any given time of the day? No, of course not. That's such a waste of resources. That's why we do cloud computing, so we have the ability to scale up or scale down, and the system should be able to do that pretty effortlessly.
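As a rough sketch of that scale-up-and-down idea (my illustration, with made-up thresholds, not how any particular cloud or Netflix does it), you size the fleet to a target utilization band instead of provisioning for peak all year:

```typescript
// Target-tracking autoscaling sketch: keep average utilization near a target,
// within minimum and maximum guardrails. All numbers are hypothetical.

interface ClusterState {
  instances: number;
  avgCpuUtilization: number; // 0.0 to 1.0 across the fleet
}

function desiredInstances(state: ClusterState, targetUtilization = 0.6): number {
  const raw = Math.ceil(state.instances * (state.avgCpuUtilization / targetUtilization));
  const min = 2;   // never scale an always-on service to zero
  const max = 500; // cost guardrail
  return Math.min(max, Math.max(min, raw));
}

// Busy period: the fleet grows.
console.log(desiredInstances({ instances: 40, avgCpuUtilization: 0.9 })); // 60
// Quiet period: the fleet shrinks and stops burning money.
console.log(desiredInstances({ instances: 40, avgCpuUtilization: 0.2 })); // 14
```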

[00:07:15]
That's something to ask when we think about non-functional requirements. Um, sometimes we call that seasonality. Is your app busier at certain times of the year? Amazon would be a good example, or any sort of e-commerce: Black Friday, you've got to scale up. Do you need that same scale come January? No. You've got to be able to scale down too. So, scalability and adaptability. User behaviors change.

[00:07:44]
What you build today, you might find out with your MVP that it isn't actually the product-market fit, and you need to do something different. How adaptable is your system? This one is the trickiest, uh, in my opinion, because at some point you have to make assumptions. We learned that with the CAP theorem exercise. You have to make an assumption, go with it, and design your system around that. But if you do things well, you're not bitten too hard by those assumptions.

[00:08:17]
And a word on performance. Latency, how quickly does the system respond? Important sometimes, other times not as important. Um, with transactions, the consistency is important. The latency, it depends. If I'm doing a bank transfer, I don't mind if it takes a day or two. So, not as important there. But if I'm buying something, I want that to be performant and consistent, because someone's going to drop off, hey, this shopping cart's spinning over and over again.

[00:08:46]
I'm out. That's a real thing, people are going to bounce. I think there was a study on Amazon's, like, ninety-eighth or ninety-ninth percentile of, uh, how long requests took. That was because those users had a bunch of stuff in their carts, so they were actually losing a lot of money for every percentile that went up, or something like that. Yeah, I think it was every 100 milliseconds or something cost some large amount of money, yeah.
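For anyone fuzzy on what a ninety-eighth or ninety-ninth percentile latency is, here is a quick sketch with made-up numbers: sort the request durations and read off the value below which 98% or 99% of requests fall. Real systems use histograms rather than sorting every sample, but the idea is the same.

```typescript
// Percentile latency sketch (made-up sample data): the p99 is dominated by the
// slow tail, which is exactly where the heavy, high-value requests tend to live.

function percentile(samplesMs: number[], p: number): number {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const index = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[index];
}

const durations = [80, 95, 102, 110, 120, 130, 150, 400, 900, 2500];

console.log(percentile(durations, 50)); // 120  (the typical request)
console.log(percentile(durations, 99)); // 2500 (the slow tail)
```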

[00:09:17]
Yeah, it matters sometimes. Other times, yeah, you can get away with it, you have a pretty big margin. It depends. Everything's a trade-off. And throughput, who cares if your system's fast if it gets bogged down really quickly when you start sending a lot of data? And what's a lot of data? Something like video streaming, a tremendous amount of data. Uh, Netflix famously uses about 30% of the bandwidth of the internet for streaming video, or used to, it's a little lower now.

[00:09:50]
So is latency more important there, or the throughput? I know, both, but the throughput. Because if it's going to get bogged down, it's going to bog down for everybody. It's a lot of data you have to handle. Um, other things taking in a lot of data, um, say mapping software, real-time mapping, uh, autonomous cars that are taking in, you know, millions of data points, the throughput's really important there, not the latency as much.

Some AI models, AI models, yeah. So even when we think about performance, oh, my system's fast, is that the most important thing? It depends. It's not straightforward. We generally have to hit the best we can on all of these, but it's going to be a trade-off somewhere.
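To put some rough numbers on the latency-versus-throughput distinction from the streaming example (entirely made-up figures): a video stream can tolerate a slow start, but it needs sustained throughput for the whole movie, and across millions of viewers that volume is what shapes the system.

```typescript
// Back-of-the-envelope: latency is how long one request takes; throughput is how
// much data the system moves per second. All numbers below are hypothetical.

const startupLatencyMs = 1500; // a viewer will tolerate a short wait for playback to start
const bitrateMbps = 16;        // assumed bitrate for one high-quality stream
const movieMinutes = 120;

// Data moved for one viewer over a whole movie, in gigabytes:
// 16 Mbps = 2 MB/s, times 7,200 seconds = 14,400 MB.
const totalGB = (bitrateMbps / 8) * 60 * movieMinutes / 1000;
console.log(startupLatencyMs, "ms to first frame is fine");
console.log(totalGB.toFixed(1), "GB per viewer"); // 14.4 GB

// Across millions of concurrent viewers, sustained throughput, not the
// 1.5-second startup latency, is the number that dominates the design.
const concurrentViewers = 5_000_000; // made-up figure
console.log((bitrateMbps * concurrentViewers / 1e6).toFixed(0), "Tbps sustained"); // 80 Tbps
```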
