Lesson Description
The "CAP Theorem" Lesson is part of the full, Backend System Design course featured in this preview video. Here's what you'd learn in this lesson:
Jem discusses the CAP theorem, explaining consistency vs. availability trade-offs in distributed systems. Using the example of his old car, he highlights reliability and covers key metrics like availability, resiliency, and consistency in system design.
Transcript from the "CAP Theorem" Lesson
[00:00:00]
>> Jem Young: We're just going to pause on functional requirements to get to more of the non-functional side, and then we'll get to some modeling. Yeah, we're here: CAP, the CAP theorem. What is the CAP theorem? Anybody? Anybody know? It's OK if you don't; that's why you're here.
>> Student: It's the trade-offs between the consistency of the data and the availability, and if you need data that is consistent to, like, an extreme degree,
[00:00:29]
you know, where it's almost synchronous in the way that it's written, then the trade-off is that you have less availability that you can support through replication and partitions.
>> Jem Young: Yeah, pretty good. Actually, very good. To sum it up, the CAP theorem is about distributed systems and the trade-offs you make. And that's important when it comes to reliability. Take my old car.
[00:00:58]
A beautiful 1986 Toyota Camry. It didn't look good. The window, you know, you could only roll it down so far and then it just kept going, because the track was broken. The radio didn't work. What else didn't work? So many things. The locks on the car didn't work at all. Hopefully no one... actually, people did break in and take all my stuff, even though there was nothing in there.
[00:01:20]
It was slow. But you know what it was? It was reliable. And that was OK. Back when I was in college, that was OK. All I wanted was something to get me from point A to point B. I didn't care about the other features. And that's what we think of when we think of a good system: it has to be reliable. It doesn't matter how fast it is, how slow it is, all these other things; it has to be reliable.
[00:01:43]
Reliability is the ability of a system to function correctly over time. And over time is the keyword. Anybody can make a system that works for, I don't know, an hour, two hours, a day, a week. But over a long period of time, when you have millions of users, even 90% uptime, in terms of reliability, means a lot of affected people. If you're, say, Google, and the homepage is down, 90% is not good enough.
[00:02:12]
So reliability is the key metric we care about when we think about designing systems. We break reliability down into a few main areas. Availability: how available is the system? That's the proportion of time the system is operational and accessible. There's a real formula for it: availability is uptime divided by total time, that is, uptime over uptime plus downtime. It's a real metric, how available is the system, and you see it calculated in terms of percentages.
[00:02:49]
Four nines, five nines, etc. That's what people are talking about: the availability. Resiliency: how well does the system handle failures? We'll talk about this in a second, but failures will happen. When you're talking about millions and millions of events, anything that can happen will happen. Even something with a 0.01% chance, when you're talking about a million or a billion events, is still a very large number; 0.01% of a billion is still 100,000 failures.
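As a rough illustration (not part of the lesson), here is a small Python sketch of the availability formula and of what each extra "nine" means in allowed downtime per year; the function names and example numbers are made up for demonstration:

```python
# Availability = uptime / (uptime + downtime), usually quoted in "nines".
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # ignoring leap years

def availability(uptime_seconds: float, downtime_seconds: float) -> float:
    """Proportion of time the system was operational and accessible."""
    return uptime_seconds / (uptime_seconds + downtime_seconds)

def allowed_downtime_per_year(nines: int) -> float:
    """Seconds of downtime per year allowed by an N-nines target (e.g. 4 -> 99.99%)."""
    target = 1 - 10 ** (-nines)
    return SECONDS_PER_YEAR * (1 - target)

if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        minutes = allowed_downtime_per_year(n) / 60
        print(f"{n} nines -> about {minutes:.1f} minutes of downtime per year")
    # Even a 0.01% failure rate over a billion events is still a lot of failures:
    print(f"{0.0001 * 1_000_000_000:,.0f} failures")
```

Four nines works out to roughly 52 minutes of downtime a year, and five nines to roughly 5 minutes, which is why each additional nine gets much harder to achieve.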
[00:03:21]
So how well do you recover from failure? We accept that failures happen. Then, how consistent is the data? Why would consistency matter? Think of one of the apps we just designed. It might depend, because you have systems that are eventually consistent, so there's a small window of time where two users might see slightly different things. But where? Think of a broad use case where consistency is absolutely critical.
[00:03:55]
>> Student: Transactions, for, like, a bank.
>> Jem Young: Yeah. What if I buy something online and one database didn't register that I bought it, and charged my credit card again? That's a huge deal. You can't do that. For our banking app, consistency is the most important thing. Because, you know, money's all ephemeral, it's made up. So say I'm sending Kayla $100, and you're like, "I didn't get it."
[00:04:19]
It shows as sent on my app, and I'm like, I'm going to send it again. And now I'm out $200 because the database wasn't consistent. But if I'm building, say, I don't know, a party planning app or something like that, consistency is probably not as paramount there as availability, so that people can look at it. Analytics is the place where I see this sacrificed most often, because real-time analytics data at large loads gets expensive, but if it's five minutes behind, that's not a big deal.
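To make that consistency window concrete, here is a toy Python sketch (my own illustration, not from the lesson; the class and method names are hypothetical) of a primary store with a lagging read replica, which is how the "I sent it, but your app says you didn't get it" situation can arise:

```python
from dataclasses import dataclass, field

@dataclass
class ReplicatedBalances:
    """Toy model: writes go to the primary; the replica only catches up when sync() runs."""
    primary: dict = field(default_factory=dict)
    replica: dict = field(default_factory=dict)

    def write(self, account: str, amount: int) -> None:
        self.primary[account] = self.primary.get(account, 0) + amount

    def read_replica(self, account: str) -> int:
        # Eventually consistent read: may miss recent writes.
        return self.replica.get(account, 0)

    def read_primary(self, account: str) -> int:
        # Strongly consistent read, at the cost of loading the primary.
        return self.primary.get(account, 0)

    def sync(self) -> None:
        # Replication catching up; in a real system this lags by some window.
        self.replica = dict(self.primary)

db = ReplicatedBalances()
db.write("kayla", 100)           # I send Kayla $100
print(db.read_replica("kayla"))  # 0   -> her app reads a stale replica: "I didn't get it"
print(db.read_primary("kayla"))  # 100 -> the primary already has the write
db.sync()
print(db.read_replica("kayla"))  # 100 -> after replication catches up
```

For a banking workflow you would read from the primary (or otherwise enforce strong consistency); for something like analytics, reading a slightly stale replica is usually an acceptable trade.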
[00:04:50]
Yeah, exactly. What's the workflow we're designing for? Does it need to be real time? Some things do, a lot of things don't, but that matters. And that's the CAP theorem as a whole: consistency, availability, and what we call partition tolerance; I like to think of it as errors. This was proposed back in the year 2000 by Eric Brewer, so way before cloud computing was really a thing. Back in 2000, I don't know if anybody remembers those days, the internet was mainly, I think, airline tickets, and what was the dominant one? Yahoo.
[00:05:36]
I don't think Google was as big back then. Oh, GeoCities, was that even around in 2000? I think so, late 90s, early 2000s, like Lycos. Anybody remember that? I'm dating myself terribly now. But this was ahead-of-its-time thinking, because distributed computing wasn't really as much of a thing yet, and it says, in essence: a distributed system can only guarantee two out of three of these at the same time.
[00:06:10]
Trade-offs, right? That's a trade-off. Why is it two out of three? Which two of the three are we actually thinking about here? Consistency and availability. Yeah, I'll skip ahead, because this is cheating a little. Why? Because stuff happens. Errors are going to happen. You can never rule out partitions; you can't say, "Oh, my system's never going to have any sort of failure."
[00:06:40]
That's what partition tolerance is about. So we take that one as a given. The system is either going to be consistent and tolerate partitions, or it's going to be available and tolerate partitions, but you can't do all three at the same time. And that's really what the CAP theorem is about: understanding the trade-offs, understanding what's important. You can get close on most of these, but you can't get a 100% guarantee.
[00:07:09]
So we can only pick two. And the trade-off is that if something is, say, consistent and available, it's only going to work without network issues, and what sort of system would that be? Yeah, a single server; that's pretty much the only way you're going to get anywhere close to never having a partition. If there are no partitions, there are no network failures. So we cross that one out, because in a distributed system that's not realistic unless you
[00:07:34]
have a crazy supercomputer or something, and even then the parts are spread around, so it's not realistic. So really you're left with something that's consistent and can handle errors, or something that's available and can handle errors, but you can't guarantee both.
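One common place this consistency-versus-availability choice shows up in practice is quorum-based replication (Dynamo-style R/W quorums). The following is my own illustrative Python sketch, not something covered in the lesson, with made-up names; it only classifies a configuration rather than implementing replication:

```python
def quorum_profile(n: int, w: int, r: int) -> str:
    """Classify a replica configuration: N replicas, W write acks, R read acks.

    R + W > N  -> every read overlaps the latest acknowledged write (consistency-leaning),
                  but fewer replicas can be unreachable before operations start failing.
    R + W <= N -> reads and writes can succeed on disjoint replicas (availability-leaning),
                  so the system stays up through partitions but may serve stale data.
    """
    if r + w > n:
        return "consistency-leaning: reads see the latest acknowledged write"
    return "availability-leaning: reads may be stale during a partition"

# Example configurations for a 3-replica system:
print(quorum_profile(n=3, w=2, r=2))  # consistency-leaning
print(quorum_profile(n=3, w=1, r=1))  # availability-leaning
```

Tuning these knobs is exactly the CAP-style trade-off under partitions: tighter quorums buy consistency at the cost of availability, looser ones the reverse.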