Lesson Description
The "Autosave System Design" Lesson is part of the full, Backend System Design course featured in this preview video. Here's what you'd learn in this lesson:
Jem introduces autosave for to-dos, covering trade-offs like save frequency, read/write ratios, and cross-device syncing. He also highlights the added complexity around permissions, metadata, and collaborative editing.
Transcript from the "Autosave System Design" Lesson
[00:00:00]
>> Jem Young: All right, let's try something I haven't practiced, but I want to throw it out. Let's try this idea, let me at least scroll over. What if we added a new feature? Let's say we want to auto-save people's to-dos as they're typing. And by the way, I haven't planned this out, so I don't know what's going to happen. But let's talk it out. If we want to implement auto-saving for to-dos, before we write anything down, let's talk about what we think will change.
[00:00:31]
Like, what do we need to change? Much higher write rates. Much higher rates, yeah, it dramatically changes the read/write ratio of what we're doing now. Anything else? Your reads might also have a problem unless you look at something like optimistic updates, right? Yeah. How often do we save? That's a big difference in traffic, especially at our volume. Saving every minute versus every character is a very big difference in the size of what we're reading and writing.
[00:01:08]
We could hopefully debounce the input though, right, where we're not saving on every character. Yeah, debounce, a classic interview question as well. So one thing that comes to mind is, depending on how sophisticated the application is, you would need to make sure the auto-saves are available across potentially different devices. Yeah, we can't assume there's just one client. So we have to pick an auto-save interval that makes sense for the common use case: I'm on my desktop, I'm leaving, I switch to my phone.
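For reference, a minimal TypeScript sketch of the debounce mentioned here: the save handler only fires once the user has stopped typing for a set window. The handler and wait time are illustrative, not from the lesson.

```typescript
// Minimal debounce: reset the timer on every call, so the wrapped
// function only runs after `waitMs` of inactivity.
function debounce<Args extends unknown[]>(
  fn: (...args: Args) => void,
  waitMs: number
): (...args: Args) => void {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: Args) => {
    if (timer !== undefined) clearTimeout(timer);
    timer = setTimeout(() => fn(...args), waitMs);
  };
}

// Illustrative usage: autosave fires at most once per pause in typing,
// instead of on every character.
const autosaveDraft = debounce((text: string) => {
  console.log("autosaving:", text);
}, 1000);
```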
[00:01:45]
How in sync do we want that to be? I'd say in most cases, people aren't really biased towards that level of synchronicity between the two, but if we want to design something really performant, we could do that. What's the trade-off we're making there? We have a lot more writes to do. I'd say another thing we have to consider is, do we want to auto-save everything? Knowing that someone's typing a sentence, and say we're saving every character, is that worth saving?
[00:02:13]
Like, where's the benefit of doing that? More like a debounce then, like some kind of latency check or something. Yeah. Oh, could we auto-save, like we were talking about earlier, to either local storage or some sort of SQLite on their device, and then periodically, you know, upload or send those, every, I don't know, 1,000 characters, every 1,000 local saves, we send it back to our base? Yeah. I like that.
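A rough sketch of that "save locally, sync in batches" idea, assuming a browser client with localStorage; the endpoint, storage key, and batch threshold are hypothetical:

```typescript
// Hypothetical local buffer: every edit is appended locally, and only
// every FLUSH_EVERY saves do we ship the batch back to the server.
const QUEUE_KEY = "todo-autosave-queue"; // hypothetical storage key
const FLUSH_EVERY = 1000;                // e.g. every 1,000 local saves

interface LocalEdit {
  taskId: string;
  text: string;
  at: number; // epoch ms
}

function saveLocally(edit: LocalEdit): void {
  const queue: LocalEdit[] = JSON.parse(localStorage.getItem(QUEUE_KEY) ?? "[]");
  queue.push(edit);
  localStorage.setItem(QUEUE_KEY, JSON.stringify(queue));
  if (queue.length >= FLUSH_EVERY) void flushToServer(queue);
}

async function flushToServer(queue: LocalEdit[]): Promise<void> {
  // One network call carries many edits back to our base.
  await fetch("/api/autosave/batch", { // hypothetical endpoint
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(queue),
  });
  localStorage.removeItem(QUEUE_KEY); // sketch only: ignores edits made mid-flush
}
```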
[00:02:44]
Again, there's many ways to do this, so it's whatever we say it is. But let's talk it out. We add a new text box. What does Daniel Tiger say? When we do something new, let's talk about what we'll do. So what are we going to do here? What is necessary to implement auto-saving for to-dos? What do we need to figure out? How often you want to save? How often, yeah. How often to save. OK, what else? How to handle conflicts.
[00:03:38]
How to handle, or say, a strategy to handle conflicts. What do you mean by conflicts? Um, if it's only a single user and you allow multiple devices, it would be conflicts between your own account's data across your own devices, but if you allow multiple users to collaborate, then you also have to deal with that. Which is effectively similar, multiple clients. Yeah. It's something to think about when we talk about adding features to something.
[00:04:13]
Auto-saving, even in this context with just one user and one account, is an easy concept. Then we had this idea of multiple people editing a document at the same time. Much harder problem, much harder problem. But if you look at that on paper, you're like, oh, how hard can it be? You're already doing all the work, right? We've got this whole system here, it's very scalable. Why wouldn't it work? But with multiple users, now we're talking about, oh, now we need to do permissions, because every document can have permissions attached to it.
[00:04:43]
When it's just one user, you didn't have to worry about that. The permissions were baked into the database. Now we have multiple users. Some might have permission to edit, some might have permission just to view, some might have permission to comment, like Google Docs. So we need to add an entirely new database, which is just permissions. And then for the lists, now we need metadata attached to them, because that metadata has those permissions attached: who can share what, who last updated, and that's all separate from the actual content itself.
[00:05:18]
So, something to think about when we talk about adding new features. Now we're in systems thinking; we think about not just a front-end perspective on how to implement it, but how would that work on the back-end? How would we save all of that? But let's say for now, let's be kind to ourselves and make it easy. We're only auto-saving, only one user. One client or multiple clients? Say multiple clients.
[00:05:44]
So, to keep it fresh: how often to save, strategy to handle conflicts. Is there anything else we're missing? I feel like we probably need to figure out how we're going to pull information from the servers, especially since we're supporting multiple devices. We need to make sure that if device one makes a change and device two just kind of has a screen open, it needs to be able to detect, OK, some new changes happened in the past couple of minutes.
[00:06:14]
Yeah, that's a tricky one too, because now we've introduced some sort of polling mechanism, which requires a different server. That's a UI/UX problem, but there's a back-end element to that, yeah. Say, notifying clients of change. We might keep that out of scope for now, just to keep it simpler, but it's good to recognize. We might need to ask what we're going to save, because that might have repercussions for our database.
[00:06:56]
Like, some things might change more than others; updating a task might happen more frequently than changing the name of a list. Yeah. What do we save? Why does that matter in our system design? The cache. Well, caching, but it also influences the size of the payloads, yeah. It changes our volume and throughput a lot. If we're auto-saving the entire list every time, that's a much bigger volume than if we just save a task.
[00:07:28]
Or if we want to get fancy, though it's going to be more complicated, it's a trade-off, we can just ask: what's the difference between this change and the last change? We calculate that and we only save the diff. But we need our own algorithms to do that. It's going to be the smallest and fastest way, though. That would also impact, depending on your database design, what database objects you might be updating.
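As a sketch of the diff idea, assuming a task is a small flat object, a shallow field-by-field comparison is enough; the Task shape here is hypothetical:

```typescript
// Hypothetical task shape for illustration.
interface Task {
  id: string;
  title: string;
  done: boolean;
}

// Shallow diff: compare the last-saved snapshot to the current state and
// keep only the fields that changed, so the autosave payload stays small.
function diffTask(prev: Task, next: Task): Partial<Task> {
  const patch: Record<string, unknown> = {};
  for (const key of Object.keys(next) as (keyof Task)[]) {
    if (prev[key] !== next[key]) patch[key] = next[key];
  }
  // e.g. { title: "Buy oat milk" } instead of re-sending the whole list
  return patch as Partial<Task>;
}
```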
[00:07:56]
Like whether there's even a lot of conflict versus smaller things going through. Yeah, it's a smaller payload going through if you have a diff. And since we're using Postgres, we would just be doing UPDATE statements, right? Yeah, just like that. It wouldn't be like saving each auto-save record. It might be. We could. It depends on how we want to structure it. I know, system design is sometimes not satisfying.
[00:08:33]
It depends. What do we want it to be? OK, so now we have our, we'll call these the other questions to ask when determining our functional requirements. I tricked you all. We're going to do some functional requirements, but we already asked the questions, so now we can answer them pretty easily. How often do we want to save? Pick one, it doesn't matter. Every minute. Every minute, that seems reasonable.
[00:09:13]
How do we want to handle conflicts? Last one wins. Yeah, easy, easy. Notifying clients of change. We'll say, I don't know if we'll implement it, but how would we do that? More frequently than the writes, maybe. Twice as much. I think I'd be leaning towards long polling. Why long polling and not server-sent events? Mainly just because of the cost of keeping the connection alive. Especially if you can long poll every couple of minutes, I think that's potentially going to be less stress.
[00:09:43]
We could. I'd also say server-sent events might be easier, because then it's just pushing to the client and eventually it'll get there, but we can say long polling or server-sent events. Long polling is essentially just doing the same thing a client would normally do: it opens a TCP connection, makes a request, and that's it, right? Yeah, the client's just keeping a port open for listening, and you can keep a port open for it, yeah.
[00:10:06]
Well, obviously you can't just push a server-sent event to any client, because that would be weird. You normally don't; that's a weird security vulnerability. Like, I'm on a random website and it can just push information to me without me asking? You can get into a weird state there. So you do have to do a handshake when setting up a server-sent event, accepting, hey, I'm allowing you to do this.
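That handshake is just the client opting in by opening the connection itself. Here's a minimal browser-side sketch using the standard EventSource API, with a hypothetical endpoint:

```typescript
// The client initiates the stream; the server can then push updates over
// this long-lived HTTP response. The URL is hypothetical.
const updates = new EventSource("/api/todos/updates");

updates.onmessage = (event: MessageEvent) => {
  // e.g. another device saved a change to one of our lists
  const change = JSON.parse(event.data);
  console.log("remote change:", change);
};

updates.onerror = () => {
  // The browser reconnects automatically; just log the drop.
  console.warn("SSE connection lost, retrying...");
};
```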
[00:10:32]
If you go with the eventing approach, you have less to deal with around the timing of your reads versus writes, inconsistent state that people are seeing, and people maybe clobbering each other. Because it should be up to date, or, if you're writing every minute, you probably want to read every 30 seconds. Yeah, well, I was going to say, I guess for something like Notion or Google Drive, if there's going to be a lot of collaboration, you probably want to reconcile things a little bit more frequently.
[00:11:06]
So in that case, especially if we have things modeled as diffs and the size of the payloads coming in is pretty small, that might be a good case for server-sent events or WebSockets. Yeah, WebSockets. Something I didn't bring up when we talked about WebSockets: there is a downside to WebSockets, because if they were a free lunch, why would we even use server-sent events? And I won't make this rhetorical.
[00:11:34]
What do you think the downside is? So the downside is WebSockets are stateful, because it is a persistent connection. So what does being stateful do to our system? It means if we try to scale up or scale down, we can't do that really easily. We have to somehow migrate all those open connections to the new server, which is fraught with peril. That's why, when we build architectures, we never want them to be stateful.
[00:11:59]
It does limit our ability to scale, and WebSockets are inherently stateful. We always want stateless wherever possible. That means every new request, every new response is fresh; there's no context aside from it, and that means we can scale up and scale down really easily. So that's the downside of WebSockets. If we did want state, we'd want to have a different service, probably a dedicated WebSocket service.
[00:12:23]
Exactly, yeah. But now we're adding more complexity to it, so now we need a whole bunch of servers too, but you're right, that's what we would do. I was going to ask really quickly just to make sure I'm understanding: what you said about WebSockets, does that also apply to server-sent events, or is there something in the protocol that kind of protects against that? Yeah, any time you hear the phrase persistent connection, think: something on the other side has to manage that, has to give a heartbeat to the other one, or, you know, SYN-ACK, doesn't matter.
[00:12:53]
Somebody has to keep that open. That means maintaining state, because the server has to be aware of who you're talking to. You hit the same thing with long-running requests. If you're doing a deployment during a long-running request, you've got to move all the connections over in the connection pool. Connection pools are a nightmare; you have to deal with all that. They are, and again, everything's a trade-off.
[00:13:19]
We can do it. Do we want to manage the complexity along with it? It depends. Maybe yes, maybe no. The bias is always towards not doing it and keeping it the simple way until we actually have to do it. We run into problems as engineers when we try to over-engineer something, not thinking of the cost of doing that, because you think, oh, I'll just over-provision my system, I'll over-scale it. This architecture can handle 5 million requests.
[00:13:43]
Now I never have to worry about scale. But you're also adding a lot of operational and maintenance costs that you're going to have to pay. So hence, you know, it's easy to look at systems design and be like, oh, why aren't all systems designed that way? Because we live in the real world. In the real world, we don't want to maintain, you know, all this infrastructure unless we really need to. That's a lot.
[00:14:18]
OK, last question: what do we save? I like the diff. Like a PATCH request, that'd be cool. We can do a diff. I will say that it is more complicated, because now you need a diffing algorithm and you have to reconcile that somewhere. At what layer do we reconcile that? The database? Sure, but now we're going to slow down the database, because it's got to do these calculations. Some databases have this built in; they can do diffing.
[00:14:49]
That's beyond the scope of my expertise, but I assume it exists. We could do diffing at, well, we could do it on the client. We could do the diff there, but that isn't as helpful. We could create a new service that just does diffing, but now we're adding another service. But for just saving tasks, tasks themselves are small items, presumably. Do we even need a diff if we're not saving entire lists during auto-save, which we wouldn't need to?
[00:15:18]
I would do that. I would just save the task. Very simple, you don't have to reason a whole lot about it. If we want to make this dead simple, every auto-save just writes to the database, and it just goes through that same flow. It's going to be pretty quick. You know, if you had high volume, or tasks are large, then you want to start thinking about diffing. But again, don't make it more complicated than it needs to be if you don't need that yet.
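The "just save the task" version might look something like this, with a hypothetical endpoint and payload shape: one small PATCH per changed task, no diffing machinery.

```typescript
// Dead-simple autosave: PATCH only the task that changed.
// The route and field names are hypothetical.
async function autosaveTask(
  taskId: string,
  changes: { title?: string; done?: boolean }
): Promise<void> {
  const res = await fetch(`/api/tasks/${taskId}`, {
    method: "PATCH",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(changes),
  });
  if (!res.ok) throw new Error(`autosave failed: ${res.status}`);
}
```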
[00:15:52]
OK, we've got functional requirements; we have the answers to all of these. Non-functional requirements are not as important here, because it's our service, we say how it goes. But pretty good. Cool. Let's implement this. Don't feel bad, I know what you're thinking right now: system design is all good in theory, and then you sit down and you do it and you're like, uh, it's hard. I mean, that's why we're doing it together now.
[00:16:17]
My thought is, if we're saving every minute, we need some kind of cache on the client to keep those in state for us, so that every minute it has something to pull from to send up the chain. Yeah, I like your thinking. Yeah, I was going to say, probably having a dedicated layer on the front-end, not necessarily a back-end-for-front-end pattern, but more just a separate layer in the client architecture to go, OK, we're going to handle all the data-fetching concerns here and then expose that to the rest of the application as we need it.
[00:17:05]
Yeah. All of our system design so far has been focused on the back-end. We can't forget about the front-end. We can do a lot with the client. Free computer: it's already running, and we've got control over it to some degree. So what'd you say, Nick? Caching layer? Yeah, I like that. Every time, sorry, I like draw.io, but it just loses my style. I'm very particular about it. There we go. All right.
[00:18:03]
Cool. What's that? A client cache. But here, connection. Yeah, whatever, do it the hard way. Shape. OK, so we have a client cache. And what should we say about it? What's happening on the client? Let's just write it out. It probably has the latest data for the client, whatever they've last saved. So let's go step by step. Let's think: I'm writing a task. So say the workflow: write a task.
[00:18:50]
And then what? Writing it directly to the cache. What's next? Then the cache needs to get updated, go through the reverse proxy, and eventually get to the database, yeah. The cache is good, but at some point we want to persist it somewhere that matters. How do we do that? Send a write request, like a POST or PUT, to the API. I'll put this for now: save to database. So my thinking here, and we're doing this together again, I've never done this before.
[00:19:19]
We're doing it live. My thinking is, we've written a task, we've saved it to the client cache, and we need to write it to the database, so periodically the client needs to write from its cache to somewhere. We could make it easy and just say it's writing to the database directly and call it a day. That actually wouldn't change anything here, because we're only doing the task, so we can write to the task row here.
[00:20:06]
We've already sharded our database, so it's going to be pretty fast. It's still a lot of writes. But if we're only doing it every minute, that's actually a pretty slow rate, even at our scale. But if we were to scale up, maybe something I would consider doing is: we have a cache somewhere down here. And another connection, blah blah blah. Like a server cache. Yeah, I'll explain. This is my thinking, and I could be wrong.
[00:21:03]
Yeah, move the cache, not the arrow. So my thinking is, if we wanted to increase the frequency from every minute to, say, every second, that's a lot of writes, but most of them are not going to be meaningful, because we want to wait for someone to finish what they're doing and then actually persist it to the place where it matters. So I would maybe think about implementing a strategy where we update the cache; we're updating the cache constantly.
[00:21:45]
Once the cache, so we need a service here, actually. Add another service. Let's see: an auto-update service. And draw some arrows. So I'm going to say I have an auto-update service, and it checks the cache. If that particular task hasn't been changed in 30 seconds or something, the auto-update service is going to take it from the cache, write it to the database for us, and clear that row of the cache to free it up.
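A sketch of that auto-update service loop, with an in-memory Map standing in for the real cache and a minimal db interface standing in for Postgres. The 30-second idle window is the one discussed above; everything else is hypothetical.

```typescript
const IDLE_MS = 30_000; // "hasn't been changed in 30 seconds"

interface PendingSave {
  taskId: string;
  payload: string;
  lastTouched: number; // epoch ms of the most recent edit
}

// Stand-ins for real clients (e.g. Redis for the cache, Postgres for db).
async function sweep(
  cache: Map<string, PendingSave>,
  db: { writeTask(p: PendingSave): Promise<void> }
): Promise<void> {
  const now = Date.now();
  for (const [key, entry] of cache) {
    if (now - entry.lastTouched >= IDLE_MS) {
      await db.writeTask(entry); // persist to the real database
      cache.delete(key);         // clear that row of the cache
    }
  }
}

// The service would run this on an interval, e.g.:
// setInterval(() => void sweep(cache, db), 5_000);
```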
[00:22:19]
Does that work? Y'all tell me, I'm making this up as I go. That would work. OK, any dissent? Well, I have a question on why, if the ask is save it every minute, what's the 30 seconds then? Oh, I should have clarified. Let's say we did this every 5 seconds. If we're doing a minute, that's a pretty long period of time, so we probably don't need this whole caching auto-update service business. But if we're doing it really quickly, again, it's not the technology standpoint here, it's: what's the domain?
[00:22:47]
Like what is actually happening? How are users using the service? Are they going to be adding new tasks every 5 seconds? No. Are we going to catch people typing, changing their mind, and all that, and then they're going to settle on what the final task is going to be? Yeah, that's just how humans write things. So in that case, let's save our database the trouble of all this thrashing on the writes and updates.
[00:23:16]
We'll save it all to a cache. When it's ready at some point, whatever interval we decide for the auto-update service, we pull it from the cache and write to the real database. That way we cut out all this other flow that's unnecessary, and we don't have to make all these hops for every single update. We can do it pretty quickly. So essentially you're almost kind of streaming the data into larger chunks and then writing those chunks, versus, like you said, thrashing, constantly making all these small writes.
[00:23:50]
There is, of course, a trade-off here. What's the trade-off? Yeah, everything's either consistent or available. Which one are we biasing towards here? Availability. Yeah, because if our service goes down, we're not really consistent anymore, exactly. Because we're not writing to the database until some interval; if something happens in the interval, everything in the cache is lost. We lose those auto-saves.
[00:24:21]
That's a trade-off we're making, and it's a trade-off most things that auto-save make for you. They're not saving every second; it's every minute or so, and that's the understanding with the users. Yeah. What does it say right here? Last change: a minute ago. You know, that's what happens with auto-updating. And we didn't add in the logic here that says, hey, only update on change. Something like that, we'll imply that it's there.
[00:24:48]
OK, pretty good. We said we're getting into some diagramming, and we did, we got some diagramming in. Any questions, any thoughts? At some point, would you want to move from a single-leader replication strategy to maybe leaderless or multi-leader? Because now you have one database that's dealing with a high amount of writes and then having to share that with all the others, right?
[00:25:15]
So availability could become a concern. Yeah, we could. Something we could have done instead: we call this a cache, but we could have just put a database here, a different database, something that's all writes. We talked about this earlier: what's an example of something that's more write-heavy than read-heavy? Auto-save, great example, there have got to be way more writes than there are reads.
[00:25:36]
Cassandra or something, yeah. Yeah, so again, you don't have to know every database and all the nuances, the pros and cons. Just be, what did we say, directionally correct: have an idea and be able to say, hey, I may switch to a different database at some point. Which one would you use? I don't know, something that's optimized for writes. No one's going to be like, oh, what is that?
[00:26:04]
That's happened to me in interviews; they'll ask, oh, what database is that? They're just wondering if I know, but it's not critical that I know, because again, I can look this stuff up really easily. All right, yeah, I have one more question. So I guess with the setup we have right now, the reverse proxy connects directly to the cache. Does that introduce potentially more security concerns? Like, do we have to do data validation both on the web servers and the auto-update service?
[00:26:32]
Data validation in what way? More just in the sense that I'm worried that if the auto-update service just grabs everything it can from the cache, and somebody puts some bad data in there, that could potentially cause problems. I like where you're at, because that's literally the next section we're going to cover: we're making a huge assumption here, which is that we didn't authenticate anything. We didn't talk about how to authenticate.
[00:27:07]
We're just assuming that everything's authenticated, but again, nothing's free; there's a cost to doing that. Generally we assume the reverse proxy is kind of doing all that, but when you get to much higher scale, you're going to have a different service that's just your authentication. Jem, yesterday you mentioned reliability too, and when I see numbers for how often we would save, I think that's a lever to pull during moments of peak traffic or seasonal traffic influxes.
[00:27:42]
So if we're saving every 5 seconds, we could actually save once every second nonstop, or we could back it off to 10 or 15 seconds or something like that, some line of thinking that would offload stress if we needed to. Yeah, yeah. Great, thank you, Sean. We would definitely set this variable in some sort of configuration flag; we call them fast properties where I work. I forget what LaunchDarkly calls them.
But it's something like LaunchDarkly, a service where you can update these variables in real time to change your load very, very quickly. It's just a knob we can tweak. Good thinking. They also let you do experiments, like A/B experiments with a feature flag, multivariate testing. And if one of them's going horribly wrong, you can roll people into the other one, or test with percentages of people.
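A sketch of that knob, assuming a generic feature-flag client; the flag name and interface are hypothetical stand-ins for something like LaunchDarkly:

```typescript
// Hypothetical flag client; a real one would be the LaunchDarkly SDK
// or an in-house "fast properties" service.
interface FlagClient {
  getNumber(flag: string, fallback: number): number;
}

// Re-read the flag on every cycle, so ops can back the interval off from
// 5s to 10 or 15s during a traffic spike without a deploy.
function scheduleAutosave(flags: FlagClient, save: () => void): void {
  const delayMs = flags.getNumber("autosave-interval-ms", 5_000);
  setTimeout(() => {
    save();
    scheduleAutosave(flags, save);
  }, delayMs);
}
```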
Huh, I haven't used LaunchDarkly too much. I use it a lot. We're going to start selling services here: Next.js, LaunchDarkly, Frontend Masters.