Transcript from the "Writing a Technical Design Document" Lesson
>> I am actually gonna go through an example, technical document with you, and we're gonna kind of do a thought experiment here on this, and we'll kinda go through it together. So as our first technical document, I'm going to give you guys the problem. We currently have no way to monitor stream output logs from services and other things running in our Kubernetes cluster.
[00:00:21] So basically, we have no logging, all right? Just before I jump into this, what do you think? And I'm curious to you guys, where do you think we would start with this? What would be your next prompt or your next header that you would say after we wrote this problem?
[00:00:41] What would you feel, we don't have logs, where would you start?
>> How do we log?
>> Yeah, that's a good one. How do we do logging currently? That's a great one.
>> Yeah, that's that's a good one, absolutely. That's actually a very, very good one.
[00:00:57] I mean, it could be a lot of different things, but I definitely think starting with what exists is a great way to start. So I actually started with requirements. And I did that mostly because I knew what existed already, which was nothing. [LAUGH] And that's why we don't have it here.
[00:01:15] But it's in the circumstances, if something did exist, you would definitely wanna work with that first. So let's set some requirements for our technical document. We only care about stdio, or stdin, stdout. We have no use for fields or advanced indexing or anything like that. If you've ever worked with logging, you know that that second part literally throws out most of the advanced logging solutions out there.
[00:01:40] Because all we care about is just simple stdin, stdout, or what is coming from our application raw, straight out of it. The second thing, we don't wanna manage infrastructure [LAUGH], right? Second requirement, I don't want to manage this [LAUGH]. You know what I mean? Just writing that down alone can help you guide your decision making quite a bit with regards to that.
[00:02:09] Another one, we only want to solve simple logging for all of our services in Kubernetes, to unblock when we have errors and need to find them. This one simply tells us what we actually care about, right? A lot of monitoring and logging solutions out there will, say, well, you can implement tracing, and you can do real time error handling, and all those things.
[00:02:34] Okay, sure, we totally could do that at some point, but that's gonna require a lot of code change, it's gonna require a lot of other things. And so this is that scoping that I was talking about before where we say, we only want simple logging, we have nothing.
[00:02:50] How about we not try and go to the moon when we don't even know how to build rockets yet. And so that's really what this is here. It's is simply saying, hey, I know this sounds fun, but first, let's focus on just to solving the basic problem. Last one is also massively important, because I think if you go to most places, they want a couple of these things, which is, we don't care about metrics right now, right?
[00:03:17] Most monitoring stocks and things like that will say, cool, you want logs, guess what? We'll give metrics too. That little person on your shoulder telling you that, is enough to distract you and take you down a whole other path that you didn't need to go down. And I've seen this tons of times where we started solving one problem and then we were like, yeah, but it's free [LAUGH].
[00:03:42] Cool, [LAUGH] whatever. It's not free in the sense of engineering resource time. It's not free in the sense of having to learn these new things. It's also not free in the sense of, now we're gonna have to support it, right? And we might not be aware of how free actually it is later on when we're dealing with a bunch of other problems.
[00:04:06] So I'm gonna take a look at this from my perspective. And this is where we start applying a little bit of what you were saying, what do we already have? And we're gonna look at this from two separate angles. The first angle we're gonna look at this is from self host, and we're gonna look at real solutions here.
[00:04:23] So you guys hopefully should be able to recognize some of these terms and things like that. First off, we're gonna look at self hosted, right? Well, we have Loki with Grafana, and it's free basically. We discovered that, it includes most features, right? However, the query language is adoptable and simple, it's not massively extensive in what we can do with it.
[00:04:51] But if we're talking about we just wanna get something up and running, sure, this Grafana and Loki might not be a bad solution. So whenever you compare something, you normally want something to compare it against, which is something I also highly recommend. Don't just look at one solution, look at multiple solutions.
[00:05:11] And so what else do we have out there? Well, we also have Elastic with Kibana, right, the good old fashion ELK stack. Because it's self hosted just like with Loki, it's free, but limited, right? Limited because it's licensed, right? So if we automatically start comparing Loki to Elastic, we kind of start seeing, okay, well, we get most of the stuff with Loki, we can get a lot more with Elastic, but it's licensed, we have to pay for it.
[00:05:43] And the last thing with ELK is that it provides a lot of extensibility, right? It has a lot more functionality and things like that with regards to search, and we like those things, right, but it is unfortunately still behind the license. So now we start talking about cloud hosted, right?
[00:06:00] We wrote in our requirements that we don't wanna have to manage this. So why did we start looking at it with a self hosted solution? Can anyone give me a potential reason why we looked at it from a self-hosting solution? First, okay, the reason for it is because cloud-hosting can sometimes just be way too expensive.
[00:06:21] [LAUGH] And we kind of wanna rule out the things we know we're not gonna wanna do first, right, and potentially even have them as backups, right? If we go to a cloud product and we say, okay, cool, we wanna try this out. And then we literally just explode our credit card [LAUGH].
[00:06:42] We just made a whole other problem and we are stuck now. So it's really important to look at it from, even if you're saying, okay, I'm not gonna do that, consider it. Maybe there's something there you might be missing. And you'll notice that with the cloud hosted it's obviously there's much more options here, right?
[00:07:01] But when we look at it, what are we mostly paying attention to? Costs, right? Grafana cloud, we already said we like Loki, we said that it's nice, it's got an easier user interface and stuff like that. How does that apply in the cloud solution, right? Well, there's a base free of 50 gigs, and then you pay on top of that.
[00:07:21] Okay, cool, we know that Grafana is something we like, we know that we like Loki. This isn't a bad solution still, whether it'd be on cloud or on prem. And then we say, okay, let's look at some other ones. We look at Papertrail, right, another very common the one where we say, okay, well, base is $7 a month for a gig of data.
[00:07:42] Compare that to even Grafana, you kinda go, okay, cool, Grafana is still looking pretty good. [LAUGH] And then we slowly start taking steps away, right, where we say, okay, well, what about Elastic, right? We said we like the ELK stack, they have a cloud hosted solution. What does that look like?
[00:07:59] Well, that's base of $95, and maybe we're still really small. Maybe we don't necessarily need to start just dropping money into places that we don't have to, right? And so in this case, what we're really exemplifying here is the difference between what base looks like versus jumping into something that, we may be not aware of could cost us a lot of money, right?
[00:08:27] Same thing with like Solarwinds Loggly, right? We've got the Papertrail option which starts off cheaper, but Loggly being more of like an enterprise product, it's got a higher price tag because of that. AWS OpenSearch, right? Same thing, it's much more of an advanced product. It gives you the ELK stack stuff, but not top and search now.
[00:08:51] Base $80, 50 gigs, and then you've also gotta pay for cluster and instances, right? And then finally, New Relic, base free, and then 100 gigs. So how do we pick, right? How do we pick, right? We've looked at a bunch of different options and how do we know which is the best for us, right?
[00:09:12] Well, you're gonna go back here, [LAUGH] you're gonna look at what you previously wrote. You're gonna say, okay, well, I liked Loki, I liked what we could do with it. I liked the process or the potential that is there. And I do like Elastic with Kibana, but it's licensed, it's limited.
[00:09:30] We're gonna have to pay money and we may not want to yet. We also said if we go further backwards then, our requirements that we don't wanna have to manage infrastructure. Well, we've kind of seen that cloud hosted is actually very possible. So we can get self hosted completely out of here, we don't have to worry about it.
[00:09:48] We know now that we have a solution that's potentially a decent price, and also technically what we like. Cool, okay. So basically, after you've kind of looked at everything, you've sauced it, and you say, okay, cool. There's gonna be a point where you're gonna start feeling like, okay, I need to write my solution, right?
[00:10:10] And this is hands down the scariest part about writing a technical document because this is where you put your money where your mouth is. [LAUGH] And the best thing I can say with this is, lean on your gut, right? Lean on your experience, your knowledge. And just try and find the best solution possible, even if it's not the one you wanted.
[00:10:28] Because honestly, AWS OpenSearch seems pretty cool, right? So just maybe Elastic cloud, you might even be excited about the solution you're not gonna use [LAUGH] and you have to be prepared for that, right? As a DevOps engineer, you have to be prepared for giving the company the best solution, not you.
[00:10:47] So what do we do? We write a proposed solution, right? And we say looking at the list above, there's only one solution that really stands out to solving our problem, the quickest with the best accuracy. And that's Grafana cloud, right? Why is that Grafana cloud? Well, as we just said, we liked the user interface, we like how it works.
[00:11:05] We don't mind the costs and we know that if we wanna get out of it and something else into the future, we can, right? This is the best of pretty much all worlds. We have very easy move out, easy move in, and not really that big of a vendor lock-in.
[00:11:22] And so actually what's funny about this is, this is a real technical document. I actually took this document as something that I was trying to do because I didn't have any logging for an on-prem solution that we had. And as a new smaller company, as the term goes, buying on a budget.
[00:11:43] [LAUGH] We don't have a lot of money and we want to be able to do a lot of things. And so for us, that's the best scenario here.