The Hard Parts of Servers & Node.js Processing Data in Batches
Transcript from the "Processing Data in Batches" Lesson
>> Will Sentance: All right, and let's introduce our time going by now, because now it's gonna really matter. So we're at roughly zero milliseconds here, I say. Okay, one, whatever, roughly zero milliseconds. Okay, now, by the way, the stream only starts getting data once we set an on data event.
[00:00:20] It doesn't start grabbing data from there, so I hope it sets up, it goes and finds it, the file. It doesn't actually start pulling data until we say what to do on the moment of that data arriving in in its first batch, otherwise, it would just be pointless to pull it in if we haven't set it to do something.
[00:00:35] So it won't actually even go and pull it in until we say on, and then it does, and the first batch arrives. And I'm gonna do it in blue, I guess, because it's coming from the far system, I suppose, and the first batch arrives.
>> Will Sentance: Yeah, yeah, yeah, yeah, yeah, first batch arrives, and we're going to have the auto-inserted, auto-created and inserted data.
[00:01:19] The first batch arrives of 64 kilobytes, 64 kilobytes of data, and it's like, what was it? It was tweet one, hello, and so forth. Okay, the first batch though, or that's not the closing curly brace, cuz it's a massive JSON object, the first 64 kilobytes, first 64,000 characters.
[00:01:42] In it's come, what's gonna be triggered? The data event, a data message is gonna be streamed out. What function is gonna auto-run at this point, Muhammad? What function is gonna auto-run at this point, Michael? Davit?
>> Davit: Do on new batch.
>> Will Sentance: Do on new batch, excellent, and up it comes, do on new batch.
[00:02:07] And look at this, people, input there it is, it's data, there it comes in the string. We create a new execution context. Are we still in global? There it is, global, we have do on new batch. We have POS, we have it auto-execute. [SOUND] We have it auto-execute, it's gonna auto-execute, there it is, create an execution context for it, there it is.
[00:02:40] Into it we go, in the local memory, we have our first argument. Sorry, our first parameter is data. Our first input argument is the underlying tweet data. I'm not sure I'm doing it in blue now, I guess it's coming from the file system, rather than from n,o but it's kinda packet, I don't know, whatever.
[00:03:06] Now, technically, people, again, it comes in as a stream of zeros and ones, so there's a buffer, but we're just going to, for now, represent it as stringified JSON, as JSON. There it is, tweet one with hello, and a lot of other data up to the end of it, out in the first 64,000 kilobytes.
[00:03:31] We reached this after, what's 64,000 kilobytes gonna take? It's gonna take six milliseconds, is that right? Not even that, one millisecond, about a millisecond, very, very quick. So we're at only one millisecond now, look at this, people. Not after 15 seconds, but after 1 millisecond, we started cleaning data.
[00:03:57] And we do, what do we do? We run, you're gonna run cleanTweets, the function we saved, cleanTweets, take in the data, and return it out, and return it out cleaned of the nasty words and append it with some cleanedTweets. cleanedTweets, is that in local memory?
>> Speaker 3: It's in global.
>> Will Sentance: In global, and that's where we're gonna take the first cleaned batch of JSON formatted data and put it up here into that long string, well, that empty string at that point. Let's make sure we put clean Tweets on the call stack cuz we are running it, cleanTweets.
[00:04:43] Okay, and that took us, I don't know, it took us 0.5 milliseconds, whatever, that's how long it took us to run the cleaning.
>> Will Sentance: No, it took us, these things are a little bit arbitrary, but they are intentional. It was 1.5 milliseconds.
>> Will Sentance: Slightly longer than it took us to get our 64k kilobyte batch in.
[00:05:13] Matt, go ahead.
>> Matt: So with this, we can't necessarily guarantee that it's gonna be the same order, read and cleaned.
>> Will Sentance: Yeah.
>> Matt: The clean tweet isn't necessarily gonna be the same.
>> Will Sentance: Well, maybe it will be, because here's a problem. Well, now, you know what? People, let me just take a step further back before we get the data cleaned out.
[00:05:44] Sorry, everyone, in this execution context where the cleaning is happening, where the cleaning is happening, we're not sure how it's doing it. But it's taking a bit of time to parse all that code, and parse all those strings, and find the bad tweets. While that's happening, you won't believe what's happening down here.
[00:06:11] What do you think's happening, Zep?
>> Zep: Sending the next 64 batch.
>> Will Sentance: The next 64 kilobytes arrived.
>> Zep: Mm-hm.
>> Will Sentance: The next 64 kilobytes arrived in, and it's continuation of the string of data. What event gets triggered down here when the next batch arrives, Davit?
>> Davit: It gets called to-
>> Will Sentance: What event gets triggered?
>> Davit: Do on new batch.
>> Will Sentance: Data gets triggered, exactly, Michael, which triggers?
>> Michael: Data on new batch.
>> Will Sentance: Do a new batch, exactly, and it's ready to run, and folk, I introduced this here but we're gonna it see in greater detail as the last big portion of what we do.
[00:06:56] Now, we have problems. We have a function that was auto-run, and is running another function inside of it. And while that's happening, at about, I don't know, three-quarters of the way through the cleaning of that first batch, that cleanTweets function's running, about three-quarters the way through, uh-oh. Do on new batch wants to run again, and auto-run here.
[00:07:21] Can it auto-run straight away? It would literally have to sorta pause the running of this function. This is Matt's worries. Now, hold on, would it run and finish sooner or?
>> Will Sentance: And this, people, is where we start to need rules for Node for when an auto-run function from a Node is ever even allowed to run.
>> Will Sentance: And we're gonna see this in sophisticated detail for, honestly, more intellectual reasons. There'll be the odd case in reality, well, there's many cases you'll need to understand it, but for senior dev and even mid-level dev Node interviews, it's crucial you understand it. This is where we really need to see it just for our regular day-to-day working with Node.
[00:08:04] I'm gonna start off with a simple explanation, or simplified, no, not simplified, no. Ignoring quite a few pieces of what the rules are for when that auto-run function is going to be allowed back in to run, cuz we've now had it run at, what was it, one millisecond?
[00:08:26] And now it's two milliseconds and it wants to run again, but we're 2 milliseconds, 1 plus 1.5 milliseconds, we're still in here cleaning the first batch. You feeling frustrated, Michael?
>> Michael: Very.
>> Will Sentance: Yeah, Michael is frustrated. Michael is the mayor, I'm not allowed to talk about this. He's a mayor of a major, major sub, no, I'm not gonna talk about, okay, not talking about this.
>> Michael: No, you can, it's fine.
>> Will Sentance: He's proud, he's a very great mayor, it's an honor to have him here. And he's frustrated, so you can only imagine what the non-mayors among us are feeling, powerless! Powerless, don't panic, don't panic. Turns out with actually the help of Node, and particularly the libuv library behind the scenes, so I'm actually gonna do it in purple.
[00:09:59] We call it the callback queue, I'm gonna be more specific on this later on, cuz there's quite a few of them, but for now, we're gonna say the callback queue. And so at two milliseconds, when the next butch arrives, triggers out the data event and says run do on new batch again, two milliseconds, we do not take do on new batch and start auto-running it.
[00:10:27] Where does it actually go, Michael?
>> Michael: Callback?
>> Will Sentance: Into the callback queue, exactly, into the callback queue is where it's gonna go, and it's going to have to wait there. Do on new batch stored there while we continue running code here. So let's finish up running the code.
[00:10:55] Clean data does come out and gets appended to what global variable, Sam?
>> Sam: cleanedTweets.
>> Will Sentance: cleanedTweets, excellent, and it's got, I don't know, still stringified. We didn't JSON parse it yet, we're just leaving as a script. So we gotta append the next piece of the string to JSON, parse it in one go when we've done that.
[00:11:14] We'll do that at the very, very end when all the data's in, but we're doing the cleaning as we go, cuz that's the bit that takes a long time. And into it is tweet one, hello, and so forth, all the data there. And we've executed cleanTweets, the function we were running.
[00:11:31] And so it gets what, Zep, in terms of the call stack?
>> Zep: Popped.
>> Will Sentance: Popped off the call stack, exactly, and we go back into the running of do on new batch. Well, there's nothing left to do in it, so it's also?
>> Speaker 3: Get popped.
>> Will Sentance: Popped off the call stack, and now, we're back to global, okay?