
Lesson Description
The "Challenges in Creating a Production Model" Lesson is part of the full, Hard Parts of AI: Neural Networks course featured in this preview video. Here's what you'd learn in this lesson:
Will summarizes the fraud prediction model and highlights the challenges of moving the model into a production environment. These challenges include training it on a larger data set, end-to-end deployment strategies, and ongoing model performance measurements.
Transcript from the "Challenges in Creating a Production Model" Lesson
[00:00:00]
>> Will Sentance: Getting our fraud detection model, this fraud detection model, essentially a set of boundaries, to production, live on the Internet. There are two major parts to this. One we've been doing here on the blackboard for a while, and maybe a data scientist would actually get on the blackboard and do it like this.
[00:00:18]
Data exploration and prototyping our model. Often data scientist-led, on a smaller sample. Here we had 100,000, well, actually here we had 4, but typically it would be about 10 to 20% of the final data set that we ultimately use to develop our production version of this converter.
[00:00:38]
That we can then use on all future refund requests. Maybe it takes only a couple of weeks of data, because a data scientist is going to want to, as Ben's saying, play with a number of different approaches and explore them. They don't want to perfect every edge case yet; they want to play with a number of different approaches.
[00:01:02]
The goal of this piece: understand the problem and the nature of the data. So what are some of the pieces involved in that? Well, a lot of pre-processing. Pre-processing might be, for example, rounding the dollar amount to a whole number rather than caring about the cents.
[00:01:24]
It could be categorizing the types of items, we heard Kayla say, the type of items. Not every single individual title, but categorizing them into, I don't know, perishables versus not. Because I'm a lot happier to get a refund on something I get to keep forever than on something perishable. So, categorizing. There are lots of things you could do to pre-process, including filling in missing data.
[00:02:00]
Maybe for some older data we don't have the restaurant, or we don't have information on the accounts on the IP, the IP is masked, or whatever it might be. There's lots of missing data that will prevent us running our converter if we don't fill it in.
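As a rough illustration of that pre-processing step, here is a minimal sketch using pandas. The file name and column names (dollars, item_title, accounts_on_ip) are hypothetical placeholders, not the course's actual data set.

```python
# Minimal pre-processing sketch; file and column names are hypothetical.
import pandas as pd

df = pd.read_csv("refund_requests.csv")

# Round the dollar amount to a whole number rather than caring about the cents
df["dollars"] = df["dollars"].round(0)

# Categorize individual item titles into coarser buckets, e.g. perishable vs not
perishable_titles = {"groceries", "meal kit", "flowers"}  # illustrative list only
df["perishable"] = df["item_title"].str.lower().isin(perishable_titles)

# Fill in missing data so the converter can still run on every request
df["accounts_on_ip"] = df["accounts_on_ip"].fillna(df["accounts_on_ip"].median())
```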
[00:02:21]
Huge part, feature selection, pruning, engineering. So, feature selection was, well, which features did we select? I'm gonna call on someone who's not looking right now. You're all looking. Kayla, which features did we select? Well, you selected what?
>> Kayla: The account links.
>> Will Sentance: Account links?
>> Kayla: And the amount of the refund, and the number of-
[00:02:48]
>> Will Sentance: Accounts on IP. Accounts on IP, yeah, exactly. And then which ones do we prune? Well, I guess we didn't really prune any of these, but pruning would happen as we developed our converter. Maybe we'd learn that one of these was not really important, that it wasn't a very useful thing to take into account.
[00:03:02]
But I could also say we pruned out the date. We were expecting to use the date, and we got rid of it, didn't we? We didn't use the date in the end. And feature engineering. To be fair, categorizing the types of items might be feature engineering, but also, let's say, the amount in relation to account length.
[00:03:23]
Maybe that's an important factor. Or maybe account length versus accounts on IP. Maybe if you've got a lot of accounts on an IP at the same time as a very new account, that might be captured by the conversion rules, but it might not be. Maybe you're interested in accounts on IP per year of account length.
[00:03:41]
Maybe you're much more worried about 150 accounts on an IP for a very, very new account than you would be for one that's 6 or 7 years old. Or maybe it's the change in the number of accounts on the IP in the past week.
[00:03:57]
Maybe that's a way of capturing whether it's, I don't know, a building that has many people in it, a dorm or a co-working space, versus someone who's just suddenly spun up a bot farm. These are the kind of factors that you would engineer. You would turn accounts on IP into accounts over time.
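Here is one sketch of what that kind of feature engineering could look like in pandas. The column names (dollars, account_length_years, accounts_on_ip, accounts_on_ip_last_week) are hypothetical stand-ins for the whiteboard features, not the course's actual code.

```python
# Feature engineering sketch; all column names are hypothetical.
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Amount in relation to account length
    out["dollars_per_year"] = out["dollars"] / (out["account_length_years"] + 1)
    # Accounts on IP per year of account length: 150 accounts on a brand-new
    # account is far more alarming than on one that's 6 or 7 years old
    out["accounts_on_ip_per_year"] = out["accounts_on_ip"] / (out["account_length_years"] + 1)
    # Change in accounts on the IP over the past week: a dorm or co-working
    # space is stable; a freshly spun-up bot farm is not
    out["accounts_on_ip_weekly_change"] = out["accounts_on_ip"] - out["accounts_on_ip_last_week"]
    return out
```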
[00:04:21]
You would do some calculations, some engineering, some playing with the input values. That's our feature engineering. Fixing imbalanced data is a big challenge for fraud detection models in general, and we're speaking more broadly here about models like this, but this one particularly has a problem with imbalanced data. That is to say, 99% of the refund requests are legit and were given a refund, and 1% weren't.
[00:04:48]
Meaning in 100,000 refund requests, 99,000 of them were in one category and only 1,000 were in the other. You actually don't have much data on those that were not given refunds, you've only got a little bit, and that makes the techniques used to identify the right boundaries less effective when you've got such imbalanced data.
[00:05:13]
It's quite extreme: 99,000 of these were not fraud and were given a refund; only 1,000 weren't. The techniques to find the best conversion boundaries, like Gini impurity metrics, do not work so easily on such imbalanced data. The point is not to go into the details, but rather to say that finding techniques to overcome this is a crucial part of this stage of model development.
[00:05:43]
Things like synthetic minority oversampling, or SMOTE, S-M-O-T-E. The synthetic minority oversampling technique, SMOTE, would create some synthetic refund-not-given samples to help develop our converter more effectively.
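A minimal SMOTE sketch, assuming the imbalanced-learn library (the course doesn't name a specific implementation, so this is one illustrative option); the data here is a synthetic stand-in for the 99:1 sample.

```python
# Oversample the minority (refund-not-given) class with SMOTE.
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Stand-in for the 100,000-request sample: ~99% one class, ~1% the other
X, y = make_classification(n_samples=100_000, weights=[0.99], random_state=42)

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

# Before: ~1,000 minority cases. After: synthetic minority samples bring the
# classes into balance, so boundary-finding techniques have more to work with.
print((y == 1).sum(), "minority before;", (y_resampled == 1).sum(), "after")
```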
[00:06:18]
And then interpretability, we already talked about this. Do we need to prioritize understanding how the decisions are made? In this case, our decision tree gives us boundaries. And then, experiment with different approaches. We used a single converter here; in practice, we very much might experiment with ensemble learning methods, that is, combining multiple models to see which works better. Note that, particularly for a decision tree, we start by first making sure we classify as many as possible based on years, then dollars, then accounts on IP, and then ten more factors.
[00:06:54]
That's what's called a greedy algorithm. It maximizes the classification, putting requests in the right bucket, 1 or -1, for years first, and then maximizes for dollars. But what if there's some interplay between these? You might not actually get to the very best set of boundaries for your entire sample.
[00:07:11]
If you'd started with accounts on IP and then worked back through the others in the opposite order, you might have ended up with a different, better set of boundaries that converts more of your sample. Ensemble learning techniques let you explore that, and again, there are many, many approaches.
[00:07:29]
My point here is, wow, there's a heck of a lot of stuff you can do here; it's obviously a huge discipline. But an ensemble learning technique, for example random forests, might allow us to try different decision trees that follow different directions, and maybe get to a globally best conversion.
[00:07:47]
As opposed to the single route, start here, then here, then here, that gets you only to a locally optimal conversion.
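As a hedged sketch of that ensemble idea, here is a single greedy tree next to a random forest in scikit-learn (assumed tooling; the synthetic data stands in for the engineered sample).

```python
# One greedy decision tree vs. a random forest ensemble.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100_000, weights=[0.99], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# A single tree greedily picks the best split at each step, which can land
# on a locally optimal set of boundaries
single_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# A forest of trees, each trained on different samples and feature subsets,
# can get closer to a globally good set of boundaries
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
```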
[00:08:08]
And then, evaluation metrics: assess how well it did. Well, we got 4/4 and 1/1, but in fraud detection models, that may not be very good. I love this: you could get 99% conversion accuracy on a sample that has 99,000 not-fraud and 1,000 fraud, refund-not-given, cases, and you could do that by just saying every single case is not fraud.
[00:08:29]
Every single case gets a refund. You'd be 99% right for the 99,000 that got a refund. It just happens that the 1,000 that didn't were the very ones that did the fraud. So for fraud detection algorithms, or fraud detection models, you care also about things like precision and recall.
[00:08:48]
Which is: of the ones that were actually fraud, how many did we catch? On this one, we could get 3 out of 4. Let's say we had 99 out of 100 right and this was the one that we got wrong, that's the bloody one that's a fraud. That's the one you need to get right.
[00:09:02]
So, evaluation metrics like precision: of the ones the model labeled as fraud, how many actually were fraud? And there are many metrics like this. All of it can then be maintained and tracked. The experiments are very exploratory, trying different approaches, but you want to keep track of them, using a tool like MLflow.
[00:09:27]
MLflow is the default framework for tracking and maintaining your experiments and your evaluation metrics. And shout out to the Codesmith team, who just announced MLflow JS, a JavaScript interface that hooks into MLflow.
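Here is a small sketch of computing those metrics and logging them with MLflow. It builds on the forest and test split from the earlier sketch, so those names are assumptions, as is treating class 1 as the fraud / refund-not-given case.

```python
# Evaluate the model and track the metrics in MLflow.
import mlflow
from sklearn.metrics import accuracy_score, precision_score, recall_score

preds = forest.predict(X_test)

with mlflow.start_run(run_name="fraud-refund-forest"):
    # Accuracy can look great (~99%) even if every fraud case is missed
    mlflow.log_metric("accuracy", accuracy_score(y_test, preds))
    # Precision: of the requests the model flagged as fraud, how many were fraud
    mlflow.log_metric("precision", precision_score(y_test, preds))
    # Recall: of the requests that actually were fraud, how many did we catch
    mlflow.log_metric("recall", recall_score(y_test, preds))
```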
[00:09:54]
Next part: producing this for production. You've gotta get this bloody converter online so you can go about using it to infer live: shall we give Paul this refund that he just clicked the Ask button for? How do we do it? This is often ML engineer-led. None of this is exclusive; these terms are very expansive.
[00:10:19]
That was, by the way, similar to software engineering for many years: not very well defined terms. You had systems engineers, you had a lot of titles that have all essentially become software engineer. Same thing here; the terms are a bit vague and it's evolving so fast, but the part of getting it live online at scale is typically the ML engineer's.
[00:10:42]
Here you now care about expanding your sample from the maybe 100,000 records where the prototyping data scientist said, look, we're roughly there, to working on the full data set to get the best conversion. A good example: suppose your data scientist is working with 100,000 records to be able to experiment and rapidly prototype; they may take only a month of data.
[00:11:06]
Great, most of the patterns are going to be independent of the time of year. But if you want to take that model to production, there may be subtle differences between December and February, and you need to be able to make predictions for every edge case. It doesn't change the fundamental model implementation, but you do need to ensure you find conversions that capture every damn piece across a larger data set, a larger set of edge cases.
[00:11:35]
So, maybe a million refunds. So we'd expand this to, not a good sign when I'm not writing on the whiteboard, expand this to maybe a million. We'd use techniques to come up with the converter at scale, like extreme gradient boosting with decision trees, developed as both a technique and a library.
[00:11:54]
Again, this stuff is so new. It was developed by Tianqi Chen, an engineer and researcher then at, I want to say, the University of Washington, and now, I think, at Carnegie Mellon. This is one of the most popular techniques, and implemented libraries, for developing the best decision tree at scale, the optimal set of boundaries at scale.
[00:12:26]
Introduced in a paper in, I think, 2016, this stuff is always evolving, and yet it's now the standard for working on large data sets where you want decision trees at scale. Where you're dealing with a million refund requests. Where you want to find, across many, many parameter values, the optimal, the best set of parameter values.
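A rough sketch of what training the converter at scale with XGBoost could look like; the data is a synthetic stand-in for the roughly 1,000,000-refund set, and the parameters are illustrative assumptions rather than recommended values.

```python
# Train a gradient-boosted tree model on the (stand-in) full data set.
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X_full, y_full = make_classification(n_samples=1_000_000, weights=[0.99], random_state=0)

model = XGBClassifier(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.1,
    scale_pos_weight=99,  # one way to counter the ~99:1 class imbalance
)
model.fit(X_full, y_full)
```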
[00:12:49]
So we could use something like XGBoost to come up with our best boundaries. End-to-end deployment: all of the stuff we did here needs to be put on the Internet so that it can be part of your company's workflows. So for example, our converter here, the one we're going to use on Paul, we'll likely build an API for.
[00:13:12]
I can't squeeze it in there, but let's put it here. We would have some API, let's say doordash.com/models/refund, that would expect years, it would expect dollars, look at this object, it's JSON, it's very accurate, and it would expect accounts on IP.
[00:13:43]
And then the return might be 1, give a refund, or -1, don't. We'll deploy this model, and the deployed trained, shit, we didn't use the term training. Dear. All right, bonus, the stage that we used our sample to come up with our model, our converter, the right boundaries for our sample, what's that stage called, the?
[00:14:11]
>> Speaker 3: Training.
>> Will Sentance: Yeah, okay, good. Yeah, training, that fancy word. Raise your hand if you've heard that word, yeah, that's all it is: using our sample to come up with our conversion. That's our training stage. When we were training our model and then validating it with our sample data, we came up with our model, which we're then using to make predictions, the inference stage.
[00:14:38]
Yeah, the model we trained needs to, yeah, sorry, I was wondering why I'd skipped the word training. The trained model is this: the best boundaries for that sample. That's the trained model. That's it. The deployed trained model, [LAUGH] takes an input in, and that's why a trained model, by the way, can often be stored as a Python object, a Python dictionary.
[00:15:04]
Because it's just numbers. A trained model is a wonderfully touchable set of numbers, a set of boundaries in the case of a decision tree. So when we deploy it, we're saying: take in an input and return an output. That's our deployed model.
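A minimal sketch of that deployment as an API, using FastAPI as an assumed framework; the route mirrors the whiteboard example, while the boundary values and field names are made-up placeholders rather than the course's trained model.

```python
# Serve the trained boundaries behind an HTTP endpoint: input in, 1 or -1 out.
from fastapi import FastAPI
from pydantic import BaseModel

# The "trained model" is just a dictionary of boundary values (hypothetical numbers)
TRAINED_MODEL = {"years": 1.0, "dollars": 100.0, "accounts_on_ip": 5}

class RefundRequest(BaseModel):
    years: float
    dollars: float
    accounts_on_ip: int

app = FastAPI()

@app.post("/models/refund")
def predict(req: RefundRequest) -> dict:
    # Apply the stored boundaries: 1 means give the refund, -1 means don't
    if (req.years >= TRAINED_MODEL["years"]
            and req.dollars <= TRAINED_MODEL["dollars"]
            and req.accounts_on_ip <= TRAINED_MODEL["accounts_on_ip"]):
        return {"prediction": 1}
    return {"prediction": -1}
```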
[00:15:36]
And then finally, a crucial part we already touched on: the ongoing performance of the model. Checking when Paul comes along and tests the limits of the prediction. He's not only a bad actor, he's a clever, smart, brilliant, boundary-testing one. Paul is great information to refine the model, to improve the sample, which has now got out of kilter with the population as a whole. The population now has people like Paul who are not captured in our sample, and therefore not in our converter.
[00:16:15]
So to remove or reduce that disconnect, or drift, we can update the sample. Let's say somebody is regularly checking for outliers, like the number of accounts on the IP that Paul has, and correcting them. So the refund was given; the correction is, do not give.
[00:16:39]
And that is then reintegrated back into the sample. Now we have 1 million plus, let's say, 1,000 of those a week. Are we going to take 1,001,000 records and redevelop our conversion boundaries from scratch, trying out all of the different boundaries? Well, for decision trees there are fewer things to tune, but there are still techniques to do what's called tuning our converter, adjusting our model without a full retraining on all of our sample.
[00:17:09]
That tuning stage, adding in new data that tweaks the model, that makes the converter values work not just for this sample, not just our million-record sample, but also for the updated, corrected Paul refund, is called tuning, or avoiding retraining, or optimizing the retraining of our model, to prevent that drift, that disconnect between the sample and the population.
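One way that incremental tuning can look in practice is continuing to boost from the existing XGBoost model rather than retraining from scratch; this is a sketch under that assumption (the model and 20-feature shape come from the earlier XGBoost sketch, and the corrected cases are randomly generated placeholders).

```python
# Fold the week's corrected cases into the existing model without full retraining.
import numpy as np
import xgboost as xgb

# Hypothetical batch of ~1,000 corrected Paul-style cases from the past week
X_corrected = np.random.rand(1000, 20)
y_corrected = np.random.randint(0, 2, size=1000)
dtrain_new = xgb.DMatrix(X_corrected, label=y_corrected)

params = {"objective": "binary:logistic", "max_depth": 6, "eta": 0.05}
updated_booster = xgb.train(
    params,
    dtrain_new,
    num_boost_round=20,              # a few extra boosting rounds on the new data
    xgb_model=model.get_booster(),   # continue from the previously trained model
)
```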