Transcript from the "Learning Rate vs Steps" Lesson

[00:00:00]

>> Does the learning rate make a difference in how long it takes for your model to train if you have, let's say, large, large, large data sets?

>> Yeah, so right now, those two independent, my current learning rates and number of steps. So how long the algorithm will be running depends on number of steps.

[00:00:28] But learning rate defines if we gonna get to the solution or not at all. So just to demonstrate, let's say we initialize with those variables, 0 and minus 1, right? But let's say our learning rate is really, really small. So what will happen if I just re-execute this code, right?

[00:00:49] So we did 200 steps, but if we look at the w, originally it was 0.0 right? And our b was minus 1. So if we look after those 200 steps, our w is actually changed and our b modified. And it's kind of going the right direction, but it's really, really far from the true values just because our learning rate is so small.

[00:01:19] If I run this code for five billion trillion [LAUGH] iterations, it will get to the solution or at least in the proximity of the solution, right? The absolute true value is not guaranteed. But this is really a simple problem, you can probably just solve it analytically. But anyway, that's just to demonstrate what's happening with this, it has degraded descent.

[00:01:46] You're also starting with some random point, and just calculating what should be the direction you need to go. And you also can control your learning rates and how fast you're going, and you might overshoot, right? So for instance, right now, if I set learning rate equal to 1, why not, right?

[00:02:08] Well, 10. If we're overshooting, let's overshoot big. So if we execute this code, so we just ran 200 steps again, let's just see, we actually over jump from the dynamic range. So we got so huge number is has become none. [LAUGH] I know. Let's probably get learning rate equal to 1.

[00:02:33] I'm not sure what's gonna happen, but we'll see in a second. Still none, okay. So it should be probably something 0.9. And that's the thing, to some degree, yeah, it's this negative number in 10, 24th, it's huge. And just because, yes, for our function, loss function is basically your hyperbola.

[00:03:11] It's hyperbolic function, right? And what's happening with those learning rates, you just jumping out of. And after 200 steps, yeah, that's why you seen this 10 and 24th power, just because you moving not closer to the minimum, but actually out. You choosing the right direction, you just over jumping significantly.

[00:03:36] So back to our original question. With the number of steps, you control how many calculations you will do. But learning rate control how fast or how slow you're gonna progress. So it's combination of both. To get to the solution, quite often you should start with some large learning rate.

[00:03:54] And there is a method called one shot learning. It basically tries different learning rates, right, starting with something really, really small. So basically, barely moving. But then, as you progress, it's seeing that loss function starts dropping significantly, meaning that you getting to the solution pretty fast. But you continue increasing the learning rates to even start jumping outside of the solution.

[00:04:28] But this simple exploration will give you approximately optimal value for the learning rate. So when you see the minimum in the loss function, if you just take this value, which you used for the learning rate and divided by 10, it should give you an optimal learning rate value.

[00:04:47] It basically means that in 10 steps, you will get kinda to the minimum. Makes sense?

>> So it gives you the optimal learning rate for the number of specified steps?

>> Learning rate depends on the problem, but you should consider them as independent hyper parameters. But you need combination of both to get to the optimal solution.

[00:05:19]

>> So in this case, since you've increased the learning rate to 0.9, what would you have to do with the number of steps to get you closer to-

>> Nothing. That's the critical [CROSSTALK] it will never help. Well, maybe, okay, maybe reduce the number of steps to 1.

[00:05:35] [LAUGH] Let's see if it's help or not. So if we run this code, let's see the value. 1.4, 1.7, okay. It should be 0.1 and 0.5. So if we increase to 2, so 2 steps, okay, it was 1.4, it became 0.6. So we simply kinda jump this way first.

[00:06:02] Now, we jumping that way. And the more steps I will be doing, the far I will be. Remember, it was 1.4, now it's 2. Yep, now it's 3. It was 1 something previously, I think. So it's just basically this jumping out of your well. So nothing, nothing will help if your learning rate is too big.

[00:06:28] That's why they recommend to start with smaller. But as I said, this method I think I heard about it from author of Fast AI, Jeremy Hulbert. Jeremy created the whole deep learning course and framework with the same name, Fast dot AI. And he used this method quite a bit.

[00:06:56] So basically, you're starting with learning rate pretty small, but on every step, you just increase it a little bit. And you're doing that to just find the optimal learning rate you should use for your problem. By plotting a learning rate versus loss function. Let's take a closer look at what we discovered so far.

[00:07:20] Actually, this notebook was already prepared with all the bits in it. So if you want to play with it on your own, you can just execute this code. So it's exactly what we did generating the noisy data, right? Plotting the distribution, playing around a little bit with other types of visualization.

[00:07:40] And once again, kinda creating variables, the one we will modify additional functions. So I just wanted to provide this example to demonstrate build the concept of loss function. So it's a function which is supposed to represent a how mistaken you are, and what you can do to imrove your mistakes, right?

[00:08:02] And what we're doing, we're just keeping track of all the operations we did while calculating those predictions and calculating loss functions. And then, just using gradients relative to those variables, which will indicate how the variables themselves have to be changed to get to the local of this loss function.

[00:08:25] All right, so if you want really cool visualization, it's actually at the end of this notebook. I've not only plot 2D image, but the whole 3D right here. Right here. We should demonstrate how our loss function looks in 3D, so that's actually the image I'm referring. So we have weight as 1.

[00:08:51] One axis biases the other, and this whole thing almost look like a saddle. And we're just getting to the minimum by just trying all those points and slowly getting there.