Transcript from the "Visualizing Learning Rate & Loss" Lesson

[00:00:00]

>> What I will need to do in my problem, to solve to basically figure out how to heat this red line, into the green line I can minimize the loss function, right? And I can do it several ways, I can just randomly modify w's and b's, by weight and by bias, right, and see what's the current value of the lost function.

[00:00:25] And then just choose the set were w and b gives me the smallest loss function, but it's random guessing, right, and it will take me forever just to go through all of those combinations of w's and b's. So, we need somehow to get to those proper w's and b's by doing something with them and probably is differentiation, right, so that's what neural networks doing during the back propagation.

[00:00:57] They also looking at where we should go, basically how we should change change w's and b's to make our loss function slightly smaller. So, let's implement this, right, so [SOUND] what we'll do is the following. So I'm trying to reproduce right now what neural network is doing during the training and we will be using methods similar to basically.

[00:01:32] So first we need to specify what is our learning_rate. And the learning rate is usually defined, because when you calculate the slope where it should go, you want to go there slowly. Let's say we have some independent parameters, right? In our case, we just simply playing with the w's and b's, right?

[00:02:00] So let's say that's our bias and loss function depends on bias something like this, right, and last function is just the error. So kind of how far away in our case this line we're trying to fit into the distribution, how far it is from all those points, right?

[00:02:26] It's all this average squared distances between the line and as soon as we get to this ideal position, we will be at this point. But we starting from somewhere random, right, so that will be, actually, let me just draw it with a green line, so let's say that's our initial init guess.

[00:02:53] The reason why we want to have learning rate because after you have calculated the gradient, after you have calculated the direction where you need to go. So basically gradient will tell you, okay, it looks like you need to go this way to minimize the loss function, right? If you not apply learning rate what might happen, it might say you need to change your b, so there's delta b, can be too big and you can overshoot.

[00:03:30] Well, we basically going this direction, but delta b can be anything for instance, we can end up being somewhere over here. So that's gonna be our b new. And that's gonna be out next set of b's, and we will do the same thing with w, right? So for w, this loss function might look slightly different.

[00:03:57] So, we also probably started, let's say that was our original w zero. And if we're not accurate with our learning rate, we might, for instance, go this direction but might overshoot and get to w1 somewhere over here. And in worst case scenario, what might happen you might kind of jump in out of your minimum.

[00:04:23] That's not what you want and learning rate allows you to simply go to the direction where we should get to the minimum but not overshoot. So what it's doing is just taking this w, delta b and multiplying by learning rate, so instead of jumping all to here, we'll only jump to let's say here and the next b will be this one.

[00:04:53] And then learning rates simply gonna get us on the next step, this value, and hopefully we'll end up somewhere close. But you can see, for instance, I'm in the distribution right now with this learning rate, where I will be jumping around the minimum but not gonna get to it.

[00:05:13] So there are a couple of tricks for instance, learning rates can be decreased as you go. So you can get to the point where you almost like isolating around minimum points, you're shrinking the learning rates and getting closer to it.