# A Practical Guide to Machine Learning with TensorFlow 2.0 & Keras Learning Rate & Gradients

Transcript from the "Learning Rate & Gradients" Lesson

[00:00:00]

>> This trick will not give you the exact solution. So it's not gonna get you to the analytical solution of those minimums. But its iterative method which gonna get you pretty close to the optimal solution. And that's what we'll try to implement right now. So with learning rates, we need to specify how many steps we'll want to do, all right?

[00:00:26] So steps, let's say 200 steps for now. And remember, I was getting to the minimum, and that's how many kinda steps I want to do to get to the solution. And now, I just need to implement for loop. Actually, let's use plural steps for step in range steps.

[00:00:57] In our steps, what I will do, I will simply calculate the current loss function, right? And then, we'll decide where I need to go. Yes, and I need derivatives, basically, to figure out what is the direction. Should I go up? Should I increase my b? Should I decrease it, right?

[00:01:18] And the trick of TensorFlow is that all of those derivatives can be calculated for you automatically using something called GradientTape. So let's actually implement it here. So with TensorFlow GradientTape As tape, for instance, what we can do, I executed the code accidentally. So it basically creates this environment where TensorFlow now is keeping track of arithmetical operations, which happening in this block of code.

[00:02:01] So what we can do, we can then type something like this. We have TensorFlow GradientTape, just do the predictions. And predictions, yeah, we'll be using our predict function on X. And calculate the loss function. So loss function, it's our mean squared error, and we comparing our predictions with true Y values.

[00:02:37] All right, so that's what we will be doing in our steps. We will just simply look at the calculating the predictions, right, for our original data points axis. And then, calculating the loss function. But what is interesting next is that we can automatically calculate the derivatives from this tape.

[00:03:04] And to do this, if we just type, for instance, gradients, tape gradients. And we need to specify function, loss function. And then, list of variables which we plan to modify. So in our case it will be w and b. So w and b. So we need to first tell the TensorFlow that we have those w and b.

[00:03:45] So we can say w equal to TensorFlow constant. And initialize it with something, for instance, 0.0, right? And then, b will be also our constant initialized with, I don't know, minus 1, 0. Doesn't really matter, because we will be modifying those values, right? So now, TensorFlow knows that we have those w's and b's, and in our GradientTape, it will keep track of those.

[00:04:19] And also, what I need to do in my function, predict. Yeah, I need to or specify w, for instance, and b. So I'm switching from just renaming the variables. Now, we have those w's and b's initialized with those can kinda random values, and we will keep track of this command.

[00:04:48] So we're doing the prediction for our points x and We can even specify w and b here, all right? And then, calculating the loss function. And we want to return gradients, so kinda those partial derivatives, values, of the loss function to those variables, w and b. And now, since we know the gradient, basically, you can think about it like the direction where we should go to modify our w's and b's to get to this minimum.

[00:05:30] And we can simply modify w's by running also another TensorFlow command assign_sub. Assign_sub will just take current value of w and subtract the argument provided to this function. And the argument in this case will be gradients 0. So gradient will be the array of two values, gradient for w and gradient for b.

[00:06:04] So I'm just simply referring to the first index of this array, gradients. And for b, I need to do the same thing. But just use the next element of gradients. So remember, I said that we also need learning rates, right? Not to overshoot. So I simply can multiply our gradients by learning rate right here.

[00:06:39] And that should help us to simply, not minimize, but just take only part of those calculated gradients, and slowly get to the point. So there's a trick here. If you take a learning rate too small, you will be barely changing your w's and b's, right? And instead of 200 steps, you might gonna need two millions steps.

[00:07:06] At the same time, if you're learning rate is too large, you might actually jump out of your minimum completely. So learning rate in this case, and determinant I'm trying to introduce, is hyperparameter. So in this case, learning rate is hyperparameter. It somehow controls how the training is happening.

[00:07:27] And in machine learning and deep learning, there are usually a lot of those learning rate hyperparameters. So it basically will control how the training is happening. All right, so with all of that, I think we're ready to run our command. The only thing I want to probably just print out our current status.

[00:07:51] So let's say if step divided by, let's say, 20 equals 2. I mean, if the reminder of dividing step by 20 equals to 0, then I will do print, I don't know, step. Yeah, format. And we can do step right here. All right, let's try to execute. Yes, it's not an assignment, it's comparison.

[00:08:37] All right, gradient object has no attribute gradient. Because it is gradient, not gradients. Assign variable. Okay okay. Another mistake. Constant will not be modifiable, so I want to use actually variable instead. All right, so steps up to 200, perfect. So we actually did something, right? So in our code, we were just modifying those w's and b's.

[00:09:20] How well we did, I have no idea. Let's check. So we actually have those w as close to 1.4, let's say. And what about b? b is 0.47. Remember, I specified my original w as something close to 0.1, b equal to 0.5. So we kinda there. Probably to even visualize it, we can use those graphs.

[00:09:56] But I will need to, yeah, do something. Let's call them tru. And specify the true values, the one we used to initiate our data, right? And I should change it here as well. And that's gonna be our blue line, or actually green line. And that's our guess line.

[00:10:39] Just red color, right? So if we plot this right now, you can see previously it was basically down low, right? Because our, well, actually we started with the w equal to 0, b equal to minus 1. But minus 1, it means that it was actually sitting below the 0 line at minus 1.

[00:11:07] And it's basically shifted up, and also tilted, to get closer to the true values of b and w. And the thing is, we can continue this training process even more. So for instance, if I, yeah, I should probably comment out those w's and b's just to use previously already calculated values.

[00:11:30] But now, if I just rerun this block again, and you can see, it's running pretty fast. Previously, w was 139. If I re-execute it, it's now getting even closer, right? And the same for b, it's getting closer. And that was the plot after first 200 steps. Now, it's really close.

[00:11:54] But maybe we at this point in the situation where we kinda jumping around the minimum, right? And not getting to the exact. So there's still a little bit of a difference between red line and green line. If I change my learning rate to something smaller, that will allows me to get even closer.

[00:12:15] But basically, this example of linear regression, the reason why I introduced it, I wanted to introduce a couple of concepts, like lost function. Loss function is simply the measurement of the error. And back to our example with neural network, it should be the measurement of if you did the prediction correctly or not, for instance.

[00:12:41] If you're looking at the hot dog or saying it's not hot dog, our loss function should return pretty big value, right? We are very, very mistaken. And then, you can just calculate derivatives of this loss function to all the weights we have in our neural network. And we will know how those weights and biases should be modified to actually get you the correct answer.

[00:13:12] So it's kind of the same algorithm, but on significantly larger scale, and when you're operating with thousands of different weights and biases. But the common principle is exactly the same. You're just providing the input data, what you expecting right? And you keeping track of what is happening and calculating how weights and biases should be tweaked to actually get to the correct results, something similar will be provided as the input.

[00:13:44] Yeah, as simple as that. [LAUGH] maybe not so simple. Okay, next, we gonna be playing actually with pictures, and that probably was the heaviest math part, once again. So now, it's gonna be all about just simply writing code which will take the image as the input and print something funny at the end.