Transcript from the "Neural Network Attention" Lesson
>> Now let's talk about where the convolution neural networks is looking, where the making a prediction. And I will actually introduce another concept of transfer learning. So with the transfer learning what you can do, you can simply grab already trained models and load them in your notebooks. So skipping all the training parts, somebody else already did that job for you and probably used tons of GPU compute, just the training module.
[00:00:32] And you can simply download it explicitly for instance by using karas applications, VGG16 it's specifically one of the architectures. So let me quickly load it. And we can probably even try to say something like model summary, to find out more about this architecture. So you can see it's actually taking 24 by 24 images with three channels, RGB, right?
[00:01:05] And doing a lot of convolutions, so we're actually doing two convolutions in the sequence and only after that we're doing max pooling. And you see that there are a lot of convolutions happening here and when we're reducing everything, almost to 7 by 7 filters, right? And only after that flattened image and our last two fully connected layers.
[00:01:33] We'll reduce everything to those thousands neurons which corresponds to the classes. Remember, in image net I said that we're trying to recognize between 1,000 different objects. So that's why we need 1000 neurons at the lowest layer. But when we initialize that's when we grab this VGG class from Kerris applications.
[00:01:59] We also can say that I want my model to load not the random weights, I won't actually grab the weights which have been trained on image net. So basically, I will already by having this small, I can do prediction directly. I just need to provide image 224 by 224 pixels and color image, and that's what I will be showing you how to do next.
[00:02:28] So with this command, I'm just simply downloading image of elephant. I can even visualize it here if you want, and they can use mark down directly for this. So with the mark down to visualize image allophones I need just to say, read if out [LAUGH]. It's not actually after completing in mark down but I can just copy/paste the name of the file, I've just downloaded.
[00:03:01] And that's the image we will be processing. Just beautiful picture of two elephants under creative commons license, which is good for me. And what we'll do first we need to change the resolution of our image, right? So that's what I'm doing here, image load image, I'm just loading it's using image from Keras pre processing, right?
[00:03:31] So basically, these class on their tensorflow Keras pre processing image, is got its own function load image from a particular location, right? From the file, store. And it can actually changed the resolution of the image on the fly while loading it so we converted it to 224 by 224.
[00:03:56] Right here, I'm just converting image to the array and adding additional array. Once again, our model assume that we will be providing multiple images as the input, right? But if I want just to provide one, I still need to create a list with only one image inside so that's what I'm doing here.
[00:04:18] And preprocess input is probably just doing the, I'm actually not exactly sure what is the doing, preprocessing, preprocessing, I don't think it was defined anywhere. The main part of this code is just simply converting our image to this resolution, preprocesin input is part of the VGG. So it's basically doing conversion to whatever this model expect the preprocesin to the image to be done.
[00:04:50] And it's now we can actually do the predictions. So in our x if we just print the x, no x. So you can see it's just the array with each of those three numbers, that's the RGB channel of each individual pixel, and I have tons of them. So what they can do I can just say model do the prediction and provide this list of continue of our image.
[00:05:22] And print out what our model predicted. So it predicted with 90%, almost 91% accuracy that we're looking at African elephant. ABout 8%, almost 9% it's a Tusker, and it's India elephant about 0.4%. Pretty accurate prediction, but what is interesting is next what I gonna do. So remember, we're choosing between a 1,000 classes, right?
[00:05:52] And then it kinda curious what was the number of the neuron activated, which correspond to African elephant. And I can simply just look at the index of this neuron and it was 386,000, right? So why am I caring about this index? Well, it's basically what I will be doing next, I will try to look at the last convolutional layer.
[00:06:30] And we'll simply try to find all the activated connections, which activated this neuron, 300 whatever the number was, 386, right? I can just see which of the location of the neurons which activated that last neuron. It basically allows me to do the predictions which area of the image my neuron network was looking at when doing the prediction that we looking at the elephant.
[00:07:03] So to some degree I'm trying to visualize where my narrow convolution neural network is paying attention while doing the prediction. So I said that, neuron networks are black boxes, right? And the way it weights and biases are adjusted, it is a mystery, to some degree it is and a lot depends on the initial way distributions, right?
[00:07:25] And for instance, if you initialize your model with one random numbers, but then reinitialize them. You might get completely different numbers of weights and biases, right? But with this thing, at least I can visualize or get some understanding where the neural network is looking at from the making a prediction.
[00:07:47] And that's that's what I'm doing here. I'm just grabbing the last convolutional layer so block5_conv3, if we look at the topology, it's right before our fully connected neuron networks. Before the MaxPooling, Block5_conv3, that's our lost convolutional layer. And again, What I can see, I can see at all the gradients which initialized my African elephant output.
[00:08:24] So model output, that's our last layer, right? So and we just looking at this last neuron, and we can find all of those activations where they happen. So, it's kinda just looking at those filters and in our last convolutional layer, it was 7 by 7 pixels. That was the result of all the previous convolutions and we can see which of those particular 7 by 7 pixels were the most active when they did the prediction.
[00:08:57] That we're looking at the elephant. And I can build the, I'm not sure if I already executed this block of code, so let's just run it once again. Yeah, I can visualize the heatmap now. So heatmap just saying here that this region was the one which resulted into the prediction that I'm looking at the elephant.
[00:09:25] And it doesn't really give me a lot of useful information. So I can actually overlay this heatmap on top of the original image to actually visualize what is happening. So apparently the prediction was done because we're looking at the heads of the elephant and Tosca yes of the mother elephant.
[00:09:47] [LAUGH] And the thing is, you can play with this pretty much with any picture, do you want to try something, you just for fun?
>> I have a question.
>> So I know that you said that we already trained the model like on elephants, I assume.
>> And then it was expecting an array of at least three different assume elephants, some sort of images. Why did we do the one image in this example? Was it just for ease in this specific case? Sorry, why didn't we do three creative commons, different elephants?
>> So I think you asking about the prediction when we done that there is African wlephant, there's tusker and Indian elephant.
>> Towards the beginning, your input to.
>> My input, you mean just this image?
>> Yeah. You said that it was expecting an array of at least three different images.
>> And then, we just copied the image.
>> So, the way how our model usually trained with those VGG systems they using mini batches. So, if you look at the input layer, dimensionality is the following, none which just correspond to different number from 1 to as many as you want images.
[00:11:16] And those images all supposed to have 224 by 224 with 3 channels. So, just because we have this non it means that our model expecting the list of images. And the reason why this is common and you will see it with usually pre trained models is they using technique called mini batching training.
[00:11:41] And basically, when you're calculating the gradients, kind of how you need to change those weights and biases. You're not doing that only by looking at the single image, but you look at multiple images and calculating what should be the average change of your weights and biases to avoid kind of noisy change of your weights and biases.
[00:12:09] So just can reduce the noisiness of those changes and get more stability of the training. We introducing those mini batches and there is another benefits when you do a mini batch training. It can actually utilize vectorization, it means that you can use CMD instructions which you can skip explain.
[00:12:32] [LAUGH] But it can significantly improve the performance and when I'm talking about performance in this case. It's not the accuracy of the model, but it's just computational performance. Just because you can do the operations on individual images simultaneously in parallel with those CMD instructions. So mini batching is a good way to get to a more general solution, right?
[00:13:00] Just because you avoid too many noises, because let's say you provide only one image of elephant when you train, right? And your whole model will kind of move toward that recognizing one elephant, right? But what about other 999 classes? We most probably gonna decrease the weight which correspond to those classes.
[00:13:23] So to avoid this type of problem, we're processing several images simultaneously. And mini batch exercise can be anything from 16 images to thousand images simultaneously. And it will simply look at what's the best way to satisfy all of those classes. And mini batching is highly recommended if you have a lot of classes like in this case, instead of 10, digit, 10 classes we have thousands.
[00:13:48] So to have, get a better distribution of weights and bias modification, it's basically that's why we're using mini batches. And we need to provide list of images to our model when we training and also when we predicted