The "Softmax" Lesson is part of the full A Practical Guide to Machine Learning with TensorFlow 2.0 & Keras course featured in this preview video. Here's what you'd learn in this lesson:

## Vadim explains that the outputs of the softmax function sum to one. Normalizing the sum of the outputs to one makes it possible to interpret each neuron's output as a probability.


Transcript from the "Softmax" Lesson

[00:00:00]
>> Let's keep our models simple for now. So basically, what we did so far: we flattened our image, and then we had those 256 neurons. And how can we get the results? Well, we need to add an output layer, which can be dense as well. And let's see, since we have 10 digits that we're trying to recognize, we can say that we want to have 10 neurons, and an activation function for them.

[00:00:36] So basically, what I'm doing right now, I'm saying that all my 256 neurons from the previous layer should be connected to all 10 units, or 10 neurons, in my last layer. And each of those 10 neurons should output something, so the activation function for them can be softmax, for instance, yeah.
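The model described so far can be sketched roughly as follows. This is a minimal sketch, not the code from the lesson: the 28x28 input size (MNIST-style images) and the `relu` activation on the 256-neuron layer are assumptions, since the transcript does not state them.

```python
import tensorflow as tf

# Assumption: 28x28 grayscale images; the transcript does not give the image size.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),                        # flatten the image into a vector
    tf.keras.layers.Dense(256, activation="relu"),    # assumption: relu on the hidden layer
    tf.keras.layers.Dense(10, activation="softmax"),  # one output neuron per digit, 0 through 9
])

model.summary()
```

Because both layers are dense, every one of the 256 hidden neurons is connected to every one of the 10 output neurons.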

[00:01:01] So softmax, it's kinda similar to sigmoid, so it gives you the distribution, 10 numbers between zero and one. But if you have multiple units in the last layer, what softmax is doing, it simply normalizes all the outputs of those 10 units so that they sum to one. So let me probably just repeat it once again through the diagram.
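For reference, the standard definition of softmax makes this normalization explicit. The symbols here are not from the lesson: $z_i$ stands for the raw input to output unit $i$, and $K$ for the number of units (10 in this case).

$$
\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad \sum_{i=1}^{K} \mathrm{softmax}(z)_i = 1
$$

Each output is positive and is divided by the same total, which is why the outputs lie between zero and one and add up to one.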

[00:01:26] So if you have sigmoid, we can get any numbers between zero and one, right? So that's the distribution of outputs. So sigmoid output will be somewhere between zero and one. With the softmax, it's also between zero and one. But if you have several units like this, it will guarantee that the sum of all of those outputs will be equal to one.
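The difference between the two activations can be checked with a small plain-Python sketch. The input values below are made up for illustration and are not from the lesson.

```python
import math

def sigmoid(x):
    """Squash a single value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    """Squash a list of values into (0, 1) so that they sum to one."""
    m = max(xs)  # subtracting the max avoids overflow; the result is unchanged
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

raw = [2.0, 1.0, 0.1]  # hypothetical raw outputs of three units

sig = [sigmoid(x) for x in raw]
soft = softmax(raw)

print("sigmoid:", sig, "sum =", sum(sig))    # each value is in (0, 1), the sum is not constrained
print("softmax:", soft, "sum =", sum(soft))  # each value is in (0, 1), the sum is 1 up to rounding
```

Sigmoid treats each unit independently, so nothing ties the outputs together, while softmax divides every unit by the same total.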

[00:02:15] Why is it useful? Well, if we normalize that to one, it means that now we can look at the outputs from those neurons as probabilities of having digits zero, one, two, three, and so on and so forth, till nine. And basically, that can be considered almost like your one-hot encoding.

[00:02:42] So where we have the maximum signal, so for instance, let's say we were providing a handwritten digit zero to our model. And in this case, probably when the model is trained, we'll get something like 0.99 probability. Well, no, it's not a percentage, it's just the value, but as a percentage it would mean 99%, right, that we are looking at digit zero.

[00:03:13] And the others will have something really, really, really, really small. But together with all those really, really small probabilities, okay, the sum will be equal to 1. So that's what softmax does.
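The handwritten-zero example can be illustrated with a short sketch. The raw scores below are hypothetical values chosen so that digit zero dominates; a real trained model would produce its own numbers.

```python
import math

def softmax(xs):
    """Turn raw scores into values in (0, 1) that sum to one."""
    m = max(xs)  # subtracting the max avoids overflow; the result is unchanged
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores from the 10 output units for an image of a zero.
raw_scores = [9.0, 0.5, 1.0, 0.2, 0.1, 0.3, 1.5, 0.4, 0.8, 0.6]

probs = softmax(raw_scores)

# The predicted digit is the index of the largest probability.
predicted_digit = max(range(10), key=lambda i: probs[i])

# The one-hot vector the softmax output is being compared to.
one_hot = [1 if i == predicted_digit else 0 for i in range(10)]

for digit, p in enumerate(probs):
    print(f"digit {digit}: {p:.4f}")
print("predicted digit:", predicted_digit)
print("one-hot:", one_hot)
print("sum of probabilities:", sum(probs))
```

The output for digit zero is above 0.99, the other nine are small, and all ten together sum to one, which is why the result looks almost like a one-hot vector.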