The "Detecting Good & Bad Reviews" Lesson is part of the full A Practical Guide to Machine Learning with TensorFlow 2.0 & Keras course featured in this preview video. Here's what you'd learn in this lesson:

Vadim explains that a model can overfit its training data, and demonstrates how to spot overfitting and prevent it.


Transcript from the "Detecting Good & Bad Reviews" Lesson

>> Okay, so we're pretty much done with image processing at this point, right? And we saw a couple of interesting things about convolutional neural networks and fully connected neural networks, and even how to try to explain what the neural network has done. But images are kind of a unique data source, because you have a particular resolution, right?

[00:00:25] And you can rescale, and it's already in digital form, right? In a pretty convenient digital form. The intensity of a pixel is basically just an integer number, which can easily be converted to floating point if you want. And that can be provided as the input to your neural network. It's somewhat different with text. When you have letters, of course, you can simply take the character representation, right?

[00:00:58] But those characters are not directly associated with the words. So one way you can actually convert text to numbers that can be used to train your neural network is to simply create a vocabulary of the words. And actually, if you want to play with a dataset, one is already provided as part of the Keras datasets.

[00:01:27] The IMDB dataset, it's just the reviews from IMDB, the well-known movie reviews database. And if you try to visualize what the training data looks like, it's just lists of integers. And the thing is, you can even specify how many unique words you want to have. In this case, 10,000 unique words, and basically each of those integer numbers corresponds to a particular word in our reviews.

[00:01:59] And the training labels, so if I, let's say, print not the training data but the training labels, they will simply tell you 1 or 0, 1 being a positive review and 0 being a negative review. And what you can do, by the way, is convert those, so let me go back to the data quickly.

[00:02:37] So you can convert those indexes of words back to the vocabulary, and I'm showing you how to do this. You're just getting the word index, right? That dictionary maps each word to its index, so you basically invert the dictionary to get the key from the value.
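
The lookup described here can be sketched in plain Python. The tiny `word_index` below is a made-up stand-in for the dictionary returned by `keras.datasets.imdb.get_word_index()`, and the offset of 3 assumes the default reserved indexes (0 for padding, 1 for start, 2 for unknown) that `imdb.load_data` uses. The helper name `decode_review` is hypothetical.

```python
# Toy stand-in for imdb.get_word_index(): maps word -> rank (1 = most common).
# These entries are illustrative, not the real vocabulary.
word_index = {"the": 1, "and": 2, "a": 3, "film": 4, "brilliant": 5}

# Invert the dictionary so a word can be looked up by its index.
reverse_word_index = {index: word for word, index in word_index.items()}


def decode_review(encoded, index_from=3):
    """Turn a list of integers back into text.

    Assumes indexes below `index_from` are reserved (padding, start,
    unknown), so real words are shifted up by `index_from`.
    Anything that cannot be resolved is shown as "?".
    """
    return " ".join(reverse_word_index.get(i - index_from, "?") for i in encoded)


# 1 is the assumed start marker; 4 -> "the", 7 -> "film", 8 -> "brilliant".
print(decode_review([1, 4, 7, 8]))
```

With the real dataset, the same inversion would be applied to the full dictionary instead of the five-word toy one.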

[00:02:55] And you can print out, for instance, the first review: this film was just brilliant, casting, da da da da da. So that's the dataset, but explicit, as text. For our neural networks, though, we would like those numbers, right? And you can even see the whole list of words and the corresponding indexes, and those indexes are sorted.

[00:03:18] So the most common word will have the lowest index. So having those indexes and those words, the next step is to basically vectorize your sequences, right? Vectorization, let me quickly go through it. Vectorization means that we'll look at all of the indexes and create one list for each review. Remember, we specified that we want to have 10,000 unique words.

[00:03:53] And then, in this 10,000-element sequence, we will just flip the positions where we have the words located. So let's say, I think the first index corresponds to "the", the second is "and", and the third one is "a". So if we had the article "the" and the word "and" in our review, those corresponding bits will be flipped.
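
A minimal sketch of this vectorization step, assuming NumPy is available. The function name `vectorize_sequences` and the toy reviews are illustrative choices, and a vector with several positions set is sometimes called multi-hot rather than one-hot.

```python
import numpy as np


def vectorize_sequences(sequences, dimension=10000):
    """Encode each list of word indexes as a fixed-length 0/1 vector."""
    results = np.zeros((len(sequences), dimension), dtype="float32")
    for row, word_indexes in enumerate(sequences):
        # Flip the position of every word that occurs in this review.
        # Repeated words still leave a single 1 at that position.
        results[row, word_indexes] = 1.0
    return results


# Two toy reviews given as hypothetical word indexes, with a tiny
# vocabulary of 5 so the result is easy to read.
reviews = [[1, 2, 2], [3]]
x = vectorize_sequences(reviews, dimension=5)
print(x)
```

Note that this encoding records only which words are present; word order and word counts are lost.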

[00:04:29] And that will be the input to our neural network. So it's the simplest one. If you use this one-hot encoding, which is what this vectorization technique produces, you can use dense layers, so your neural network can actually be really simple. For instance, in this case, I'm just using 16 neurons in a dense layer, connecting to another 16 [INAUDIBLE].

[00:04:52] [INAUDIBLE] just because I'm trying to figure out if the review was positive or negative, only one neuron is enough, to either activate or not activate. So it will be activated if the review was positive, and that is what I'm gonna be training it for. So let's just quickly split our dataset into training and testing.
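
A network of this shape might look like the following in Keras. This is a sketch under assumptions: the transcript gives the layer sizes (16, 16, 1) and a binary accuracy metric, but the `relu` and `sigmoid` activations, the `rmsprop` optimizer, and the `binary_crossentropy` loss are common choices filled in here, not confirmed by the lesson. The helper name `build_model` is hypothetical.

```python
import tensorflow as tf


def build_model(vocab_size=10000):
    """Two small dense layers followed by a single output neuron."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(vocab_size,)),            # one 0/1 value per word
        tf.keras.layers.Dense(16, activation="relu"),   # assumed activation
        tf.keras.layers.Dense(16, activation="relu"),   # assumed activation
        # One neuron: output near 1 for a positive review, near 0 for negative.
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(
        optimizer="rmsprop",            # assumed optimizer
        loss="binary_crossentropy",     # assumed loss for a 0/1 label
        metrics=["binary_accuracy"],
    )
    return model


model = build_model()
model.summary()
```

Building the model fresh through a function like this also makes it easy to reinitialize the weights before retraining.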

[00:05:19] And we can start training. Let's see how long each epoch takes. Two seconds, yeah, it should not be that bad. So you can see that from the beginning I am getting about 82% accuracy on the validation dataset, which is not bad, it's not bad at all. And we can get up to 90%, I think.

[00:05:50] So later on, I can use this trained model to provide any, yeah, let's finish the training session, because I wanna visualize how our accuracy was changing for the training dataset and for the validation dataset. So remember, I can take a look at the history, right?

[00:06:13] And the history, it's a dictionary with loss, binary accuracy, validation loss, and validation binary accuracy. And I can simply visualize those. So you see that the training accuracy was increasing, right? And it almost converged to 100%, but at the same time the accuracy for the validation dataset increased initially and then started dropping.

[00:06:39] So that's an example of overfitting, and you should definitely avoid it. So what to do in this case is probably stop my training at around, I don't know, four epochs, and that's gonna be good enough. So I can simply redo this, right? So reinitialize the model and train my model only for four epochs, and that should be done pretty quickly.
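
Picking the stopping point can also be read off the history dictionary rather than the plot. The sketch below assumes the Keras key name `val_binary_accuracy`; the helper `best_epoch` is hypothetical, and the numbers are made up to mimic the shape of the curve described here, not taken from the lesson's run.

```python
def best_epoch(history, metric="val_binary_accuracy"):
    """Return the 1-based epoch at which the validation metric peaked."""
    values = history[metric]
    # Index of the first maximum, shifted so epochs count from 1.
    return max(range(len(values)), key=values.__getitem__) + 1


# Made-up values: training accuracy keeps rising while validation
# accuracy rises, peaks, and then drops (the overfitting pattern).
history = {
    "binary_accuracy": [0.80, 0.90, 0.93, 0.95, 0.97, 0.98],
    "val_binary_accuracy": [0.82, 0.87, 0.88, 0.89, 0.88, 0.87],
}
print(best_epoch(history))
```

Keras also ships an `EarlyStopping` callback that can automate this kind of check during training.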

[00:07:02] And if I look at the results, so model evaluation will simply, if you provide the test data, give you the value of the loss function and the final accuracy. So up to 88%, not bad. All right, so that's how you basically process text. One way of processing the text is just creating this one-hot encoding. The problem with this is that I'm creating a 10,000-element array for each input line.

[00:07:45] It means that I'm basically sacrificing a lot of memory to prepare this data, because I'm switching to this representation. But it's really easy to feed this into my model afterwards. So the trade-off is basically, yeah, [LAUGH] we're spending more memory, but it's gonna be easier to process the data afterwards.
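
To put a rough number on that memory cost, here is a back-of-the-envelope calculation. It assumes 25,000 training reviews (the size of the Keras IMDB training split) and 4 bytes per value (`float32`); both figures are assumptions, not stated in the lesson, and `dense_bytes` is a hypothetical helper.

```python
def dense_bytes(num_reviews, vocab_size=10000, bytes_per_value=4):
    """Memory needed to store every review as a dense 0/1 vector."""
    return num_reviews * vocab_size * bytes_per_value


# 25,000 reviews x 10,000 vocabulary slots x 4 bytes each.
total = dense_bytes(25000)
print(f"{total / 1e9:.1f} GB")
```

Most of those values are zeros, since a single review uses only a small part of the vocabulary.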

[00:08:08] An alternative way to prepare your words, instead of playing with indexing and creating this one-hot-
>> I have a question.
>> Yes?
>> So you said in this example that you overfit the data and so the validation-
>> Dropped.
>> Yeah, dropped. Would this be an example of, I guess we went to the end of the demo.

[00:08:32] But would this be invalid data, would you rerun this with a different input so you wouldn't see that drop or is this-
>> So basically, what happens when I rerun it, let me go back. So the original, the first one, I was running for 20 epochs, right?

[00:08:58] So I got to something like almost 90% accuracy, and then the accuracy started dropping to something like 85%, for instance, right? So what I did is I simply reran my model but only used four epochs. So I basically stopped as soon as we got to this point right here.

>> So you knew that this was gonna happen so the rest of the example-
>> Exactly, so I simply avoid the overfitting.