The "Training an Audio Model" Lesson is part of the full, Machine Learning in JavaScript with TensorFlow.js course featured in this preview video. Here's what you'd learn in this lesson:
Charlie demonstrates how to train an audio model in Teachable Machine. First, a background noise sample is captured. Then, samples for each sound are created. The model creates an image representation of each audio sample and uses that to recognize input audio.
Transcript from the "Training an Audio Model" Lesson
[00:00:00]
>> But just to get back to Teachable Machine itself, I was gonna talk briefly about using sound. So far we've mostly talked about using images, but you can use transfer learning on sound samples as well. The first thing that's a little more annoying for sound is that you have to have a background noise sample of at least, I think, 20 seconds, because the samples are taken one second at a time.
[00:00:29]
So maybe I'll just be quiet for 20 seconds to have an empty background noise sample. I'm gonna allow the microphone, and then I'm just gonna be quiet. Cool, okay, so we're gonna extract the samples. And as you can see here, while it records audio, it also creates little spectrograms. What happens behind the scenes is that even though we're working with audio, it's still using a convolutional neural network, because it's working with images.
[00:00:59]
Working with tiny images like these, just a few pixels, goes a lot faster than using raw audio data. If you've ever used the Web Audio API, raw audio data is thousands or millions of data points, and that would take a lot longer. So instead, we're transforming the raw audio data into these little images, and that's what gets fed to a convolutional neural network, which lets the training finish in a few seconds.
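To make the scale difference concrete, here's a minimal sketch using the Web Audio API mentioned above; it's not what Teachable Machine runs internally, and the fftSize value is just an illustrative choice. A spectrogram image is essentially many of these frequency frames stacked side by side:

```js
// Minimal sketch: grab one frame of frequency data with the Web Audio API.
async function captureFrame() {
  const audioCtx = new AudioContext();
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const source = audioCtx.createMediaStreamSource(stream);

  const analyser = audioCtx.createAnalyser();
  analyser.fftSize = 2048; // illustrative: 2048 raw samples per frame
  source.connect(analyser);

  // frequencyBinCount is fftSize / 2 = 1024 bins per frame --
  // far fewer numbers than the ~44,100 raw samples per second.
  const bins = new Uint8Array(analyser.frequencyBinCount);
  analyser.getByteFrequencyData(bins);
  return bins;
}
```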
[00:01:22]
And so here, in terms of background noise, it's pretty consistent. But let's say that I want to record a clap. That's gonna be an easy one, okay? And if I do it once again, [SOUND]. Okay, you can see here I'm extracting it. I mean, I don't know if you can see it well, but you can see a little bit.
[00:01:41]
So here you can see that it's a little bit different from the background noise, where we could see nothing. Here, in terms of the frequency, I guess, it's a little bit higher. And as I record more and more, [SOUND] you can start to see a pattern across the two claps, yes?
[00:01:58]
>> So it's recording the amplitude and the frequency range?
>> Yep.
>> And then, I guess, in terms of comparing it to an image, it's kind of reducing the pixels?
>> Exactly, yes. And then it's gonna be able to do the same thing as we usually do in image recognition, where you transform your picture into, basically, a long one-dimensional tensor.
[00:02:22]
So just a long array of numbers that represents the image. And you can see the difference here: this clap kinda looks like this clap, but it doesn't look like this one. Here I only recorded two, but let's try another class as well, where I'm going to cough.
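To picture what that "long array of numbers" looks like in TensorFlow.js, here's a tiny sketch; the spectrogram shape is made up for illustration, not necessarily what Teachable Machine uses:

```js
import * as tf from '@tensorflow/tfjs';

// Illustrative spectrogram: 43 time steps x 232 frequency bins
// (made-up shape for demonstration purposes).
const spectrogram = tf.randomNormal([43, 232]);

// Flatten into one long 1D tensor -- the "long array of numbers"
// that samples get compared on.
const flat = spectrogram.flatten();
console.log(flat.shape); // [9976]
```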
[00:02:41]
Okay, and you should see that the pattern here is different as well. So if I do, [COUGH], you see it is really quite different. So, [COUGH]. [LAUGH] Okay, and here you can see, again, even just visually, that in the background noise there's not much going on, and then the clap is kind of a short, snappy thing.
[00:03:03]
And here, when I'm coughing, if you've ever done voice recognition, you can see that your voice is made of multiple frequencies. It's really reduced, you can see it's quite pixelated, but it doesn't necessarily need to be much bigger to start seeing a pattern.
[00:03:19]
The bigger the images are, if you don't resize them, the longer it's going to take. And a lot of the time, to be able to see a pattern in an image, you don't need it to be that big. So here I don't have enough samples yet; actually, the minimum is eight.
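Resizing is a one-liner in TensorFlow.js; a hedged sketch with made-up dimensions:

```js
import * as tf from '@tensorflow/tfjs';

// Sketch: shrink an image tensor before training (dimensions are
// illustrative). A 64x64 input has 1/16 the pixels of 256x256.
const bigImage = tf.randomUniform([256, 256, 3]);
const small = tf.image.resizeBilinear(bigImage, [64, 64]);
console.log(small.shape); // [64, 64, 3]
```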
[00:03:32]
Let me try to get to the minimum of eight, and then we'll see. So I have to clap, oops. [SOUND] I'm just gonna try. [SOUND] Even when I do it here, visually, you will be able to recognize what's a clap. [SOUND] Okay, we have eight, and then I need six more coughs.
[00:03:58]
[COUGH] That one didn't record at all. [COUGH] Okay, that's just eight samples, but that's the minimum, so it should still work. So we're going to train the model with our audio-image data, and we'll have the same thing as before. At first, if I'm speaking, it doesn't quite know what it is, so it's just gonna say background.
[00:04:29]
But then if I do, [SOUND], okay, so you see there's a bit of delay. I mean, there's an overlap factor setting where, I think, the window could be shorter and it will predict more often. But [COUGH]. Yeah, so again, if you play with sounds, there's always a little bit of a delay.
[00:04:44]
Let's see, I actually didn't try that, if we do 0.10. I don't know whether it's longer, okay. [COUGH] Okay, that's much faster. So I think the bigger the overlap factor, the faster it is. [SOUND] Okay, and in the same way, you can export this one as well, though I'm not sure the code would be exactly the same.
[00:05:07]
Instead of using the webcam, there's probably a Teachable Machine helper that gives you access to the microphone and things like that. Question?
>> Just to confirm, is it comparing the audio files? Did you say it's also comparing the images, or are the images just for the user to see?
>> It's only comparing the images.
[00:05:24]
So it's transforming the sound inputs into spectrograms, and then it's feeding those images into the model. It's not using the audio data directly, because the audio data would be too much. And in general, when you do some kind of data processing, it's a lot easier to find patterns when you visualize the data than when you just see an array of a lot of numbers.
[00:05:46]
I mean, for us it's easier to visualize, but yes, when it's fed into the model, it transforms the images into what are called tensors, and the machine learning model works with those. But at least for us, we're able to ask in advance: do I think my model is going to be able to figure this out or not?
[00:06:03]
Because if my claps and my coughs looked too similar to my background, there's less of a chance that the prediction will be accurate. So it's helpful for us, but it also confirms that it really is an image model being used behind the scenes.
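For the export step mentioned earlier: Teachable Machine audio models are typically loaded with the speech-commands library rather than the webcam helper. This is a sketch under that assumption; the model URL is a placeholder, and the option values are illustrative:

```js
import * as speechCommands from '@tensorflow-models/speech-commands';

// Placeholder URL -- a real Teachable Machine export provides this.
const URL = 'https://teachablemachine.withgoogle.com/models/YOUR_MODEL/';

async function listenForSounds() {
  const recognizer = speechCommands.create(
    'BROWSER_FFT',            // use the browser's native FFT
    undefined,                // no built-in vocabulary
    URL + 'model.json',
    URL + 'metadata.json'
  );
  await recognizer.ensureModelLoaded();

  const labels = recognizer.wordLabels(); // e.g. background, clap, cough

  recognizer.listen(result => {
    // result.scores holds one probability per label
    const scores = result.scores;
    const best = scores.indexOf(Math.max(...scores));
    console.log(labels[best], scores[best]);
  }, {
    probabilityThreshold: 0.75,
    overlapFactor: 0.5 // controls how often predictions fire
  });
}

listenForSounds();
```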