A Tour of Web Capabilities

Speech Recognition API

Maximiliano Firtman

Maximiliano Firtman

Independent Consultant
A Tour of Web Capabilities

Check out a free preview of the full A Tour of Web Capabilities course

The "Speech Recognition API" Lesson is part of the full, A Tour of Web Capabilities course featured in this preview video. Here's what you'd learn in this lesson:

Max introduces the Speech Recognition API. Once microphone permissions are granted, an application can use the API to detect words or phrases in a specified language. A confidence value is also returned as the speech is processed.


Transcript from the "Speech Recognition API" Lesson

>> So now we are going to get into speech, voice, and camera. So here we have the example on the camera. So let's start with a speech recognition. So I'm not sure if you know that, but you can talk to websites. So there is an API that is green, so it's available everywhere, that allows her to listen for the user's voice and get a text transcript.

Today, it works at least on Safari only after a user gesture, and you will need microphone permission because it will open your microphone. So you start a new speech recognition, you can specify if you want continuous or not. If you say not, it will give you the first phrase, and that's all, continuous will continue giving you the answers.

Also, sometimes you can even ask for getting partial results, because when the voice is being processed, there are several steps going on that is not completely useful, but anyway. And when you get a result, an event is going to be fire, and you will get the text with a percentage of confidence.

So the API will tell you, well, I think that this is what he said or she said. And I'm using this API in my cooking app, because hey, we wanna go hands-free. So let's see it and then analyze the code. So going back to, do I have it now?

Let me load that again, Cooking Masters. So I will actually get into one of the things here and it will say, let me open the console. And before doing that, within Capabilities, there is a section that's bluetooth, we haven't seen that yet. We have a section here for microphone, okay?

Well, I will show you this code in a minute, but here I already have a console.log, so we can see this in action. And by the way, here are the keywords that I'm looking for, hey cooking next, hey cooking previous. By the way, I don't need hey cooking, I can just say next.

But just because I'm also speaking here to you, okay, I don't want the app to get crazy. So let's see, let me open the console so we can see the output of this. So let's start cooking, it's full screen but the DevTool is still there, which is okay.

Safari doesn't have that option, when Safari goes fullscreen, the DevTools goes to a different desktop, which is complicated for presentation purposes. So how to enable that? Well, there is a microphone icon there. So if I click there, it first ask me for microphone permission. Can you see that at the top?

I will say Allow, okay? And now, it's listening. Can you see down there, okay, in the console? And I can say, again, hey cooking next, Hey cooking next, hey cooking exit. It works, and it works pretty well, okay? [LAUGH] So for a cooking app, it works pretty well.

So the only thing you need is you need to enable that. And then you can continue listening forever while your fish is there, can still hear in the user and the text. And it also can receive texts in different languages. Right now, it's English, I'm not going to change everything here.

But if I try it here and change that to Spanish, that's the only language I know. So it's not gonna change anything, but at least, [FOREIGN] That's right, by the way. [FOREIGN] It works. By the way, how many languages, which languages? Actually, that's not up to the spec, that's up to the browser.

And some browsers are reliant on the OS services for that. So it's up to the browser, up to the operating system. What you can know, you can get the list of all the available languages as well. So you can try different languages. Also in the list of options that we have here in github.io/capabilities, if you wanna play with this, we have this Web Speech Synthesis demo.

Now, Speech Synthesis, not yet, Recognition demo. Well, you can try also not just English, English from Nigeria with different accent, or, no, Filipino. So those are all the languages that are supported by this browser, that is, Chrome, okay? Yeah, I don't know, for Francois, the only thing I know is [FOREIGN].

Yeah, that's the only French you will get from me. But you can see, it works, right? So it works, I know a little bit of Portuguese from, [FOREIGN] Well, I think I need to restart, I don't think it's, when you change that, this web app. Anyway, you got the idea, I don't know Klingon, should try Klingon here.

Anyway, say [FOREIGN], that's thank you very much. Any question on speech recognition? I will show you the code, you have it also, but let's review the code. It's not really complicated once you understand what it's doing. The only warning here, I will go with that in a second, is that it needs a prefix.

That looks from the past, actually, I'm not sure if you have ever encountered a prefixed API before. It was pretty common a few years ago. Today, we are going to see probably two APIs that are still prefixed with WebKit. And by the way, I mean, Chrome, is Chrome WebKit?

Not or not anymore, but it's still using that. So instead of creating a new speech recognition, we need to create a new WebKit speech recognition. Then I'm saying, continuous true, the language that I'm expecting. And then addEventListener for result, another one for a start, stop, and error. So if you have a permission issue or something happen, you will get into error.

Start means we are ready to start listening to the user. A stop is, it's a stop because you have stopped it or because the user went to permissions, stop the microphone, or whatever. I don't know, but you won't receive any more results. And then you have result, when you check if it's a result, you get an array of results because it's accumulating the text.

So I'm always getting the last one, that is, the new one, because in my case, I'm looking for the phrase. Then I'm getting the confidence as an attribute and the text, that's all. And in this case, I'm removing a possible dot at the end and also trimming this because sometimes it has spaces.

And because it's a phrase, it's a sentence, sometimes it ends with the dot. So in case you have a dot, I'm removing the dot. This depends on the browser if it adds a dot or not, and then I'm checking based on what I hear, I do things. And that's kind of all the API, it works on mobile devices, iPhone, Android, iPads.

Do you have any question?
>> In regards to changing the language, is it still gonna work with if you do Spanish while speaking English? It was have like-
>> Use the both at the same time or?
>> Yeah, yeah, so if you have Spanish as your designated language in the code but then you're speaking English.

>> So if you set up their English and then you speak in Spanish, it will take probably some words with a very low confidence. Unless you are kind of doing, instead of saying holla, you say holla. That is kind of an Englishy version of holla. But if you don't do that kind of weird part, yeah, it's not gonna work.

So it's not multilanguage, you have to specify to the API that you're switching languages to actually get the right text with the right confidence.

Learn Straight from the Experts Who Shape the Modern Web

  • In-depth Courses
  • Industry Leading Experts
  • Learning Paths
  • Live Interactive Workshops
Get Unlimited Access Now