Speech to Text
-
Introduction
In many applications, we need to convert what a user says into text. This is called “speech recognition” or “voice recognition”. Note that this is still far from understanding what the user means: the system simply estimates the most likely words from the speech audio. Even so, it can already be very useful. For example, we can let a user say commands like “go” and “stop” to drive a car by voice.
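To make the “go” and “stop” idea concrete, here is a minimal Python sketch of how recognized text might be turned into a command. The function name `command_from_text` and the punctuation handling are assumptions made for illustration; they are not part of the playground.

```python
# Hypothetical helper: pick a driving command out of recognized text.
def command_from_text(recognized_text):
    """Return "go", "stop", or None if neither word was recognized."""
    for word in recognized_text.lower().split():
        # Strip simple punctuation so that "Stop!" still matches "stop".
        word = word.strip(".,!?")
        if word in ("go", "stop"):
            return word
    return None


first = command_from_text("Go")
second = command_from_text("please stop!")
third = command_from_text("hello")
```

The sketch only matches words. It does not try to understand the sentence, which mirrors the limitation described above.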
Basic Speech Recognition
Four key blocks work together to carry out basic speech recognition:
Start Speech Recognition
To start, we need to run this block:
- Language: the first input is the language the user will speak. Note that even for the same language, such as English, there is more than one accent to choose from, such as “English (United States)” vs “English (United Kingdom)”. If the user speaks a different language from the one selected, the recognition result will not be correct.
- Sound Name: the second input is optional. If it is not empty, the user’s speech will be saved as a sound clip under the “Sounds” tab for this sprite. This is handy if you need to play back what the user has just said.
When you run this block, the playground will try to get your permission to open the computer’s microphone. After getting the permission, it will keep listening until you stop it.
End Speech Recognition
To end the speech recognition, you can use this block:
After this block runs, the program will stop listening to the microphone, and the text that’s recognized will be stored in this block below.
Note that before this block runs, no recognition will be carried out and no result text will be shown. You can think of it as a 2-step process:
- Between the “start recognizing speech” block and the “end speech recognition” block, the system will do nothing but record what it hears from the microphone.
- After the “end speech recognition” block runs, the recorded audio clip will be sent to a server that converts it to text.
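The 2-step process above can be modeled with a small Python sketch. Everything here is hypothetical: `BasicRecognizer`, its methods, and `fake_server` are stand-ins invented for illustration, and strings take the place of real audio.

```python
class BasicRecognizer:
    """Toy model of the two-step flow; not the playground's real API."""

    def __init__(self, send_to_server):
        self._send_to_server = send_to_server  # stand-in for the recognition server
        self._recording = None                 # None means "not listening"
        self.text = ""                         # plays the role of "text from speech"

    def start(self):
        # Step 1 begins: start collecting audio, but do no recognition yet.
        self._recording = []

    def hear(self, chunk):
        # Audio is only kept while a session is open.
        if self._recording is not None:
            self._recording.append(chunk)

    def end(self):
        # Step 2: stop listening and send the whole clip to the server.
        if self._recording is None:
            return
        clip, self._recording = self._recording, None
        self.text = self._send_to_server(clip)


def fake_server(clip):
    # Pretend each audio chunk is already a word; a real server does far more.
    return " ".join(clip)


rec = BasicRecognizer(fake_server)
rec.start()
rec.hear("go")
rec.hear("forward")
text_before = rec.text   # still empty: nothing is recognized while recording
rec.end()
text_after = rec.text    # filled in only after the session ends
```

The point of the sketch is the ordering: `text` stays empty during recording and is only set once `end()` hands the clip to the server.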
Text from Speech
The recognized text will be stored in this reporter block, so you can think of it as a special variable:
Clear Speech Text
This block can be used to clear the value of the previously recognized text that’s stored in the “text from speech” block. This way, there is no confusion about whether the text is from a previous session or the current session.
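As a rough illustration of why clearing helps, the Python sketch below treats the stored text as a simple value. The names `speech_text`, `clear_speech_text`, and `has_new_result` are hypothetical, and the sketch assumes that clearing resets the value to an empty string.

```python
# Toy stand-in for the "text from speech" reporter; not the real block.
speech_text = {"value": "stop"}  # left over from an earlier session


def clear_speech_text(state):
    # Assumption: clearing resets the stored text to an empty string.
    state["value"] = ""


def has_new_result(state):
    # A program might treat any non-empty text as "a result has arrived".
    return state["value"] != ""


stale = has_new_result(speech_text)        # True, but the text is from the old session
clear_speech_text(speech_text)
after_clear = has_new_result(speech_text)  # False until a new result is stored
```

Without the clear step, a program that checks for non-empty text would react to the old “stop” before the new session has produced anything.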
Example Program
In this example, we will listen to the user for 3 seconds and record the speech as a sound clip named “s1”. After ending the recognition, we make the dog say the recognized text and also play back the sound clip “s1”.
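The flow of this example can be sketched in Python as follows. This is an illustration only: `run_example` and the `recognize` callback are hypothetical, strings stand in for audio, and the 3-second wait is represented by whatever chunks are passed in.

```python
def run_example(chunks_heard, recognize,
                language="English (United States)", sound_name="s1"):
    """Return the ordered actions the example program would perform."""
    sounds = {}
    recorded = list(chunks_heard)   # everything heard during the 3-second wait
    if sound_name:                  # an empty name means "do not save a clip"
        sounds[sound_name] = recorded
    # Recognition happens only after listening has ended.
    text = recognize(recorded, language)
    actions = [("say", text)]
    if sound_name in sounds:
        actions.append(("play", sound_name))
    return actions


def fake_recognize(clip, language):
    # Stand-in for the server; it ignores the language and joins the chunks.
    return " ".join(clip)


with_clip = run_example(["hello", "there"], fake_recognize)
without_clip = run_example(["hello", "there"], fake_recognize, sound_name="")
```

Passing an empty `sound_name` mirrors leaving the optional second input blank, so there is no clip to play back afterwards.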
Continuous Speech Recognition
There is another way to do speech recognition: continuously. As explained earlier, when you use the 4 blocks above, no recognition is done while the user speaks. The recognition work only starts when you run “end speech recognition”. If the user talks for a long time, this can be very inconvenient, since the recognized text is not available while the user is still talking into the microphone.
In contrast, in continuous speech recognition, the user’s speech is sent to the server as a stream of chunks, so the server converts it to text while the user speaks. You need to learn the following 2 blocks to use this method.
When the first block is used, the system will start listening on the microphone and streaming the audio chunks to the server. The recognized text also arrives in chunks, and each part is stored as a new item in the given list.
When the second block runs, the system will stop listening on the microphone right away and send all remaining audio chunks to the server. It will also wait for the server to return all the remaining recognized text parts and store them in the list.
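The streaming behaviour described above can be sketched in Python. The class `ContinuousRecognizer`, its methods, and `fake_server` are hypothetical stand-ins, strings take the place of audio, and `pump()` simulates one round trip to the server.

```python
class ContinuousRecognizer:
    """Toy model of streaming recognition; not the playground's real API."""

    def __init__(self, recognize_chunk, results):
        self._recognize_chunk = recognize_chunk  # stand-in for the server
        self._results = results                  # the list given to the start block
        self._pending = []                       # audio chunks not yet sent
        self._listening = False

    def start(self):
        self._listening = True

    def hear(self, chunk):
        # Audio is only queued while listening.
        if self._listening:
            self._pending.append(chunk)

    def pump(self):
        # Send one pending chunk and store the returned text as a new list item.
        if self._pending:
            self._results.append(self._recognize_chunk(self._pending.pop(0)))

    def end(self):
        # Stop listening right away, then flush every remaining chunk.
        self._listening = False
        while self._pending:
            self.pump()


def fake_server(chunk):
    # Pretend each chunk is already a word.
    return chunk.strip()


parts = []
rec = ContinuousRecognizer(fake_server, parts)
rec.start()
rec.hear("turn")
rec.pump()
during = list(parts)   # text is already available while still listening
rec.hear("left")
rec.hear("now")
rec.end()              # remaining chunks are sent and their text is stored
rec.hear("ignored")    # nothing is queued after the session has ended
```

Unlike the basic flow, the list already contains text before `end()` is called, and `end()` only has to drain whatever is still queued.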
Example Program
In this simple example, we run continuous speech recognition for 10 seconds, and as the user speaks, the recognized text will be appended to the given list part by part. We no longer need to wait until we end the speech recognition to check what the user is saying.
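One way a program might check the list while recognition is still running is to look only at the items added since the last check. The helper below is a hypothetical Python sketch, with a plain list standing in for the list that receives the recognized parts.

```python
def take_new_parts(parts, already_seen):
    """Return the parts added since the last check and the updated count."""
    fresh = parts[already_seen:]
    return fresh, len(parts)


parts = ["turn"]                        # filled in by recognition as the user speaks
fresh_1, seen = take_new_parts(parts, 0)
parts.extend(["left", "now"])           # more text arrives later
fresh_2, seen = take_new_parts(parts, seen)
```

Each call returns only the new items, so the program can react to them without waiting for recognition to end.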