Speech to Text

CreatiCode

Introduction

In many applications, we need to convert what a user says into text. This is called “speech recognition” or “voice recognition”. Note that this is NOT about understanding the meaning of what the user says. It simply guesses which words were spoken, based on the speech audio.

However, this can already be very useful in some applications. For example, for users with disabilities who have difficulty using a mouse or keyboard, being able to input commands with voice is essential.

Basic Speech Recognition

Four key blocks work together to carry out basic speech recognition:

Start Speech Recognition

To start, we need to run this block:

Language: The first input is the language the user will speak. Note that even for the same language, such as English, there is more than one variant you can choose from, such as “English (United States)” vs “English (United Kingdom)”. If the user speaks a language other than the one selected, the recognition result will not be correct.
Sound Name: The second input is optional. If it is not empty, then the user’s voice will be saved as a sound clip under the “Sounds” tab for this sprite. This will be handy if you need to play back what the user has just said.

When you run this block, the playground will ask for your permission to use the computer’s microphone. Once you give consent, it will keep listening until we stop the speech recognition (see below).

End Speech Recognition

To stop the speech recognition, you can use this block:

When this block is executed, the program will stop listening to the microphone, and the text that’s recognized will be stored in the reporter block “text from speech” (see below).

Note that before this block runs, no recognition will be carried out and no result text will be shown. You can think of it as a 2-step process:

Between the “start recognizing speech” block and the “end speech recognition” block, the system will do nothing except record what it hears from the microphone.
After the “end speech recognition” block runs, the recorded audio clip will be sent to a server (another computer) that converts the audio to text.

Text from Speech

The recognized text will be stored in this reporter block, so you can think of it as a special variable:

Clear Speech Text

This block can be used to clear the value of the previously recognized text that’s stored in the “text from speech” block. This way, there is no confusion about whether its value is from a previous session or the current session.
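
The blocks themselves are visual, so the following Python is only a rough text sketch of how the four blocks relate to each other. It is not CreatiCode's actual implementation: the class `SpeechSession`, its method names, and the `pretend_recognizer` function are all hypothetical, and no real microphone or server is involved. It illustrates the 2-step process: nothing is recognized while recording, and the result only appears after the recognition is ended.

```python
class SpeechSession:
    """Toy model of the four blocks; no real microphone or server is used."""

    def __init__(self, fake_server):
        # fake_server: hypothetical function that turns recorded audio into text
        self._fake_server = fake_server
        self._recording = None       # None means "not listening"
        self.text_from_speech = ""   # plays the role of the reporter block

    def start(self):
        # Step 1: only record; nothing is recognized yet
        self._recording = []

    def hear(self, audio_piece):
        # Audio that arrives while listening is simply stored
        if self._recording is not None:
            self._recording.append(audio_piece)

    def end(self):
        # Step 2: send the whole recording for recognition and keep the result
        if self._recording is None:
            return
        self.text_from_speech = self._fake_server(self._recording)
        self._recording = None

    def clear(self):
        # Forget the previous result so it cannot be mistaken for a new one
        self.text_from_speech = ""


def pretend_recognizer(pieces):
    # Stand-in for the server: here the "audio" pieces are already words
    return " ".join(pieces)


session = SpeechSession(pretend_recognizer)
session.start()
session.hear("hello")
session.hear("world")
before_end = session.text_from_speech   # still empty at this point
session.end()
after_end = session.text_from_speech    # now holds the recognized text
session.clear()
after_clear = session.text_from_speech  # empty again
```

Clearing the text before each new session is what makes it safe to check whether the result is empty afterwards.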

When to stop?

One of the key design questions for using these speech recognition blocks is when to stop recording audio. If we stop too early, then the user may not have enough time to say what they need to say; if we stop too late, then it is a waste of the user’s time.

Next, let’s look at a few typical methods to handle this issue.

Example Design 1 - Wait and Then Stop

In this example, we will listen to the user for a fixed number of seconds (e.g. 3) and record the speech as a sound clip named “s1”. After ending the recognition, we make the dog say the recognized text and also play back the sound clip s1.

You can try it out here: play.creaticode.com/projects/67a189b37b46e6c9fc9d7618

As you try out this program, test different waiting times. You will likely find that it is challenging to determine a suitable waiting time value that works consistently. Sometimes it is not long enough, and sometimes it is too long.
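
To see why a single waiting time rarely fits, here is a small Python sketch of the trade-off. The function name `fixed_wait_outcome` and the sample durations are made up for illustration; the real project simply waits and then ends the recognition.

```python
def fixed_wait_outcome(speech_seconds, wait_seconds):
    """Describe what happens when we always stop after wait_seconds."""
    if speech_seconds > wait_seconds:
        # The user was still talking when we stopped listening
        return f"cut off {speech_seconds - wait_seconds}s early"
    if speech_seconds < wait_seconds:
        # The user finished, but we kept listening to silence
        return f"{wait_seconds - speech_seconds}s of silence"
    return "just right"


# Hypothetical speech lengths (in seconds) tested against a 3-second wait
outcomes = [fixed_wait_outcome(length, 3) for length in (1, 3, 5)]
```

With a 3-second wait, a 1-second command leaves the user waiting, while a 5-second sentence loses its ending.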

Example Design 2 - Start and Stop Buttons

In the second example, we will show 2 buttons, one for starting the recognition, and another for stopping it. After clicking the start button, we start recording the speech. When the stop button is clicked, we will end the recording and run the recognition.

Here are the blocks:

You can try it here: play.creaticode.com/projects/680e6446955fd2f624f578cd

This method gives the user much more control, since they can speak for as long as they like before clicking the stop button. For a slightly simpler design, you can also “merge” these 2 buttons into one: when the “start” button is clicked, its text changes to “stop”, and when it is clicked again, it changes back to “start”.
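
The merged button is a small two-state toggle. The Python below is only a sketch of that logic under assumed names (`ToggleButton`, `click`, and the `events` list are hypothetical); in the playground, the same idea would be built with button blocks and the speech blocks described above.

```python
class ToggleButton:
    """One button whose label decides what the next click does."""

    def __init__(self):
        self.label = "start"
        self.events = []  # which speech block would run on each click

    def click(self):
        if self.label == "start":
            # First click: begin recording, and offer "stop" next
            self.events.append("start speech recognition")
            self.label = "stop"
        else:
            # Second click: end recording, and go back to "start"
            self.events.append("end speech recognition")
            self.label = "start"


button = ToggleButton()
button.click()
label_while_recording = button.label
button.click()
label_after_stop = button.label
```

Using the label itself as the state keeps the design simple: there is no separate variable that could get out of sync with what the user sees.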

Example Design 3 - Push to Talk

When the user only needs to give a very short command, Design 2 may seem like “too much work”, since the user needs to click twice. Therefore, for very short speech, we can use a “push-to-talk” design, which is similar to how walkie-talkies work: the user presses down a button to start talking, and releases it to stop.

Here are the blocks for the button sprite:

And here are the blocks for the Dog sprite:

You can try it here: play.creaticode.com/projects/680e6a9b955fd2f624f57f3e

Continuous Speech Recognition

As explained earlier, when you use the blocks above, no recognition is done while the user speaks. The recognition work only starts when you run “end speech recognition”. If the user talks for a long time, this may be very inconvenient, since we do not have the recognized text while the user is speaking.

There is another way to do speech recognition continuously. In this mode, the speech of the user will be sent to the server in a stream of chunks, so the server will convert them to text while the user speaks and return them sentence by sentence.

You can use the following 2 blocks for this new method.

When the first block is executed, the system will start listening to the microphone and streaming the audio chunks to the server. When it receives the recognized text (which also comes in chunks), the text will be stored in the given list. Each part (usually a new sentence) will be stored as a new item in the list.

When the second block runs, the system will stop listening to the microphone right away and send all remaining audio chunks to the server. It will also wait for the server to return all the remaining recognized text parts and store them in the list.
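
As a rough mental model of continuous mode, the Python sketch below pretends that each audio chunk is already a word and that a sentence ends at ".", "?" or "!". These are simplifying assumptions for illustration, and `stream_recognition` is a hypothetical name, not part of CreatiCode. The point is that the list grows while the chunks are still arriving, and that any leftover text is flushed when listening stops.

```python
def stream_recognition(audio_chunks, parts):
    """Pretend server: append one list item each time a sentence is complete."""
    current = []
    for chunk in audio_chunks:
        current.append(chunk)
        if chunk.endswith((".", "?", "!")):
            parts.append(" ".join(current))
            current = []
            yield len(parts)  # the caller can already read `parts` here
    if current:
        # Listening has stopped: flush whatever text is still pending
        parts.append(" ".join(current))
        yield len(parts)


parts = []
chunks = ["Hello", "there.", "How", "are", "you?", "Fine"]
sizes = list(stream_recognition(chunks, parts))
```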

Example Program 1

In this simple example, we run continuous speech recognition for 10 seconds, and as the user speaks, the recognized text will be appended to the given list part by part. We no longer need to wait until we end the speech recognition to check what the user is saying.

You can try it here: play.creaticode.com/projects/680e547e68d8b3054e15aecd

Note that we still need to decide when to stop the recognition. Since the speech is usually long when we use continuous recognition, it is probably best to allow the user to manually start and stop the recognition (example design 2 above).

Example Program 2

Sometimes, when we receive a newly recognized sentence, we need to display it, so the user can see what is being recognized:

We need to extract the new text from the list “parts”. One way to do this is to use a new variable, “row count,” to keep track of how many rows of text have been displayed already. So if the number of rows in the list “parts” is more than the “row count”, that means there is some new text in the list. Then, we can extract the new text and display it:
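
The “row count” idea can be written out in a few lines of Python. This is only a sketch of the bookkeeping, with a hypothetical helper named `take_new_rows`; in the project, the same logic is built from list and variable blocks.

```python
def take_new_rows(parts, row_count):
    """Return the rows not yet displayed, plus the updated row count."""
    new_rows = parts[row_count:]  # everything after the rows already shown
    return new_rows, len(parts)


parts = ["Hello there."]
row_count = 0
shown = []

# First check: one new row is available
new_rows, row_count = take_new_rows(parts, row_count)
shown.extend(new_rows)

# The recognizer adds another sentence, so the next check finds it
parts.append("How are you?")
new_rows, row_count = take_new_rows(parts, row_count)
shown.extend(new_rows)

# Nothing has changed since the last check, so nothing new is returned
new_rows, row_count = take_new_rows(parts, row_count)
```

Because “row count” is always set to the current list length after each check, each row is displayed exactly once.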

You can try it here: play.creaticode.com/projects/680e945d955fd2f624f5b075

Exercise

Modify the project above so that the user can speak out commands like “turn left” or “turn right”, and a car sprite changes its direction accordingly.
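
As a hint, the core of this exercise is checking whether the recognized text contains a known command. The Python below sketches one possible approach; the function name `apply_command` and the 90-degree turn are assumptions, and your block-based solution may look quite different.

```python
def apply_command(direction, spoken_text, turn_degrees=90):
    """Return the car's new direction after a spoken command."""
    # Recognized text may include capital letters or extra words
    text = spoken_text.lower()
    if "turn left" in text:
        direction -= turn_degrees
    elif "turn right" in text:
        direction += turn_degrees
    # Keep the direction between 0 and 359 degrees
    return direction % 360


after_left = apply_command(90, "Turn left.")
after_right = apply_command(0, "please turn right")
unchanged = apply_command(45, "hello")
```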