
    Speech to Text

    • info-creaticode
      CreatiCode last edited by admin

       

      Introduction

       

      In many applications, we need to convert what a user says into text. This is called “speech recognition” or “voice recognition”. Note that this is NOT about understanding the meaning of what the user says. It only transcribes the words it hears in the speech audio.

      However, this can already be very useful in some applications. For example, for users with disabilities who have difficulty using a mouse or keyboard, being able to input commands with voice is essential.

       

      Basic Speech Recognition

       
      Four key blocks work together to carry out basic speech recognition:

       

      Start Speech Recognition

       
      To start, we need to run this block:

      5b680179-60fc-4ee0-9845-dbb755505a01-image.png

      • Language: The first input is the language the user will speak. Note that even for the same language, such as English, there is more than one regional variant you can choose from, such as “English (United States)” vs “English (United Kingdom)”. If the user speaks a language other than the one selected, the recognition result will not be correct.

      • Sound Name: The second input is optional. If it is not empty, then the user’s voice will be saved as a sound clip under the “Sounds” tab for this sprite. This will be handy if you need to play back what the user has just said.

      When you run this block, the playground will try to get your permission to open the computer’s microphone. After getting the consent, it will keep listening until we stop the speech recognition (see below).

       

      End Speech Recognition

       
      To stop the speech recognition, you can use this block:

      e6cb83e0-20ac-4498-a5e7-c0a4ad088df3-image.png

      When this block is executed, the program will stop listening to the microphone, and the text that’s recognized will be stored in the reporter block “text from speech” (see below).

      Note that before this block runs, no recognition will be carried out and no result text will be shown. You can think of it as a 2-step process:

      1. Between the “start recognizing speech” block and the “end speech recognition” block, the system will do nothing except record what it hears from the microphone.

      2. After the “end speech recognition” block runs, the recorded audio clip will be sent to a server (another computer) that converts the audio to text.
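
      The two steps above can be modeled in a few lines of Python. This is only an illustrative sketch, not how CreatiCode implements the blocks: the class BasicRecognizer, its methods, and the fake_server function are hypothetical names, and the "server" here simply joins the recorded words together.

```python
class BasicRecognizer:
    """Toy model of the record-then-recognize flow (all names are hypothetical)."""

    def __init__(self, server):
        self.server = server          # function that turns audio chunks into text
        self.recording = False
        self.audio = []               # chunks heard since start()
        self.text_from_speech = ""    # stays empty until end() runs

    def start(self):
        self.recording = True
        self.audio = []

    def hear(self, chunk):
        # Step 1: while recording, chunks are only stored; nothing is recognized.
        if self.recording:
            self.audio.append(chunk)

    def end(self):
        # Step 2: stop listening, then send the whole clip for recognition.
        self.recording = False
        self.text_from_speech = self.server(self.audio)

    def clear(self):
        # Mirrors the idea of clearing the previously recognized text.
        self.text_from_speech = ""


def fake_server(chunks):
    # Stand-in for the real recognition server (an assumption for this sketch).
    return " ".join(chunks)


rec = BasicRecognizer(fake_server)
rec.start()
rec.hear("turn")
rec.hear("left")
text_before_end = rec.text_from_speech   # still empty: no recognition yet
rec.end()
text_after_end = rec.text_from_speech    # filled in only after end()
```

      The point of the sketch is that the result only becomes available after the "end" step runs, which is why the timing of that step matters so much.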

       

      Text from Speech

       
      The recognized text will be stored in this reporter block, so you can think of it as a special variable:

      d690c8f6-9602-453e-bd76-fb880d1dee52-image.png

       

      Clear Speech Text

       
      This block can be used to clear the value of the previously recognized text that’s stored in the “text from speech” block. This way, there is no confusion about whether its value is from a previous session or the current session.

       

      When to stop?

       
      One of the key design questions for using these speech recognition blocks is when to stop recording audio. If we stop too early, then the user may not have enough time to say what they need to say; if we stop too late, then it is a waste of the user’s time.

      Next, let’s look at a few typical methods to handle this issue.

       

      Example Design 1 - Wait and Then Stop

       

      In this example, we will listen to the user for a fixed number of seconds (e.g. 3) and record the speech as a sound clip named “s1”. After ending the recognition, we make the dog say the recognized text and also play back the sound clip “s1”.

      speechrec3seconds.gif

       
      You can try it out here: play.creaticode.com/projects/67a189b37b46e6c9fc9d7618

      As you try out this program, test different waiting times. You will likely find that it is challenging to determine a suitable waiting time value that works consistently. Sometimes it is not long enough, and sometimes it is too long.

       

      Example Design 2 - Start and Stop Buttons

       

      In the second example, we will show 2 buttons, one for starting the recognition, and another for stopping it. After clicking the start button, we start recording the speech. When the stop button is clicked, we will end the recording and run the recognition.

      startandstop.gif

       
      Here are the blocks:

      09e27093-18d9-49aa-b121-adcd53be728d-image.png

       
      You can try it here: play.creaticode.com/projects/680e6446955fd2f624f578cd

      This method gives the user much more control, as they can say anything of any length before clicking the stop button. For a slightly simpler design, you can also “merge” these 2 buttons into one: when the “start” button is clicked, its text changes to “stop”, and when it is clicked again, it changes back to “start”.
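
      The merged single-button idea is a small two-state toggle. Here is a minimal Python sketch of that logic, assuming a hypothetical ToggleButton class; the returned strings only name which speech block the real project would run at that moment.

```python
class ToggleButton:
    """One button that alternates between starting and stopping (hypothetical)."""

    def __init__(self):
        self.label = "start"      # text currently shown on the button
        self.listening = False    # whether recognition is in progress

    def click(self):
        if not self.listening:
            # First click: begin recording and relabel the button.
            self.listening = True
            self.label = "stop"
            return "start speech recognition"
        # Second click: stop recording and switch the label back.
        self.listening = False
        self.label = "start"
        return "end speech recognition"


button = ToggleButton()
first_action = button.click()
label_while_listening = button.label
second_action = button.click()
label_after_stop = button.label
```

      Keeping the state in one place (the listening flag) means the label and the action can never get out of sync.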

       

      Example Design 3 - Push to Talk

       

      When the user only needs to give a very short command, design 2 may seem like “too much work”, since the user needs to click twice. Therefore, for very short utterances, we can use a “push-to-talk” design, which is similar to how walkie-talkies work: the user presses down a button to start talking, and releases it to stop.

      pushtotalk.gif

      Here are the blocks for the button sprite:

      1fdf7270-7ffb-49dd-8caf-9517cab5b296-image.png

       
      And here are the blocks for the Dog sprite:

      b673cde0-c6eb-47c0-a296-e6065aacbc6c-image.png

       
      You can try it here: play.creaticode.com/projects/680e6a9b955fd2f624f57f3e

        
       
       

      Continuous Speech Recognition

       

      As explained earlier, when you use the blocks above, no recognition is done while the user speaks. The recognition work only starts when you run “end speech recognition”. If the user talks for a long time, this may be very inconvenient, since we do not have the recognized text while the user is speaking.

      There is another way to do speech recognition continuously. In this mode, the speech of the user will be sent to the server in a stream of chunks, so the server will convert them to text while the user speaks and return them sentence by sentence.

      You can use the following 2 blocks for this new method.

       

      Start-Continuous-Speech-Recognition.png

       
      When this block is executed, the system will start listening on the microphone and streaming the audio chunks to the server. When it receives the recognized text (which also arrives in chunks), the text will be stored in the given list. Each part (usually a new sentence) will be stored as a new item in the list.

       

      3ca29306-b30e-4c17-bb78-bc713c098baf-image.png

       

      When this block runs, the system will stop listening on the microphone right away and send all remaining audio chunks to the server. It will also wait for the server to return all the remaining recognized text parts and store them in the list.

       

      Example Program 1

       

      In this simple example, we run continuous speech recognition for 10 seconds, and as the user speaks, the recognized text will be appended to the given list part by part. We no longer need to wait until we end the speech recognition to check what the user is saying.

      continuousspeech.gif

       
      You can try it here: play.creaticode.com/projects/680e547e68d8b3054e15aecd

       
      Note that we still need to decide when to stop the recognition. Since the speech is usually long when we use continuous recognition, it is probably best to allow the user to manually start and stop the recognition (example design 2 above).

       

      Example Program 2

       

      Sometimes, when we receive a newly recognized sentence, we need to display it, so the user can see what is being recognized:

      continuousrecog.gif

      We need to extract the new text from the list “parts”. One way to do this is to use a new variable, “row count,” to keep track of how many rows of text have been displayed already. So if the number of rows in the list “parts” is more than the “row count”, that means there is some new text in the list. Then, we can extract the new text and display it:

      44f9077c-1fc9-43e6-a509-44b93f25ba2a-image.png
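
      The same "row count" bookkeeping can be written out in Python. This is a sketch of the idea only, not the project's actual blocks: the helper name take_new_rows is hypothetical, and the list "parts" is filled in by hand here instead of by the recognition server.

```python
def take_new_rows(parts, row_count):
    """Return the rows not yet displayed, plus the updated row count."""
    new_rows = parts[row_count:]      # everything after what was already shown
    return new_rows, len(parts)


parts = []          # stands in for the list that receives recognized text
row_count = 0       # how many rows have been displayed so far

# Nothing has arrived yet, so there is nothing new to display.
first_batch, row_count = take_new_rows(parts, row_count)

# Two sentences arrive while the user keeps talking.
parts.append("Hello there.")
parts.append("How are you?")
second_batch, row_count = take_new_rows(parts, row_count)

# One more sentence arrives; only that one is new.
parts.append("Nice to meet you.")
third_batch, row_count = take_new_rows(parts, row_count)
```

      Comparing the list length with the counter, rather than only looking at the last item, also handles the case where more than one new row arrives between two checks.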

       
      You can try it here: play.creaticode.com/projects/680e945d955fd2f624f5b075

       
       

      Exercise

       

      Modify the project above so that the user can speak out commands like “turn left” or “turn right”, and a car sprite changes its direction accordingly.

      leftiright.gif
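
      As a hint for the matching step, here is a small Python sketch of how recognized text could be mapped to a direction change. It rests on several assumptions that are not part of the exercise: the function name apply_command is hypothetical, each command turns the car by 90 degrees, and the direction is kept in the range 0 to 359. The text is lowercased first because recognized sentences may come back capitalized or with punctuation.

```python
def apply_command(direction, text):
    """Return the new direction (in degrees) after applying a spoken command."""
    spoken = text.lower()                 # "Turn left." -> "turn left."
    if "turn left" in spoken:
        return (direction - 90) % 360     # assumed 90-degree step
    if "turn right" in spoken:
        return (direction + 90) % 360
    return direction                      # unknown command: keep the direction


direction = 90
direction = apply_command(direction, "Turn left.")
after_left = direction
direction = apply_command(direction, "please turn right")
after_right = direction
direction = apply_command(direction, "hello")
after_unknown = direction
```

      In the actual project, the same check would run each time a new item appears in the list of recognized parts.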
