An assistant that can see and talk (Difficulty: 3)

CreatiCode

Introduction

Many large language models today are multimodal: they can not only chat using words but also understand images.
This is incredibly useful, because sometimes it’s much easier to show a picture than to describe something with words!

In this tutorial, you will create an AI assistant that can see and talk.
The user just needs to take a picture with the camera and then ask the AI any question about that picture!

Step 1 - Create a new project

On CreatiCode.com, log in and create a new project. Remove the dog sprite, and rename the project to “AI Assistant”.

Step 2 - Search for or generate a backdrop

When the project starts, we want a cool backdrop to show that this is an AI assistant.

Switch to the Stage and add a new backdrop.
Search for something like “a helpful AI assistant” to find interesting designs:

You can also generate a new backdrop based on your own idea. For example, if we want this assistant to act as a tour guide, we can generate the backdrop from a detailed description like this:

a robot tour guide facing the viewer with both arms open, standing in front of a historical site, cartoon style

And you might get a result like this:


Step 3 - Show 2 buttons when the green flag is clicked

Now switch to the empty sprite.

We’ll add two buttons so the user can pick a camera:

Front camera (for laptops/touchpads/Chromebooks)
Back camera (for iPads/phones)

You can use the following code:

Step 4 - Show camera preview

When a button is clicked, the camera view will show up so the user can aim at the object or scene.

The only difference between the two buttons is whether you use the front or back camera.
We’ll use the same camera widget name (“camera1”) for both options.

Step 5 - Prepare to take a picture

Besides showing the camera, we also need to get ready for the user to take a picture.

To avoid repeating code, let’s make a custom block called “prepare to take picture”:

Inside this custom block:

Delete the two old buttons.
Add a new button to take a picture.
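The project itself is built from blocks, but the idea of a shared custom block can be sketched in Python. This is only a rough model: the stage is a plain dictionary, and every name except “camera1” (the button names and the handler functions) is made up for illustration.

```python
# Hypothetical stand-in for the stage: widget name -> widget kind.
widgets = {}

def show_start_buttons():
    """Step 3: offer the two camera choices."""
    widgets["front_button"] = "button"
    widgets["back_button"] = "button"

def prepare_to_take_picture():
    """Shared steps, written once and called by both button handlers."""
    # Delete the two old buttons (a default avoids errors on a repeat call).
    widgets.pop("front_button", None)
    widgets.pop("back_button", None)
    # Add the new button that takes the picture.
    widgets["take_picture_button"] = "button"

def on_front_camera_clicked():
    widgets["camera1"] = "camera (front)"
    prepare_to_take_picture()

def on_back_camera_clicked():
    widgets["camera1"] = "camera (back)"
    prepare_to_take_picture()

show_start_buttons()
on_back_camera_clicked()
print(sorted(widgets))  # ['camera1', 'take_picture_button']
```

Both handlers differ only in which camera they open, so everything else lives in one place and only needs to be changed once.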

Now the stage will look like this when getting ready to snap a photo:

Step 6 - Take a picture and show it

When the user clicks the “Take a picture” button, we will save the current camera view as a costume image named “c”. We will also remove all the widgets (the camera view and the button) so that the newly captured costume image is shown to the user:

Step 7 - Prepare for the user question

After taking the picture, add more blocks to:

Create a new button for the user to ask a question using speech.
Add a textbox to show the recognized question.

Make the textbox background 30% transparent so the captured costume image stays visible behind it.

Result:

Step 8 - Recognize the user’s question

When the user clicks “Ask a Question”:

Start speech recognition for 8 seconds, which should be long enough for most questions.
Show the recognized text inside the textbox.

You can also use “continuous speech recognition” and stop recognition once the user has completed a full sentence. To keep it simple, we will just use the fixed 8-second cutoff.
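To compare the two approaches, here is a small Python sketch that works on a pretend stream of (time in seconds, word) pairs instead of a real microphone. The function names, the sample words, and the idea of treating a pause longer than 1.5 seconds as the end of the sentence are all assumptions made for illustration, not how the speech blocks are implemented.

```python
def recognize_timed(words, limit_seconds=8.0):
    """Keep every word heard before the fixed time limit."""
    return " ".join(word for t, word in words if t < limit_seconds)

def recognize_until_pause(words, max_pause=1.5):
    """Stop at the first silence longer than max_pause seconds."""
    heard = []
    last_time = None
    for t, word in words:
        if last_time is not None and t - last_time > max_pause:
            break  # the speaker went quiet, so treat the sentence as finished
        heard.append(word)
        last_time = t
    return " ".join(heard)

# Made-up sample: a quick question, then stray words later on.
stream = [(0.5, "what"), (0.9, "is"), (1.3, "this"), (6.0, "um"), (9.0, "thanks")]

print(recognize_timed(stream))        # what is this um
print(recognize_until_pause(stream))  # what is this
```

The timed version is simpler but can pick up stray words or make the user wait; the pause-based version ends as soon as the user stops talking.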

To test it, click the “Ask a Question” button and ask something like “what is this?”. The question will be recognized and displayed in the textbox:

Step 9 - Ask AI a question about this image

Finally, we can send the picture and the question to the AI!

You’ll need two blocks (the LLM block is wide, so it’s shown in two rows):

Here is how it works:

Attach the costume image “c” to the chat: this step will not send the image to the AI (LLM) yet. It only stores the image as part of the chat. You can attach more than one image to a chat session, but for this project, we only need to attach one image.
Send a chat message to AI (LLM): this block sends the prompt together with the image attached above. We will use a simple prompt: “Answer verbally in 50 words:\n”. The keyword “verbally” nudges the AI toward a conversational answer that is not too formal, and the 50-word limit helps avoid lengthy answers. Note that you must use the “LLM model” block instead of the “OpenAI ChatGPT” block to send images.

After these 2 blocks run, the AI’s answer will be stored in the variable “result”.
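The two-step flow (attach first, send later) can be sketched in Python as below. The `Chat` class and `fake_llm` function are hypothetical stand-ins, not the real blocks, and joining the prompt with the recognized question is an assumption based on the prompt ending in a line break.

```python
PROMPT = "Answer verbally in 50 words:\n"

class Chat:
    """Hypothetical chat session: images wait here until a message is sent."""

    def __init__(self):
        self.attached = []  # images stored with the chat
        self.sent = []      # messages that actually went out

    def attach_image(self, costume_name):
        # Nothing is sent yet; the image is only stored as part of the chat.
        self.attached.append(costume_name)

    def send(self, text):
        # The text goes out together with every attached image.
        message = {"text": text, "images": list(self.attached)}
        self.sent.append(message)
        return message

def fake_llm(message):
    """Placeholder reply; a real model would look at the image and the text."""
    return f"(answer about {len(message['images'])} image(s))"

chat = Chat()
chat.attach_image("c")
sent_so_far = len(chat.sent)  # still 0: attaching alone sends nothing
message = chat.send(PROMPT + "what is this?")
result = fake_llm(message)    # plays the role of the "result" variable
print(sent_so_far, message["images"], result)
```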

Step 10 - Display and say the answer

Once the AI responds:

Show the answer in the textbox.
Speak the answer out loud!

Also, make sure to stop any earlier speech when the user asks a new question.
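The order of these actions matters, so here is a rough Python sketch of it. The `Speaker` class, the handler names, and the sample answers are invented for illustration; the real project uses text-to-speech blocks instead.

```python
class Speaker:
    """Hypothetical text-to-speech stand-in that records what it was told to do."""

    def __init__(self):
        self.speaking = None
        self.events = []

    def speak(self, text):
        self.speaking = text
        self.events.append(("speak", text))

    def stop(self):
        # Only record a stop when something is actually being spoken.
        if self.speaking is not None:
            self.events.append(("stop", self.speaking))
        self.speaking = None

def on_ask_clicked(speaker):
    # A new question starts: silence any earlier answer first.
    speaker.stop()

def on_answer(speaker, textbox, result):
    textbox["value"] = result  # show the answer in the textbox
    speaker.speak(result)      # then speak it out loud

speaker = Speaker()
textbox = {"value": ""}
on_ask_clicked(speaker)  # first question: nothing to stop yet
on_answer(speaker, textbox, "It is a red bicycle.")
on_ask_clicked(speaker)  # second question: the first answer is cut off
on_answer(speaker, textbox, "It has two wheels.")
print(speaker.events)
```

Without the stop in `on_ask_clicked`, a long earlier answer could still be playing while the user tries to ask the next question.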

The answer will look like this:

Additional Challenges

This project demonstrates how to combine many useful AI tools into one simple app, but it is kept simple intentionally. Here are some ideas you can explore to enhance this tool further:

Handle follow-up questions:
Let users keep asking more questions about the same picture.
Be careful not to re-attach the image with every question. Set the AI to “continue” mode for a smoother conversation.
Smarter speech recognition:
Instead of waiting exactly 8 seconds, detect when the user finishes talking, or use start/stop buttons.
Translate the assistant:
Make it work in your native language!
Customize the AI’s behavior:
Adjust the prompt to give hints instead of direct answers (for example, for a homework helper version).
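As a starting point for the follow-up questions challenge, here is one possible way to think about it in Python. The dictionary-based chat and the `ask` function are hypothetical; they only model the idea of attaching the picture once and keeping earlier turns, which is what “continue” mode is meant to achieve.

```python
PROMPT = "Answer verbally in 50 words:\n"

def new_chat():
    """Hypothetical chat state: attached images plus the questions so far."""
    return {"images": [], "history": []}

def ask(chat, question, costume="c"):
    """Add one question, attaching the picture only the first time."""
    if costume not in chat["images"]:
        chat["images"].append(costume)  # first question: attach once
    # Later questions are appended, so earlier turns stay in the chat.
    chat["history"].append(PROMPT + question)
    return chat

chat = new_chat()
for question in ["what is this?", "how old is it?", "who built it?"]:
    ask(chat, question)

print(chat["images"])        # ['c']
print(len(chat["history"]))  # 3
```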