Introduction
Large language models like ChatGPT are generally “multi-modal,” which means they can not only chat with us in words but also understand images. This is very useful because, in many cases, it is much easier to describe what we need with a picture than with words.
In this tutorial, you will build an AI assistant that can see and talk. To use it, the user just needs to take a picture using the camera and then ask the AI assistant any questions about that picture.
448b6d33-2ead-4e45-8773-cc0881336bd5-image.png
Step 1 - Create a new project
On CreatiCode.com, log in and create a new project. Remove the dog sprite, and rename the project to “AI Assistant”.
Step 2 - Search for or generate a backdrop
When the project starts, we will show a nice backdrop to the user to indicate this is an AI assistant.
Switch to the stage and try to add a new backdrop. For example, if you search “a helpful AI assistant”, you can find many interesting backdrops to choose from:
b2b538a9-48e3-480c-b4f4-e8e4db00b7d7-image.png
You can also generate a new backdrop based on your own idea. For example, suppose we want this assistant to serve as a tour guide; we can then generate the backdrop from a detailed description like this:
a robot tour guide facing the viewer with both arms open, standing in front of a historical site, cartoon style
We will get a new backdrop like this:
[generated backdrop image]
Step 3 - Show 2 buttons when the green flag is clicked
Now switch to the empty sprite and add code to show 2 buttons, which will allow the user to pick a camera. The front camera is usually available on touchpads or Chromebooks, while the back camera is usually available on iPads or smartphones.
6b93ef86-699b-4461-93d8-9dc7dbfb28b9-image.png
You can use the following code:
92fe6a32-0f07-4d2c-9ced-bdab9f8d1879-image.png
Step 4 - Show camera preview
When either button is clicked, we need to show the view from the camera so that the user can aim at the object or place. The only difference between the two buttons is whether to show the “front” or “back” camera. Since only one of the buttons will be clicked, we can name the camera widget “camera1” in both cases.
0ede859c-0d0f-49c5-8093-825d2e2a97d4-image.png
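Conceptually, the two buttons share all of their logic except the facing direction. The following Python sketch is purely illustrative (CreatiCode uses visual blocks, not Python); the names “button1”, “button2”, and “camera1” come from this tutorial, while the `camera_settings` helper is hypothetical:

```python
def camera_settings(button_clicked):
    """Map the clicked button to the camera facing mode.

    Both buttons open the same widget ("camera1"); only the
    facing direction ("front" vs "back") differs.
    """
    facing = "front" if button_clicked == "button1" else "back"
    return {"widget": "camera1", "facing": facing}
```

Because the widget name is the same in both branches, all later steps (taking the picture, deleting the preview) can refer to “camera1” without caring which button was clicked.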
Step 5 - Prepare to take picture
Besides showing the camera preview, we also need to do some other work to prepare to take a picture. Since the work is the same no matter which camera is used, we can define the common work in a new custom block “prepare to take picture”:
8deb6639-60a8-497a-b4f7-fe4897bd642b-image.png
In that new block’s definition, we will add more blocks to delete the existing buttons (button1 and button2), and add a third button for taking the picture:
ce8bd9ac-6500-4a4c-88ec-2d9649616ecf-image.png
It will look like this (suppose we want to take a picture of a USB stick):
6cbf3772-ae0f-41b7-a5ed-47722983390d-image.png
Step 6 - Take a picture and show it
When the user clicks the “Take a picture” button, we will save the current camera view as a costume image named “c” and show it to the user:
500a1014-6d75-4f29-bb07-0248f5b5ad1a-image.png
31dd923b-69e8-4462-a8d1-dc8fbe0f545d-image.png
This way, the camera preview and button3 will both be removed, and we will simply show the newly captured costume image.
Step 7 - Prepare for the user question
Next, add a few more blocks below. They will add a new button that turns on speech recognition, so the user can ask a question by voice. This is much more convenient than asking the user to type the question, especially on mobile devices. We will also add a textbox to display the recognized question. The textbox’s background has a transparency of 30%, so the user can still see the picture they have taken.
3e36e3a2-82c6-4334-8ce3-4deb6a98d7e4-image.png
For example, it will look like this:
6d2508ce-6e56-479b-ad1d-ea953ee611cb-image.png
Step 8 - Recognize the user question
When the user clicks “Ask a question”, we will turn on speech recognition and allow the user to talk for up to 8 seconds. This should be long enough for most questions, but you can make this window shorter or longer. We will then display the recognized text in the textbox as a question:
3249a6ef-db5c-49fe-b819-dd234954bd6b-image.png
You can also use “continuous speech recognition” and stop recognizing when the user has completed a full sentence. To keep things simple, we will just use the time-based cutoff.
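To make the time-based cutoff concrete, here is an illustrative Python sketch (not CreatiCode code); the `recognize_with_cutoff` helper and its one-chunk-per-second timing are simplifying assumptions:

```python
def recognize_with_cutoff(spoken_chunks, max_seconds=8, chunk_seconds=1):
    """Accumulate recognized text until the time budget runs out.

    spoken_chunks simulates words arriving from a recognizer,
    roughly one chunk every chunk_seconds.
    """
    words = []
    elapsed = 0
    for chunk in spoken_chunks:
        if elapsed >= max_seconds:
            break  # time-based cutoff: stop listening even mid-sentence
        words.append(chunk)
        elapsed += chunk_seconds
    return " ".join(words)
```

With the default 8-second window, a short question like “what is this?” fits easily; to allow longer questions, you would simply raise `max_seconds`, just as you would change the wait time in the block above.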
To test it, click the “Ask a question” button, and say your question, such as “what is this?”, and then it will be recognized and displayed in the textbox:
9e6a4111-0c02-42e2-bcd6-ba9155e7e12c-image.png
Step 9 - Ask ChatGPT a question about this image
Now we are finally ready to ask ChatGPT to answer the user’s question based on the image. This requires 2 blocks working together:
fca5b179-ca34-4296-8ff6-13d546c30698-image.png
Attach the costume image to the chat: this step will not send the image to ChatGPT yet. It only stores the image as part of the chat. You can attach more than one image to a chat session, but for this project we just need to attach one image.
Send a chat message to ChatGPT: this block will send out the prompt together with the image attached above. We will use a simple prompt: “Answer verbally in 50 words:\n”. The keyword “verbally” ensures ChatGPT’s answer is conversational and not too formal. We are also limiting it to within 50 words to avoid lengthy answers.
After these 2 blocks run, ChatGPT’s answer will be stored in the variable “result”.
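Conceptually, the message sent in this step is just the fixed instruction prefixed to the user’s recognized question. Here is a hypothetical Python sketch of that prompt assembly (the `build_prompt` helper is illustrative, not a CreatiCode API):

```python
# The fixed instruction from the tutorial: "verbally" keeps the tone
# conversational, and the 50-word limit avoids lengthy answers.
PROMPT_PREFIX = "Answer verbally in 50 words:\n"

def build_prompt(question):
    """Combine the fixed instruction with the recognized question."""
    return PROMPT_PREFIX + question
```

For example, if the recognized question is “what is this?”, the full prompt sent with the attached image would be “Answer verbally in 50 words:\nwhat is this?”.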
Step 10 - Display and say the answer
We will not only show the answer in the textbox but also convert it to speech. This way, users can simply listen to the answer, and only need to read the text if they miss a detail.
12782a50-c239-45ec-82e4-0dff8ee2d06c-image.png
The answer will look like this:
b3ee2f44-e4e1-4720-a50a-622a83a559dd-image.png
Extra Challenges
This project demonstrates how to combine many useful AI tools into one simple app, but it is kept simple intentionally. Here are some ideas you can explore to enhance this tool further:
Handle follow-up questions: Allow the user to ask more questions. Make sure the costume image is not attached again for every question, and that ChatGPT is in “continue” mode, so it has the context of earlier questions.
Avoid the fixed wait time: Instead of always waiting 8 seconds, you can automatically detect when the user has finished asking the question. Alternatively, you can use 2 buttons: one to start speech recognition and another to stop it.
Translate it to another language: You can change the tool to your native language if it is not English.
Customize the prompt: The current prompt is very short and generic. Suppose you want to change this tool to a “homework helper”, then you may need to add more instructions in the prompt. For example, you may want to tell ChatGPT that it should only provide hints and never reveal answers directly.
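For the follow-up-questions challenge, the key idea is a one-time flag so the costume image is attached to the chat exactly once. A hypothetical Python sketch of that guard (the `session` dictionary and `should_attach_image` helper are illustrative, not CreatiCode blocks):

```python
def should_attach_image(session):
    """Return True only on the first question, so the costume
    image is attached to the chat session exactly once."""
    if session.get("image_attached"):
        return False  # already attached on an earlier question
    session["image_attached"] = True
    return True
```

In the block version, this could be a variable set to 1 after the first question, checked before the “attach costume image” block runs.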