The latest ChatGPT upgrade allows its AI to “see, hear, and speak,” according to OpenAI.

OpenAI released a significant update to ChatGPT on Monday, allowing its GPT-3.5 and GPT-4 AI models to analyze images and respond to them as part of a text conversation. According to OpenAI, the ChatGPT mobile app will also add speech synthesis features that, combined with its existing speech recognition capabilities, will enable fully spoken conversations with the AI assistant.
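The app handles that loop end to end, but the round trip is straightforward to picture: transcribe the user’s speech to text, send the text to the model, then synthesize the reply as audio. Below is a minimal Python sketch of that pipeline using OpenAI’s developer audio endpoints (whisper-1 for transcription, tts-1 for synthesis) as stand-ins; those model names are our assumptions for illustration, and the ChatGPT app’s actual internal pipeline has not been disclosed.

```python
# Sketch of a spoken-chat round trip: speech -> text -> model -> speech.
# Model names here are assumptions for illustration; the ChatGPT app's
# actual internal pipeline has not been made public.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Speech recognition: transcribe the user's recorded question.
with open("question.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1", file=audio_file
    )

# 2. Chat: send the transcribed text to the model.
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
answer = reply.choices[0].message.content

# 3. Speech synthesis: voice the reply and save it as audio.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
speech.stream_to_file("answer.mp3")
```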

OpenAI plans to roll these features out to ChatGPT Plus and Enterprise subscribers “over the next two weeks.” It also notes that speech synthesis will be available only on iOS and Android, while image recognition will be available on both the web interface and the mobile apps.

According to OpenAI, ChatGPT’s new image recognition feature lets users upload one or more images for conversation, using either the GPT-3.5 or GPT-4 model. The company’s promotional blog post says the feature can serve a range of everyday purposes, from figuring out what’s for dinner by photographing the fridge and pantry to diagnosing why a grill won’t start. It also states that users can draw on their device’s touch screen to circle the parts of an image they want ChatGPT to focus on.
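That feature lives in the ChatGPT app’s UI, but for a sense of how image-plus-text prompting looks programmatically, here is a sketch against OpenAI’s chat completions API. It assumes developer access to a vision-capable model; the model name and image URL below are placeholders, not part of OpenAI’s announcement.

```python
# Sketch: sending an image alongside text through OpenAI's chat completions
# API. Assumes developer access to a vision-capable model; the model name
# and image URL below are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # placeholder vision-capable model
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Why won't this grill start?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/grill.jpg"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```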

On its website, OpenAI features a promotional video depicting a hypothetical exchange with ChatGPT in which a user asks how to raise a bicycle seat, supplying photographs, an instruction booklet, and an image of the user’s toolbox. ChatGPT responds and walks the user through the process. We have not tested the feature ourselves, so its real-world effectiveness is unknown.

So, how does it work? OpenAI has not released technical details about how GPT-4 or its multimodal version, GPT-4V, works under the hood, but based on previous AI research (including work from OpenAI partner Microsoft), multimodal AI models typically transform text and images into a shared encoding space, allowing them to process several types of data through the same neural network. OpenAI could be using CLIP, its own image-text model, to bridge the gap between visual and textual data by aligning image and word representations in the same latent space, forming a kind of vectorized web of links between the two modalities. That technique could allow ChatGPT to make contextual deductions across text and images, though this is speculative on our part.
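To make the shared-embedding-space idea concrete, here is a minimal sketch using CLIP itself via the Hugging Face transformers library. It encodes one image and several candidate captions into the same latent space and scores their similarity; this illustrates the general technique only, not GPT-4V’s undisclosed architecture. The example image URL and captions are arbitrary stand-ins.

```python
# CLIP-style shared embedding space: encode an image and candidate captions
# into the same latent space and compare them. Illustrates the general
# technique only; GPT-4V's internals are undisclosed.
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Any photo works; this URL is just a stand-in example image.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

captions = ["a photo of two cats", "a photo of a bicycle", "a photo of a grill"]
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-caption similarity scores in the shared space.
probs = outputs.logits_per_image.softmax(dim=1)
for caption, p in zip(captions, probs[0]):
    print(f"{caption}: {p.item():.3f}")
```

Because both modalities land in one vector space, the highest-scoring caption is simply the text whose embedding sits closest to the image’s, which is the kind of cross-modal link the paragraph above describes.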

In its announcement, OpenAI acknowledges several limitations of ChatGPT’s expanded features, ranging from the potential for visual confabulations (i.e., misidentifying something) to the vision model’s imperfect handling of non-English languages. The company says it conducted risk assessments “in domains such as extremism and scientific proficiency” and sought input from alpha testers, but it still advises caution.

