OpenAI’s ChatGPT is receiving major updates that add voice conversations and image-based queries. Users can now hold voice conversations with ChatGPT on Android and iOS, and feed images to the chatbot on all platforms. These new features are currently available to Plus and Enterprise users, with wider access to the image-based functions expected in the future.
To try out voice conversations, users must opt in to this feature in the ChatGPT app by navigating to Settings and then New Features. By tapping the microphone button, users can select from five different voices to interact with ChatGPT.
OpenAI powers the back-and-forth voice conversations with a new text-to-speech model capable of generating “human-like audio from just text and a few seconds of sample speech.” The five voices were created with the help of professional voice actors. In the other direction, OpenAI’s open-source Whisper speech recognition system converts users’ spoken words into text.
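For developers who want to experiment with a similar speech-in, speech-out loop, OpenAI’s public API exposes comparable building blocks, though it is not the app’s actual pipeline. The sketch below is only an illustration: it assumes the openai Python package, an API key in the environment, a local audio file named question.m4a, and illustrative model and voice names (whisper-1, gpt-4, tts-1, alloy).

```python
# Illustrative voice round trip: Whisper transcription -> chat reply -> text-to-speech.
# File names, model names, and the voice are assumptions, not ChatGPT's internals.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Transcribe the user's spoken question with Whisper.
with open("question.m4a", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2. Get a text reply from a chat model.
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
answer = reply.choices[0].message.content

# 3. Synthesize the reply as audio with the text-to-speech endpoint.
speech = client.audio.speech.create(
    model="tts-1",   # illustrative TTS model name
    voice="alloy",   # one of the preset voices
    input=answer,
)
with open("answer.mp3", "wb") as out:
    out.write(speech.content)
```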
The image-based functions of ChatGPT are also quite intriguing. OpenAI states that users can show the chatbot a photo of their grill and inquire why it won’t start, seek assistance in meal planning based on a picture of their refrigerator contents, or prompt the chatbot to solve a math problem by capturing an image of it. It’s interesting to note that Microsoft recently showcased its Copilot AI’s ability to solve math problems at the Surface event.
OpenAI leverages GPT-3.5 and GPT-4 to power the image recognition features in ChatGPT. To utilize the chatbot’s image-based functions, users simply need to tap the photo button (iOS users may need to tap the plus button first) to capture a photo or select an existing image on their device. ChatGPT supports multiple photos, and users can also use a drawing tool to focus on specific parts of an image.
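Developers can approximate the same workflow through OpenAI’s API by attaching an image to a chat request. The snippet below is a hedged sketch rather than the app’s internal code: the model name gpt-4-vision-preview, the file grill.jpg, and the prompt are assumptions chosen for illustration.

```python
# Illustrative example of asking a vision-capable GPT-4 model about a local photo.
# The model name, file path, and question are placeholders, not ChatGPT's own code.
import base64
from openai import OpenAI

client = OpenAI()

# Encode a local photo (e.g. of a grill that won't start) as base64.
with open("grill.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed vision-capable model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Why won't this grill start?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)
```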
In announcing these updates, OpenAI acknowledges the potential for harm the technology poses: bad actors could use it to mimic the voices of public figures or ordinary people, opening the door to fraud. For that reason, OpenAI says it is limiting the new text-to-speech model primarily to voice conversations in ChatGPT, while also collaborating with select partners on other limited use cases. The company has also published a paper on the safety properties of the image-based functionality, which it refers to as GPT-4 with vision.
ChatGPT understands English text in images better than text in other languages, so OpenAI advises non-English users not to rely on ChatGPT for text in images for the time being, particularly for languages that use non-Roman scripts.
Meanwhile, Spotify has joined forces with OpenAI to leverage the voice-based technology for an intriguing purpose. Spotify is piloting a tool called Voice Translation for podcasters, which can translate podcasts into different languages using the voices of the podcast participants. This innovative tool retains the speech characteristics of the original speaker, even after the conversion to other languages. Initially, Spotify is converting select English-based shows into several languages, with Spanish versions of certain “Armchair Expert” and “The Diary of a CEO with Steven Bartlett” episodes already available, and French and German versions to follow.
With these voice and image capabilities, ChatGPT is expanding beyond text-only conversation and finding applications in new contexts, from podcasts to everyday user interactions. The continued development and deployment of such technology also raises important questions about potential misuse and the need for responsible implementation to limit adverse consequences.