OpenAI’s latest text-to-speech model, Voice Engine, can replicate any voice using only a brief 15-second audio clip
OpenAI has introduced Voice Engine, a new text-to-speech generative AI model capable of accurately replicating any voice using just a 15-second audio sample. Initially, the model is available to a limited number of users, including select international partners in sectors such as government, media, entertainment, and education.
OpenAI’s text-to-speech model is reportedly applicable in numerous real-world scenarios. These include reading assistance, content translation, audio generation, reaching global audiences, supporting non-verbal individuals, helping patients recover their voices, and more.
In an official blog post, OpenAI stated, “Today we are sharing initial insights and findings from a limited-scale preview of a model named Voice Engine. This model utilizes text input and a single 15-second audio sample to produce speech that sounds natural and closely resembles the original speaker. It is remarkable that such a small model, with just a single 15-second sample, can create expressive and lifelike voices.”
OpenAI emphasizes that Voice Engine is a compact model, first developed in 2022, that already powers the preset voices available in the text-to-speech API, ChatGPT Voice, and Read Aloud. To mitigate potential misuse, the company is reportedly proceeding with a “careful and knowledgeable strategy towards wider availability.” OpenAI has also provided examples of output generated by the Voice Engine model.
Before releasing the model to the public, OpenAI is evaluating several factors. These include developing policies to protect the use of individuals’ voices in AI, educating the public about the capabilities and limitations of AI, and integrating technologies that help users distinguish genuine voices from AI-generated ones.