Microsoft is making waves in artificial intelligence: the firm has developed a tool called “Vall-E” that can create a replica of a voice from a three-second recording. Beyond simply reproducing a voice, this AI can also reproduce emotions.
Source: Turag Photography via Unsplash
At the start of 2023, the trend is undeniably towards artificial intelligence and automatic generation tools. Microsoft, for its part, has created its own version of DALL-E 2 and would like to integrate ChatGPT into Bing to compete with Google. The company would also reportedly like to invest 10 billion dollars in OpenAI in order to integrate AI tools into the Office suite. And this busy start to the year is not over: with Vall-E, Microsoft can reproduce a human voice from just three seconds of recording.
Vall-E: Microsoft’s artificial intelligence that can reproduce a voice
A few days ago, Microsoft published a scientific paper presenting “a language modeling approach for text-to-speech synthesis”: a text-to-speech tool that does not just turn text into a robotic voice created from scratch, but into a voice modeled on a real human one. The developers say they trained their model on 60,000 hours of English speech. According to them, that is “hundreds of times more than existing systems”.
Diagram of how Vall-E works // Source: Microsoft
With these capabilities, Vall-E “can be used to synthesize high-quality personalized speech with only a 3-second recording of an unknown speaker as an acoustic prompt.” A voice can therefore be made to say words that its owner never actually spoke. In addition, the tool “can preserve the speaker’s emotion and the acoustic environment of the acoustic prompt in synthesis.”
Presumably, the more samples there are, the more accurate the recreated voice. Not all of the recordings generated and published by Microsoft are convincing, but they were produced from only three seconds of audio. With more samples, one can imagine the AI performing better.
What can this voice-cloning synthesis be used for?
In the presentation of Vall-E, some possible uses are detailed: “VALL-E directly enables various voice synthesis applications, such as TTS (text-to-speech), voice editing and content creation, in combination with other generative AI models like GPT-3”.
However, Vall-E could also be used for less honest purposes. For several years, deepfake technology has been becoming more widely accessible: it consists of modifying videos or images to attach a person’s face to a body that does not belong to them, in order to deceive. Although Vall-E is not available at the moment, Microsoft has not yet put anything in place to prevent this kind of misuse.
The developers suggest that “speech editing models should be accompanied by relevant components, including the protocol to ensure that the speaker agrees to perform the modification and the system to detect the edited speech”.
An explanatory diagram about Dall-E // Source: OpenAI
While the tool exists and the demonstrations are encouraging, Microsoft’s biggest challenge is not technical but ethical. Public figures, some of whom are already victims of deepfakes, could naturally be the most affected. Moreover, one can imagine Vall-E being combined with a deepfake video tool to create scandalous fake videos.
Vall-E could also very well be used to impersonate someone on the phone. And just as image-generation AIs threaten artists, Microsoft’s tool could endanger the jobs of many people: voice-over artists, dubbing professionals, and so on.
Everyone is in the race for generative AI
At the same time, other automatic generation tools are under development. A few weeks ago, OpenAI, the company behind ChatGPT, presented Point-E, a tool for generating 3D models. Microsoft is far from being the only tech giant in the game, since Meta can already create videos from text and Google is working hard on its own AI-based tools.
Result for “An astronaut riding a horse in a photorealistic style” // Source: OpenAI
Apple has gone even further: the company is selling a series of audiobooks read by an artificial, AI-generated narrator. In the video game High On Life, a character was even voiced by an AI.