Microsoft unveils VALL-E, a text-to-speech AI that can mimic a voice from seconds of audio
Microsoft Corp. today gave us a peek at a text-to-speech artificial intelligence tool it’s been working on that can apparently simulate a voice after listening to just three seconds of an audio sample.
The company said that this tool, VALL-E, can keep the emotional tone of the speaker for the rest of the message while also simulating the acoustics of the room from which it first heard the voice. Not only can it do this from a short audio sample – which is unheard of so far – but Microsoft is saying no other AI model can sound as natural.
Voice simulation is nothing new, and it has not always been used for the best of reasons. The concern is that as this kind of AI improves, audio deepfakes will become more convincing. At the moment, it’s impossible to know just how good VALL-E is since Microsoft has not released the tool to the public, although it has provided samples of the work that’s been done. The samples are frankly very impressive if, indeed, the mimicry took only three seconds of audio and the voice could go on to speak for any length of time.
If it’s as good as Microsoft says it is and can quickly sound as human as a human, charisma and all, you can see why Microsoft wants to invest heavily in the AI that has just taken the world by storm, OpenAI LLC’s ChatGPT. If the two were combined, people asking questions on the phone at call centers might not be able to distinguish a human from a robot. Together, the tools might also be able to create what seems like a podcast, except the guest is not real.
A powerful tool that can perfectly mimic someone’s voice after just a few seconds is concerning. In the hands of the wrong people, it could be used to spread misinformation, mimicking the voices of politicians, journalists, or celebrities. It seems Microsoft is well aware of the potential misuse.
“Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker,” Microsoft said at the conclusion of the paper. “To mitigate such risks, it is possible to build a detection model to discriminate whether an audio clip was synthesized by VALL-E. We will also put Microsoft AI Principles into practice when further developing the models.”
Photo: Volodymyr Hryshchenko/Unsplash