Microsoft’s new Vall-E speech model is reportedly capable of mimicking any voice with just a three-second sample recording.
The recently announced AI model was trained on 60,000 hours of English-language speech data. In a paper posted to arXiv, the Cornell University–hosted preprint server, researchers said it could replicate a speaker’s emotions and tone — even when generating words the original speaker never actually said.
“Vall-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experiment results show that Vall-E significantly outperforms the state-of-the-art zero-shot [text-to-speech] system in terms of speech naturalness and speaker similarity,” the authors wrote. “In addition, we find Vall-E could preserve the speaker’s emotion and acoustic environment of the acoustic prompt in synthesis.”
The Vall-E samples shared on GitHub are eerily similar to the speaker prompts, although they vary in quality.
In a synthesized sentence from the Emotional Voices Database, Vall-E sleepily says, “We need to reduce the number of plastic bags.”
However, research into text-to-speech AI comes with a caveat.
“Because Vall-E can synthesize speech that preserves speaker identity, it may carry potential risks of misuse of the model, such as spoofing voice identification or impersonating a specific speaker,” the researchers note in the paper’s ethics statement. “We conducted the experiments under the assumption that the user agrees to be the target speaker in speech synthesis. If the model is generalized to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice, as well as a synthesized speech detection model.”
Currently, Vall-E, which Microsoft describes as a “neural codec language model,” is not publicly available.