This new AI can simulate your voice with just 3 seconds of audio

Microsoft’s new Vall-E speech model is reportedly capable of mimicking any voice with just a three-second sample recording.

The recently announced AI tool was trained on 60,000 hours of English speech data. Researchers said in a paper posted to arXiv, the preprint server hosted by Cornell University, that it could replicate a speaker’s emotions and tone.

Those findings reportedly held even when the model generated words that the original speaker never actually said.

“Vall-E develops in-context learning capabilities and can be used to synthesize high-quality personalized speech with just a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experiment results show that Vall-E significantly outperforms the state-of-the-art zero-shot [text-to-speech] system in terms of speech naturalness and speaker similarity,” the authors wrote. “Moreover, we find that Vall-E was able to preserve the speaker’s emotions and the acoustic environment of the acoustic prompt in synthesis.”

Microsoft Corporation booth signage is on display at CES 2023 at the Las Vegas Convention Center on January 6, 2023 in Las Vegas, Nevada.
(Photo by David Becker/Getty Images)

The Vall-E samples shared on GitHub are eerily similar to the speaker prompts, although they vary in quality.

In a synthesized sentence from the Emotional Voices Database, Vall-E sleepily says, “We need to reduce the number of plastic bags.”

However, the text-to-speech research comes with a caveat.

“Because Vall-E can synthesize speech that preserves speaker identity, it may carry potential risks of misuse of the model, such as spoofing voice identification or impersonating a specific speaker,” the researchers said on the project’s GitHub page. “We performed the experiments assuming that the user consents to be the target speaker in the speech synthesis. If the model is generalized to unseen speakers in the real world, it should include a protocol to ensure the speaker authorizes the use of their voice and a synthesized speech detection model.”

Microsoft Corp. corporate signage at the Microsoft India Development Center in Noida, India, on Friday, Nov. 11, 2022.
(Photographer: Prakash Singh/Bloomberg via Getty Images)

Vall-E, which Microsoft calls a “neural codec language model,” is not currently available to the public.