Microsoft researchers have announced a new application that uses artificial intelligence to replicate a person’s voice from just a few seconds of sample audio. The application, called VALL-E, can synthesize high-quality personalized speech using only a three-second enrollment recording of a speaker as an acoustic prompt, the researchers wrote in a paper published on arXiv, a free, open-access archive for scholarly articles.
One of the standout features of this model is that it needs only a few seconds of a speaker’s audio to mimic a voice, whereas existing state-of-the-art text-to-speech (TTS) systems can take an hour or more to train. According to the researchers, VALL-E significantly outperforms existing TTS systems in both speech naturalness and speaker similarity. VALL-E can also preserve a speaker’s emotions and acoustic environment: if a speech sample were recorded over a phone, for example, the synthesized speech in that voice would sound as if it were being read over a phone.
However, experts have raised some concerns about the technology. Unlike OpenAI, the maker of ChatGPT, Microsoft has not yet opened VALL-E to the public, so questions remain about its performance. For example, it is not yet known which factors might degrade the speech the application produces. Some experts have also raised concerns about the technology’s ability to emulate a speaker’s emotions and about the potential for abuse if it is as good as Microsoft claims.
Despite these concerns, experts also see many beneficial applications for VALL-E. Ritu Jyoti, group vice president for AI and automation at IDC, a global market research company, called VALL-E “significant and super impressive.” Jyoti cited speech editing and replacing voice actors as potential uses of the technology. Giacomo Miceli, a computer scientist, noted that the technology could be used to create editing tools for podcasters and to customize the voice of smart speakers, and that it could be incorporated into messaging systems, chat rooms, video games, and even navigation systems.
However, Miceli also pointed out that the technology has some not-so-beneficial applications. For example, a malicious user could clone the voice of a politician and make them appear to say things that sound preposterous or inflammatory, or more generally use cloned voices to spread false information or propaganda. Mark N. Vena, president and principal analyst at SmartTech Research in San Jose, California, also sees enormous potential for abuse if the technology is as good as Microsoft claims.
In conclusion, while VALL-E has the potential to revolutionize text-to-speech technology, experts are raising concerns about both its performance and its potential for abuse. More research and development are needed before the full extent of its capabilities and limitations is known.