Microsoft's AI speech generator achieves human parity but is too dangerous for the public

Microsoft's Vall-E 2 AI speech generator is too lifelike for public use due to misuse risks.

Microsoft developed Vall-E 2, a neural codec language model that reaches human parity in naturalness and speech robustness. It performed exceptionally well on the LibriSpeech and VCTK datasets, producing speech that is difficult to distinguish from a human speaker. However, due to potential misuse, Microsoft will not release it to the public.

Microsoft has developed Vall-E 2, an AI speech generator that uses neural codec language modeling to achieve human parity in naturalness, robustness, and speaker similarity. Notably, Vall-E 2 adds two techniques: grouped code modeling, which shortens the sequence the model has to handle, and repetition-aware sampling, which makes decoding more stable.
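To make those two ideas concrete, here is a minimal Python sketch. It is not Microsoft's code, and Vall-E 2 itself is not public. It assumes that grouping means splitting the codec-code sequence into consecutive fixed-size chunks, and that repetition-aware sampling means drawing a token with nucleus (top-p) sampling, then falling back to plain random sampling from the full distribution if that token already dominates a recent window of the decoding history. The function names, window size, and thresholds are all hypothetical choices for illustration.

```python
import random
from typing import List, Sequence


def group_codes(codes: Sequence[int], group_size: int) -> List[List[int]]:
    """Split a codec-code sequence into consecutive groups.

    The model then takes one step per group instead of one per code,
    so the sequence it sees is roughly len(codes) / group_size long.
    """
    return [list(codes[i:i + group_size]) for i in range(0, len(codes), group_size)]


def nucleus_sample(probs: Sequence[float], top_p: float, rng: random.Random) -> int:
    """Sample from the smallest set of tokens whose probability mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_aware_sample(
    probs: Sequence[float],
    history: Sequence[int],
    rng: random.Random,
    top_p: float = 0.9,      # hypothetical value
    window: int = 10,        # hypothetical value, must be >= 1
    max_ratio: float = 0.5,  # hypothetical value
) -> int:
    """Nucleus sampling with a fallback when the chosen token keeps repeating."""
    token = nucleus_sample(probs, top_p, rng)
    recent = list(history[-window:])
    if recent and recent.count(token) / len(recent) > max_ratio:
        # The token dominates recent history: resample from the full
        # distribution so low-probability tokens can break the loop.
        token = rng.choices(list(range(len(probs))), weights=list(probs), k=1)[0]
    return token
```

With a group size of 2, for example, a 6-code sequence becomes 3 steps. The fallback branch only triggers when the sampled token makes up more than `max_ratio` of the last `window` tokens, which is the situation in which a decoder risks getting stuck repeating itself.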

Vall-E 2 was evaluated on the LibriSpeech and VCTK datasets, where it outperformed the ground-truth samples, a result that points to how lifelike its output is. Microsoft showcased dozens of audio samples demonstrating the tool's ability to closely mimic human speech, including subtle nuances such as emphasis.

Despite its impressive performance, Microsoft deems Vall-E 2 too risky for public release, citing concerns about misuse, such as voice impersonation and spoofing. Although it will not be available to consumers, the company foresees potential applications in fields like education, translation, accessibility, and journalism.