Microsoft AI Team Unveils NaturalSpeech 2: A Cutting-Edge TTS System with Latent Diffusion Models for Powerful Zero-Shot Voice Synthesis and Enhanced Expressive Prosodies

The goal of text-to-speech (TTS) is to generate high-quality, diverse speech that sounds as if real people spoke it. Prosody, speaker identity (such as gender, accent, and timbre), speaking and singing styles, and more all contribute to the richness of human speech. TTS systems have improved greatly in intelligibility and naturalness as neural networks and deep learning have progressed; some systems (such as NaturalSpeech) have even reached human-level voice quality on single-speaker, recording-studio benchmark datasets.

Previous recording-studio datasets, limited to a small number of speakers, lack the diversity needed to capture the wide variety of speaker identities, prosodies, and styles in human speech. Using few-shot or zero-shot techniques, however, TTS models can be trained on a large corpus to learn these variations and then generalize to unseen scenarios. Today's large-scale TTS systems commonly quantize the continuous speech waveform into discrete tokens and model those tokens with autoregressive language models.

New research by Microsoft introduces NaturalSpeech 2, a TTS system that uses latent diffusion models to produce expressive prosody, good robustness, and, most crucially, strong zero-shot capacity for voice synthesis. The researchers began by training a neural audio codec whose encoder transforms a speech waveform into a sequence of continuous latent vectors and whose decoder restores the original waveform from them. They then use a diffusion model to generate these latent vectors, conditioned on prior vectors obtained from a phoneme encoder, a duration predictor, and a pitch predictor.
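The pipeline described above can be sketched end to end with toy stand-ins. Everything here is an illustrative assumption, not the paper's implementation: the "codec" is a fixed random linear map, the phoneme/pitch conditioning is a random embedding table, and the reverse diffusion is a simple annealed denoising loop toward the conditioning prior.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 40, 8  # frames and latent dimension (illustrative sizes)

# --- Neural audio codec (stand-in): a fixed invertible linear map ---
W_enc = rng.standard_normal((D, D))
W_dec = np.linalg.inv(W_enc)  # decoder exactly inverts the encoder

def codec_encode(frames):
    """Codec encoder: waveform frames -> continuous latent vectors."""
    return frames @ W_enc

def codec_decode(latents):
    """Codec decoder: continuous latent vectors -> waveform frames."""
    return latents @ W_dec

# --- Conditioning (stand-in for phoneme encoder + duration/pitch predictors) ---
phoneme_table = rng.standard_normal((100, D))  # toy phoneme embeddings

def condition(phoneme_ids, pitch):
    """Toy prior: frame-level phoneme embeddings shifted by a pitch value."""
    return phoneme_table[phoneme_ids] + pitch[:, None]

# --- Latent diffusion (toy reverse process) ---
def diffusion_generate(prior, steps=50):
    z = rng.standard_normal(prior.shape)        # start from pure noise
    for s in range(steps):
        noise_scale = 1.0 - (s + 1) / steps     # anneal injected noise to zero
        z = z + 0.3 * (prior - z)               # denoise toward the prior
        z = z + 0.05 * noise_scale * rng.standard_normal(prior.shape)
    return z

phonemes = rng.integers(0, 100, size=T)  # frame-level phoneme ids
pitch = rng.standard_normal(T)           # frame-level pitch values
prior = condition(phonemes, pitch)
latents = diffusion_generate(prior)      # diffusion generates continuous latents
waveform_frames = codec_decode(latents)  # codec decoder maps latents to audio
```

The key structural point the sketch mirrors is that the diffusion model operates on continuous latent vectors from the codec, not on discrete tokens, and is steered by the phoneme/duration/pitch prior.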

The following are examples of design decisions that are discussed in their paper:

  • In prior works, speech is typically quantized with numerous residual quantizers to guarantee the quality of the neural codec’s speech reconstruction. This heavily burdens the acoustic model (an autoregressive language model) because the resulting discrete token sequence is very long. The team instead employs continuous vectors, which shorten the sequence and preserve more fine-grained information for accurate speech reconstruction.
  • Replacing autoregressive models with diffusion ones.
  • In-context learning through speech prompting mechanisms. The team developed speech prompting mechanisms for the diffusion model and the pitch/duration predictors, improving zero-shot capacity by encouraging the diffusion model to follow the characteristics of the speech prompt.
  • NaturalSpeech 2 is more reliable and stable than its autoregressive predecessors because it requires only a single acoustic model (the diffusion model) instead of two-stage token prediction. Moreover, its duration/pitch prediction and non-autoregressive generation let it extend to styles beyond speech, such as a singing voice.
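The first design decision above is easy to see with back-of-the-envelope arithmetic. In a sketch like the following (the frame rate and number of residual quantizers are illustrative assumptions, not the paper's figures), a discrete codec emits one token per residual quantizer per frame, so the flattened sequence an autoregressive model must handle grows by that factor, while the continuous-vector route keeps one latent per frame:

```python
# Toy sequence-length comparison: discrete residual-quantizer tokens
# vs. one continuous latent vector per frame. All numbers are assumed
# for illustration, not taken from the NaturalSpeech 2 paper.

frames_per_sec = 75          # codec frame rate (assumed)
n_residual_quantizers = 8    # residual VQ stages in a discrete codec (assumed)
seconds = 10                 # length of the utterance

# Discrete-token route: each frame yields one token per quantizer stage,
# and the autoregressive LM models the flattened token sequence.
discrete_len = seconds * frames_per_sec * n_residual_quantizers

# Continuous-vector route: one latent vector per frame.
continuous_len = seconds * frames_per_sec

print(discrete_len, continuous_len)
```

Under these assumed numbers the discrete sequence is 8x longer (6,000 tokens vs. 750 vectors for 10 seconds of audio), which is the burden the continuous-latent design avoids.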

To demonstrate the efficacy of these design choices, the researchers trained NaturalSpeech 2 with 400M model parameters on 44K hours of speech data. They then used it to create speech in zero-shot scenarios (with only a few seconds of speech prompt) with various speaker identities, prosodies, and styles (e.g., singing). The findings show that NaturalSpeech 2 outperforms prior powerful TTS systems in experiments and generates natural speech in zero-shot conditions. It achieves prosody more similar to that of the speech prompt and the ground-truth speech, and comparable or better naturalness (in terms of CMOS) than the ground-truth speech on the LibriTTS and VCTK test sets. The experimental results also show that it can generate singing voices in a novel timbre from a short singing prompt or, interestingly, from only a speech prompt, unlocking truly zero-shot singing synthesis.

In the future, the team plans to investigate efficient methods, such as consistency models, to accelerate the diffusion model, and to explore large-scale speaking and singing voice training to enable more potent mixed speaking/singing capabilities.

Check out the Paper and Project Page. Don’t forget to join our 26k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at
