Coqui Introduces ‘YourTTS’: A Zero-Shot Text-to-Speech Model With State-of-The-Art (SOTA) Results


Recent advances in end-to-end deep learning models have enabled new and intriguing Text-to-Speech (TTS) use cases with excellent, natural-sounding results. However, most of these models are trained on large datasets recorded with a single speaker in a professional setting. Under these conditions, scaling a solution to many languages and speakers is not viable for everyone, and the situation is even harder for low-resource languages that mainstream research rarely studies.

Coqui’s team has designed ‘YourTTS’ to overcome these limits and bring zero-shot TTS to low-resource languages. It can synthesize voices in multiple languages and drastically reduce data requirements by transferring knowledge across the speakers and languages in its training set.

YourTTS builds on the VITS (Variational Inference with Adversarial Learning for End-to-End Text-to-Speech) model, which serves as its backbone architecture. Compared to the original model, the team uses a larger text encoder.

VITS is a unique TTS model. It combines several deep-learning approaches (adversarial learning, normalizing flows, variational autoencoders, and transformers) to produce high-quality, natural-sounding output. It is based primarily on the GlowTTS model. GlowTTS is small, robust to long sentences, converges quickly, and is theoretically sound because it maximizes the log-likelihood of speech jointly with its alignment. Its major flaw is the output’s lack of naturalness and expressivity.
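Maximizing an exact log-likelihood is possible in flow-based models like GlowTTS because of the change-of-variables formula. As a minimal, hypothetical sketch (a single elementwise affine flow standing in for GlowTTS’s much deeper flow stack):

```python
import numpy as np

def affine_flow_log_likelihood(x, scale, shift):
    """Exact log-likelihood of x under an elementwise affine flow.

    The flow maps x -> z = (x - shift) / scale, with a standard-normal
    prior on z. By the change-of-variables formula:
        log p(x) = log N(z; 0, I) + log |det dz/dx|
    and for an elementwise affine map, log |det dz/dx| = -sum(log scale).
    """
    z = (x - shift) / scale
    log_prior = -0.5 * np.sum(z ** 2 + np.log(2 * np.pi))
    log_det = -np.sum(np.log(scale))
    return log_prior + log_det

# Toy 3-dimensional "spectrogram frame"; real models work on full sequences.
x = np.array([0.5, -1.2, 0.3])
ll = affine_flow_log_likelihood(x, scale=np.array([2.0, 1.5, 0.5]), shift=np.zeros(3))
```

Because every term is computed exactly (no variational bound), training reduces to maximizing this quantity over the data, which is what makes the approach theoretically sound.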


VITS enhances it with new features. First, it replaces the duration predictor with a stochastic duration predictor, which better represents speech variability. It then connects a HiFi-GAN vocoder to the decoder’s output and uses a variational autoencoder (VAE) to merge the two. This lets the model train end-to-end and learn a more accurate intermediate representation than the typical Mel-spectrogram, yielding precise and natural prosody.
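The VAE piece of this pipeline rests on two standard ingredients: the reparameterization trick for sampling latents, and a KL term that keeps the posterior close to the prior. A minimal NumPy sketch, with hypothetical dimensions (5 frames, 80-dim latents) chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps (the reparameterization trick),
    which keeps sampling differentiable with respect to mu and log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def gaussian_kl(mu, log_var):
    """KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior,
    computed in closed form."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

# Hypothetical frame-level posterior statistics from an encoder.
mu = rng.standard_normal((5, 80)) * 0.1
log_var = np.full((5, 80), -2.0)

z = reparameterize(mu, log_var, rng)   # latent fed to the decoder/vocoder
kl = gaussian_kl(mu, log_var)          # regularizer added to the training loss
```

In VITS the decoder consuming `z` is the HiFi-GAN generator, so the waveform loss back-propagates through this latent all the way to the text-side encoder.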

Further, to pass speaker information to the rest of the model, YourTTS uses an independently trained speaker encoder to compute speaker embedding vectors (d-vectors). The speaker encoder architecture is based on the H/ASP model.
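A d-vector is typically obtained by pooling a speaker encoder’s frame-level outputs into one fixed-length, unit-norm vector. A hedged sketch of that pooling step (the 100×512 frame matrix is a stand-in for real encoder activations):

```python
import numpy as np

def d_vector(frame_embeddings):
    """Pool frame-level speaker-encoder outputs into a single d-vector:
    average over time, then L2-normalize so comparisons reduce to
    cosine similarity."""
    v = frame_embeddings.mean(axis=0)
    return v / np.linalg.norm(v)

# Hypothetical encoder output: 100 frames of 512-dim activations.
frames = np.random.default_rng(1).standard_normal((100, 512))
spk = d_vector(frames)  # one unit-length embedding for the whole utterance
```

The resulting vector conditions the rest of the TTS model, so any voice the encoder can embed becomes a potential synthesis target.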

The researchers integrated multiple datasets for different languages including, VCTK and LibriTTS for English, TTS-Portuguese Corpus (TPC) for Brazilian Portuguese, and the French component of the M-AILABS dataset (FMAI). 

They trained YourTTS in stages, starting with a single-speaker English dataset and gradually adding more speakers and languages. They then fine-tuned the final model for each dataset using the Speaker Consistency Loss (SCL). Using cosine similarity, SCL compares the embeddings of the output speech to ground-truth embeddings produced by the speaker encoder.
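One common way to formulate such a loss is to penalize low cosine similarity between the two embeddings; the sketch below follows that pattern, with `alpha` as a hypothetical weighting constant (the paper’s exact scaling may differ):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def speaker_consistency_loss(pred_emb, gt_emb, alpha=9.0):
    """SCL-style loss: 0 when the embedding of the generated speech
    matches the ground-truth speaker embedding exactly, and larger the
    further apart they are. alpha is an assumed weighting constant."""
    return alpha * (1.0 - cosine_similarity(pred_emb, gt_emb))
```

During fine-tuning this term is added to the regular TTS loss, pushing the synthesized audio to land near the target speaker in embedding space.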

To assess the model’s performance, the researchers ran “mean opinion score” (MOS) and “similarity MOS” tests. They also used Speaker Encoder Cosine Similarity (SECS) to compare predicted outputs with real audio clips of the target speaker. YourTTS was compared against AttentronZS and SC-GlowTTS: it exceeds both, and in many cases it even achieves a higher MOS than the real speech clips in the dataset.
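SECS itself is straightforward once speaker embeddings are available: it is the cosine similarity between the embedding of the synthesized clip and that of a real clip from the target speaker. A minimal sketch, assuming both embeddings come from the same speaker encoder:

```python
import numpy as np

def secs(emb_pred, emb_ref):
    """Speaker Encoder Cosine Similarity between the embedding of a
    synthesized clip and a reference clip. Ranges from -1 to 1; higher
    means the synthesized voice is closer to the target speaker."""
    return float(np.dot(emb_pred, emb_ref) /
                 (np.linalg.norm(emb_pred) * np.linalg.norm(emb_ref)))
```

Unlike MOS, which requires human listeners, SECS is fully automatic, which makes it convenient for comparing many model variants.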

The findings show good MOS values for zero-shot voice conversion, whether between English and Portuguese speech or between male and female voices. After several tests, the team found that YourTTS needs only about 20 seconds of a speaker’s speech to adapt the model and produce high-quality output in that speaker’s voice.
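In practice, that 20-second budget translates into a small, fixed number of audio samples to keep as the reference clip. A trivial sketch, assuming a 22,050 Hz sample rate (a common choice for TTS corpora, not stated in the article):

```python
import numpy as np

SAMPLE_RATE = 22050  # assumed sample rate; actual pipelines may use 16 kHz or other rates

def trim_reference(wav, seconds=20, sample_rate=SAMPLE_RATE):
    """Keep at most `seconds` of audio to use as the adaptation/reference clip."""
    return wav[: int(seconds * sample_rate)]

# A hypothetical one-minute recording, trimmed to the 20 seconds the model needs.
one_minute = np.zeros(60 * SAMPLE_RATE)
ref = trim_reference(one_minute)
```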