Current voice cloning methods can synthesize a new speaker's voice with text-to-speech (TTS), but they do not manipulate the expressiveness of the synthesized speech. Voice cloning is the task of learning to synthesize the speech of an unseen speaker from as little training data as possible.
UC San Diego researchers propose a controllable voice cloning method that offers fine-grained control over several style aspects of synthetic speech for an unseen speaker. During training, the voice synthesis model is explicitly conditioned on a speaker encoding, a pitch contour, and latent style tokens.
Personal assistants for smartphones, homes, and cars can benefit from voice cloning technology. It could also improve voice-overs in animated films and automatic speech translation across languages, and it could be used to build personalized speech interfaces for people who have lost the ability to speak.
The expressive voice cloning framework is a multi-speaker TTS approach based on speaker encodings and speech style features. A prominent method of style conditioning in expressive TTS models is learning a dictionary of latent style vectors known as Global Style Tokens (GST). GSTs can learn meaningful latent codes when trained on a dataset with large diversity in expression, but when trained on a large multi-speaker dataset with predominantly neutral prosody, they offer only minimal style control.
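As a rough illustration of the GST idea, the sketch below computes a style embedding as an attention-weighted combination of a token bank. The token bank and the reference encoding are randomly initialized NumPy stand-ins for quantities a real model would learn; the sizes are illustrative assumptions, not the paper's hyperparameters.

```python
import numpy as np

# Hypothetical sizes: 10 style tokens, 64-dim embeddings.
NUM_TOKENS, DIM = 10, 64
rng = np.random.default_rng(0)

# Learned bank of Global Style Tokens (random stand-in here).
style_tokens = rng.standard_normal((NUM_TOKENS, DIM))

def style_embedding(reference, tokens):
    """Attend over the token bank using a reference encoding as the query."""
    scores = reference @ tokens.T / np.sqrt(tokens.shape[1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # The style embedding is a convex combination of the tokens.
    return weights, weights @ tokens

reference = rng.standard_normal(DIM)  # stand-in for a reference-audio encoding
weights, style = style_embedding(reference, style_tokens)
```

At inference time, the attention weights can also be set by hand instead of being computed from reference audio, which is what makes the tokens a control knob for style.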
Three main components of the framework are:
- Speaker Encoder
- Mel Spectrogram Synthesizer
- Vocoder

SPEAKER ENCODER:
In multi-speaker TTS models, speaker conditioning is often accomplished through a lookup in a speaker embedding matrix that is randomly initialized and trained end-to-end with the synthesizer. To tailor a multi-speaker TTS model for voice cloning, this speaker embedding layer can be replaced by a speaker encoder that extracts speaker-specific information from a target waveform. The speaker encoder can then produce embeddings for speakers not observed during training from just a few reference speech samples.
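A minimal sketch of this idea, assuming a generic frame-level feature extractor: each reference clip is reduced to a fixed-size, L2-normalized vector, and the per-clip vectors are averaged into a single speaker embedding. The arrays below are random stand-ins for real reference-audio features.

```python
import numpy as np

rng = np.random.default_rng(1)

def speaker_embedding(frames):
    """Average frame-level features and L2-normalize, so variable-length
    audio yields a fixed-size speaker vector."""
    v = frames.mean(axis=0)
    return v / np.linalg.norm(v)

# Three short reference clips from an unseen speaker (fake 40-dim frames,
# different lengths to mimic variable-duration audio).
clips = [rng.standard_normal((50 + 10 * i, 40)) for i in range(3)]
embeddings = np.stack([speaker_embedding(c) for c in clips])

# Averaging per-clip embeddings gives the final conditioning vector.
speaker_vec = embeddings.mean(axis=0)
speaker_vec /= np.linalg.norm(speaker_vec)
```

Because the embedding has a fixed size regardless of clip length or count, the synthesizer can be conditioned on unseen speakers without retraining.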
MEL SPECTROGRAM SYNTHESIZER:
The researchers condition the TTS synthesis model on a speaker encoding and several style features. To this end, they adapt the Mellotron synthesis framework to the voice cloning problem. Mellotron is a multi-speaker TTS model that adds pitch-contour and speaker-embedding conditioning to the Tacotron 2 GST model. To adapt Mellotron for voice cloning, the speaker embedding layer is replaced with a speaker encoder network.
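The combined conditioning can be sketched as concatenating per-step features (text encodings, pitch contour) with global vectors (speaker and style embeddings) broadcast across time. The dimensions below are illustrative assumptions, not the paper's actual hyperparameters.

```python
import numpy as np

T_TEXT, D_TEXT, D_SPK, D_STYLE = 20, 512, 256, 64
rng = np.random.default_rng(2)

text_enc = rng.standard_normal((T_TEXT, D_TEXT))   # encoder output per character
speaker_vec = rng.standard_normal(D_SPK)           # from the speaker encoder
style_vec = rng.standard_normal(D_STYLE)           # from the GST module
pitch = rng.standard_normal((T_TEXT, 1))           # pitch contour, one value per step

# Broadcast the global vectors across time and concatenate with the
# per-step features, producing the conditioned synthesizer input.
cond = np.concatenate(
    [text_enc,
     np.tile(speaker_vec, (T_TEXT, 1)),
     np.tile(style_vec, (T_TEXT, 1)),
     pitch],
    axis=1)
```

Keeping pitch as a per-step signal is what allows fine-grained prosody control, while the speaker and style vectors shape global characteristics of the utterance.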
The synthesis model is based on Tacotron 2, an LSTM-based sequence-to-sequence model with an encoder that operates on a sequence of characters and a decoder that outputs mel-spectrogram frames one at a time.
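The decoder's frame-by-frame operation can be illustrated with a toy autoregressive loop; `decoder_step` here is a hypothetical stand-in for the real LSTM cell with attention and stop-token prediction.

```python
import numpy as np

rng = np.random.default_rng(3)
N_MELS, MAX_FRAMES = 80, 100  # 80 mel bins is a common choice

def decoder_step(prev_frame, step):
    """Stand-in for the decoder cell: a real model conditions on the
    attention context and hidden state; here we emit toy frames."""
    frame = np.tanh(prev_frame * 0.5 + rng.standard_normal(N_MELS) * 0.1)
    stop = step >= 7  # toy stand-in for the stop-token prediction
    return frame, stop

# Autoregressive loop: each generated mel frame is fed back as input
# for the next step, until the stop token fires.
frames, prev = [], np.zeros(N_MELS)
for t in range(MAX_FRAMES):
    prev, stop = decoder_step(prev, t)
    frames.append(prev)
    if stop:
        break
mel = np.stack(frames)  # (num_frames, n_mels) spectrogram
```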
VOCODER:
The research team employed a WaveGlow model trained on the single-speaker Sally dataset to decode the generated mel-spectrograms into listenable waveforms. WaveGlow allows real-time inference while remaining competitive in audio naturalness. The same vocoder model is employed across all tests and datasets, and this single-speaker vocoder is found to generalize well to all speakers in the evaluation datasets.
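WaveGlow is a flow-based model built from invertible affine coupling layers. The toy sketch below, with random linear maps standing in for its mel-conditioned WaveNet-like layers, shows why such a layer is exactly invertible: half the channels pass through unchanged and parameterize an affine transform of the other half.

```python
import numpy as np

rng = np.random.default_rng(4)

def coupling_forward(x, scale, shift):
    """One affine coupling layer: the first half passes through unchanged
    and parameterizes an invertible affine transform of the second half."""
    a, b = np.split(x, 2)
    return np.concatenate([a, b * np.exp(scale(a)) + shift(a)])

def coupling_inverse(y, scale, shift):
    """Exact inverse: recover the second half from the untouched first half."""
    a, b = np.split(y, 2)
    return np.concatenate([a, (b - shift(a)) * np.exp(-scale(a))])

# Toy "networks": in WaveGlow these are mel-conditioned WaveNet-like layers.
W_s = rng.standard_normal((8, 8)) * 0.1
W_t = rng.standard_normal((8, 8)) * 0.1
scale = lambda a: W_s @ a
shift = lambda a: W_t @ a

x = rng.standard_normal(16)          # stand-in for audio samples
y = coupling_forward(x, scale, shift)
x_rec = coupling_inverse(y, scale, shift)
```

Invertibility is what lets such models train by maximum likelihood on audio and then generate waveforms by running the flow in reverse from noise.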
While such a system can achieve promising results in cloning speech while preserving speaker-specific qualities, it lacks control over aspects of speech that are not encoded in the text or the speaker embedding, such as tone, speaking rate, intensity, and emotion.
Based on quantitative and qualitative evaluations, the researchers claim that the proposed framework can perform a range of expressive voice cloning tasks using only a few transcribed or untranscribed speech samples from a new speaker. The approach still has room for improvement, however: it favors English speakers and struggles with heavily accented speech.