Researchers at JAIST, the Japan Advanced Institute of Science and Technology, Have Proposed a Model That Allows Voice Mimicry and Control of the Generated Speech’s Speaker Identity

Voice conversion (VC) is a method used to modify the speaker’s identity in speech without altering the linguistic content. Non-linguistic information is vital for natural (human-to-human) communication. By changing non-linguistic information, such as adding emotion to speech, VC can make human-machine communication sound more natural. This allows people to extract more information from speech and thus socialize better.

Humans use many languages for communication, and we often need machine translators for speech-to-speech conversion. Prof. Akagi from JAIST explains that conventional (monolingual) VC models face challenges when applied to a “cross-lingual” VC (CLVC) task. For example, changing the speaker’s identity leads to an undesirable modification of the linguistic information.

Additionally, previous models do not account for cross-lingual differences in the “F0 contour.” F0 refers to the fundamental frequency at which the vocal cords vibrate during voiced sounds, an essential cue for speech perception. Moreover, these models do not guarantee the desired speaker identity in the output speech.
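To make F0 concrete: it is the pitch period of the voice, and it can be estimated from a short frame of audio. The sketch below is a minimal, illustrative autocorrelation pitch estimator in Python with NumPy — not the method used in the JAIST model, just a way to see what an F0 value is.

```python
import numpy as np

def estimate_f0(frame, sample_rate, f0_min=50.0, f0_max=500.0):
    """Estimate the fundamental frequency (F0) of a voiced frame via autocorrelation.

    Illustrative only: production pitch trackers are far more robust than this sketch.
    """
    frame = frame - np.mean(frame)
    # Autocorrelation for non-negative lags.
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / f0_max)  # shortest plausible pitch period, in samples
    lag_max = int(sample_rate / f0_min)  # longest plausible pitch period, in samples
    peak_lag = lag_min + np.argmax(corr[lag_min:lag_max])
    return sample_rate / peak_lag

# A 220 Hz sine wave stands in for a voiced sound with F0 = 220 Hz.
sr = 16000
t = np.arange(0, 0.1, 1.0 / sr)
f0_est = estimate_f0(np.sin(2 * np.pi * 220.0 * t), sr)
```

An F0 contour is simply this value computed frame by frame over an utterance; cross-lingual VC must transform that contour, not just the spectral features.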

The JAIST researchers have recently proposed a new model suitable for CLVC that allows both mimicry and control of the generated speech’s speaker identity. The new model marks a significant improvement over their previous VC model.

The new model applies language embedding (mapping each language to a mathematical vector representation) to separate language information from speaker individuality, together with F0 modeling that offers control over the F0 contour.
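A language embedding is, at its simplest, a learned lookup table: each language ID maps to a trainable vector that the network consumes. The sketch below shows the idea with randomly initialized vectors; in the actual model these vectors are optimized jointly with the rest of the network, and the language codes and dimensionality here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: three languages, each mapped to an 8-dimensional vector.
# Randomly initialized here; in training these rows would be learned parameters.
languages = {"en": 0, "ja": 1, "vi": 2}
embed_dim = 8
language_embeddings = rng.standard_normal((len(languages), embed_dim))

def embed_language(lang):
    """Look up the embedding vector for a language code."""
    return language_embeddings[languages[lang]]

vec = embed_language("ja")
```

Because each language gets its own vector, the rest of the network can condition on “which language” separately from “which speaker,” which is what lets the model disentangle the two.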

Unlike the previous model, which uses a variational autoencoder (VAE), the proposed technique employs a star generative adversarial network (StarGAN), a deep learning-based training model. StarGAN pits two competing networks against each other, each pushing the other to produce improved iterations until the generated samples become indistinguishable from natural ones.
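The adversarial principle behind StarGAN can be seen in a toy example. Below is a minimal vanilla GAN on one-dimensional data, written from scratch in NumPy: a linear generator tries to match a target distribution while a logistic discriminator tries to tell real from fake, and both are updated with hand-derived gradients. This is a sketch of the adversarial training idea only — StarGAN additionally conditions both networks on domain labels (here, speaker/language), which this toy omits.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy stand-in for real speech features: scalars drawn from N(3, 0.5).
def sample_real(n):
    return rng.normal(3.0, 0.5, size=n)

a, b = 1.0, 0.0   # generator: x = a*z + b
w, c = 0.1, 0.0   # discriminator: D(x) = sigmoid(w*x + c)
lr = 0.05

for step in range(2000):
    # --- Discriminator update: push D(real) -> 1, D(fake) -> 0 ---
    x_real = sample_real(32)
    z = rng.normal(size=32)
    x_fake = a * z + b
    d_real = sigmoid(w * x_real + c)
    d_fake = sigmoid(w * x_fake + c)
    # Gradients of the binary cross-entropy loss w.r.t. w and c.
    grad_w = np.mean((d_real - 1) * x_real) + np.mean(d_fake * x_fake)
    grad_c = np.mean(d_real - 1) + np.mean(d_fake)
    w -= lr * grad_w
    c -= lr * grad_c

    # --- Generator update: push D(fake) -> 1 (non-saturating loss) ---
    z = rng.normal(size=32)
    x_fake = a * z + b
    d_fake = sigmoid(w * x_fake + c)
    # Chain rule through the discriminator into the generator parameters.
    grad_a = np.mean((d_fake - 1) * w * z)
    grad_b = np.mean((d_fake - 1) * w)
    a -= lr * grad_a
    b -= lr * grad_b

# After training, generated samples should drift toward the real mean (~3).
gen_mean = np.mean(a * rng.normal(size=1000) + b)
```

The same push-and-pull, scaled up to neural networks over speech features and conditioned on target speaker and language, is what drives the proposed CLVC model’s training.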

Flow chart of StarGAN training process.

The new model can be trained in an end-to-end fashion, with the language embedding optimized directly during training. It also allows reasonable control over speaker identity. The F0 conditioning helps remove the language dependency of speaker individuality, enhancing its controllability.
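One common way to realize this kind of conditioning is to broadcast the utterance-level speaker and language embeddings across frames and concatenate them with the per-frame content features and the F0 contour, so every frame the generator sees carries all four signals. The sketch below shows that assembly step; all dimensions and feature names are illustrative assumptions, not the paper’s exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes for one utterance of T frames (all sizes illustrative).
T = 100
content = rng.standard_normal((T, 40))   # per-frame linguistic/content features
f0 = rng.uniform(80, 300, size=(T, 1))   # target F0 contour, one value per frame
speaker = rng.standard_normal(16)        # target speaker embedding (utterance-level)
language = rng.standard_normal(8)        # target language embedding (utterance-level)

# Tile the utterance-level embeddings across frames, then concatenate so each
# frame carries content + F0 + speaker + language conditioning.
speaker_tiled = np.tile(speaker, (T, 1))
language_tiled = np.tile(language, (T, 1))
generator_input = np.concatenate(
    [content, f0, speaker_tiled, language_tiled], axis=1
)
```

Feeding F0 in explicitly like this is what lets the model treat pitch as a controllable input rather than something entangled with speaker and language.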

Overview of processing flow of proposed model

Prof. Akagi explains that their findings could be directly applied to protecting a speaker’s privacy, adding a sense of urgency to speech during an emergency, restoring a patient’s voice after surgery, cloning the voices of historical figures, anonymizing one’s identity, and much more. He aims to further improve the controllability of the model in future research.