Google AI Introduces ‘Translatotron 2’, A Neural Direct Speech-To-Speech Translation Model Without The Deepfake Potential

Source: https://arxiv.org/pdf/2107.08661.pdf

Google has been working in artificial intelligence for quite some time now and has achieved some notable results. One such creation is Translatotron, released in 2019. Translatotron is an AI system capable of translating a person's speech directly into another language. It produces synthesized translations that retain the original speaker's voice, keeping that originality intact. But this benefit came with one significant drawback: the system could also generate speech in a different voice, leaving it open to potential misuse such as deepfakes.

The New System Translatotron 2 

Google now claims to have solved this problem with Translatotron 2. The new AI system addresses the misuse issue because it is restricted to keeping the source speaker's voice unchanged. Translation quality and the naturalness of the sound have also been improved by reducing undesirable artefacts such as babbling and long pauses in the speech. On top of this, the new system outperforms the original by a large margin.

New Elements 

In their paper, the AI researchers describe several new elements:

  • Source Speech Encoder
  • Target Phoneme Decoder
  • A synthesizer that is connected via an attention module

These elements work in concert: the encoder and the decoder process the data fed into the system, and the attention module then weighs how relevant each piece of the input is. After this systematic process, the system generates its output.
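The relevance weighting the attention module performs can be illustrated with a standard dot-product attention computation. This is a generic, hypothetical sketch (plain Python, toy data), not Google's implementation: a decoder-side query is scored against each encoder output, and a softmax turns the scores into relevance weights.

```python
import math

def attention_weights(query, keys):
    # Dot-product score of the decoder state (query) against each
    # encoder output (key), normalised into weights with a softmax.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: four encoder frames of dimension 3, one decoder query.
keys = [[1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
        [0.9, 0.1, 0.0],
        [0.0, 0.0, 1.0]]
query = [1.0, 0.0, 0.0]

weights = attention_weights(query, keys)
# The weights sum to 1 and peak on the frames most similar to the query.
```

The weighted sum of the encoder outputs under these weights is the "context" that gets passed downstream.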

The encoder creates a numerical representation of the source speech, and the decoder predicts the phoneme sequence of the translated speech (phonemes are the smallest units of sound that distinguish one word from another in a given language). The synthesizer comes into play at this point: it takes the decoder's output, together with the attention context, and synthesizes the translated voice.
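The encoder → decoder → synthesizer data flow described above can be sketched as a pipeline of stages. All three functions below are stand-in stubs with made-up behaviour, intended only to show how the outputs of one stage feed the next; they are not the paper's models.

```python
def encode(speech_frames):
    # Source speech encoder (stub): turn raw frames into a numerical
    # representation -- here, trivially peak-normalised copies.
    peak = max(abs(x) for x in speech_frames) or 1.0
    return [x / peak for x in speech_frames]

def decode_phonemes(encoded):
    # Target phoneme decoder (stub): map the encoded speech to a
    # phoneme sequence in the target language (a toy constant here).
    return ["t", "a"] if encoded else []

def synthesize(phonemes, context):
    # Synthesizer (stub): combine the decoder's phonemes with the
    # attention context to produce the translated output.
    return [(p, round(c, 2)) for p, c in zip(phonemes, context)]

frames = [0.2, -0.4, 0.8]
encoded = encode(frames)                          # encoder stage
phonemes = decode_phonemes(encoded)               # decoder stage
audio = synthesize(phonemes, encoded[:len(phonemes)])  # synthesizer stage
```

The point of the sketch is the wiring: the synthesizer consumes both the phoneme sequence and the encoder-side context, which is what lets it produce speech conditioned on the source voice.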


Restricting Deepfakes in Translation 

As for the issue of deepfakes, in which the generated speech is in a different speaker's voice, the researchers designed the system to retain the original speaker's voice. To do this, they took a broader view and developed a method that does not rely on explicit speaker IDs (the technique used in the original Translatotron). Because of this, the Google researchers claim that Translatotron 2 provides a safer environment for producing translated speech and mitigates potential abuse of the AI system.

The researchers also note that voice conversion has become increasingly capable in recent years. Quality has reached the point where automatic speaker-verification systems often cannot detect whether speech is original or has been altered in some way. Progress therefore has to be made so that the systems themselves do not allow misuse, and Translatotron 2 is claimed to do exactly that. It is a deliberate effort by the researchers to guard against deepfakes as media-generation techniques continue to improve, and it could be a breakthrough in the field if successful.

Paper: https://arxiv.org/pdf/2107.08661.pdf

Project Sample: https://google-research.github.io/lingvo-lab/translatotron2/