Amazon Researchers Developed a Universal Model Integration Framework That Allows Production Voice Models to Be Customized Quickly and at Scale

This summary article is based on Amazon research 'Scalable framework lets multiple text-to-speech models coexist'


Alexa and other voice assistants frequently use a range of speech synthesizers, which vary in expressivity, personality, language, and speaking style. The machine learning models that underpin these applications can have vastly different architectures, and integrating them into a single voice service is time-consuming and difficult.

New Amazon research presents a universal model integration framework that enables quick, scalable customization of production voice models.

Modern voice models often use two massive neural networks to synthesize speech from text inputs.

  1. The first network is known as an acoustic model. It takes text as input and produces a mel-spectrogram, an image showing acoustic properties like pitch and energy of speech over time. 
  2. The second network is referred to as a vocoder. It takes the mel-spectrogram as an input and outputs an audio waveform of speech.
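The two-stage pipeline above can be sketched as follows. This is a minimal stand-in, not Amazon's implementation: the frame count per character, the 80 mel bins, and the hop length of 256 samples are illustrative assumptions, and both networks are replaced by trivial placeholder functions.

```python
N_MELS = 80        # mel-frequency bins per spectrogram frame (assumed)
HOP_LENGTH = 256   # audio samples covered by each frame (assumed)

def acoustic_model(text):
    """Stage 1 stand-in: text -> mel-spectrogram.
    A real acoustic model is a large neural network; here each
    character simply yields 10 frames of N_MELS zeros."""
    n_frames = 10 * len(text)
    return [[0.0] * N_MELS for _ in range(n_frames)]

def vocoder(mel_frames):
    """Stage 2 stand-in: mel-spectrogram -> audio waveform.
    Each spectrogram frame expands to HOP_LENGTH audio samples."""
    return [0.0] * (len(mel_frames) * HOP_LENGTH)

mel = acoustic_model("Hello")   # 50 frames of 80 bins
audio = vocoder(mel)            # 50 * 256 = 12800 samples
```

The key structural point is the interface between the stages: any acoustic model that emits mel-spectrogram frames can be paired with any vocoder that consumes them.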

To achieve this diversity of voices, the team employed separate acoustic-model architectures.

Modern systems directly model the durations of text chunks and generate speech frames in parallel. This has proven more efficient and stable than feeding already generated frames back in as input. The model simply "upsamples", or repeats, its encoding of a text chunk to align the text and speech sequences.
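Duration-based upsampling reduces to repeating each token encoding by its predicted duration. A minimal sketch, using strings in place of encoding vectors and hand-picked durations in place of a duration predictor's output:

```python
def upsample(encodings, durations):
    """Repeat each text-token encoding durations[i] times so the short
    text sequence aligns with the longer speech-frame sequence."""
    frames = []
    for enc, dur in zip(encodings, durations):
        frames.extend([enc] * dur)
    return frames

# Three tokens predicted to span 2, 1 and 3 speech frames respectively:
frames = upsample(["h", "e", "y"], [2, 1, 3])
# frames is now ["h", "h", "e", "y", "y", "y"]
```

Because the aligned sequence length is known before decoding begins, the decoder can produce frames in parallel rather than one step at a time.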

With the constant evolution of complex TTS models, developing a scalable framework that can manage them all has become important.

A component that accepts an input text utterance and returns a mel-spectrogram is necessary to integrate acoustic models into production. The integration framework ensures that all components are run in the correct order. It also allows the use of various hardware accelerators based on the component versions.

Some of the issues faced in these steps are:

  • Speech is frequently generated in fragments rather than synthesized in its entirety, so the framework should return data as quickly as possible to reduce latency. A simplistic solution that wraps the entire model in code and processes everything in a single function call would be unreasonably slow.
  • Adapting the approach to work with various hardware accelerators.
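The first issue, returning audio in fragments, maps naturally onto a generator. The sketch below is an assumption about how such streaming could look, with a placeholder per-chunk vocoder; the chunk size and hop length are illustrative, and a real system would also need to handle overlap between chunks to avoid audible seams.

```python
def synthesize_streaming(mel_frames, chunk_frames=32, hop_length=256):
    """Yield audio as soon as each block of spectrogram frames is
    decoded, instead of waiting for the whole utterance."""
    for start in range(0, len(mel_frames), chunk_frames):
        block = mel_frames[start:start + chunk_frames]
        # Placeholder vocoder call: one chunk of silence per block.
        yield [0.0] * (len(block) * hop_length)

# 70 frames split into blocks of 32, 32 and 6 frames:
chunks = list(synthesize_streaming([[0.0]] * 70, chunk_frames=32))
```

The caller can begin playback after the first chunk arrives, which is what keeps perceived latency low.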

As a result, the TTS model is decomposed into a series of more specialized integration components that together handle all the necessary functionality. These include logic to split longer utterances into smaller chunks that fit specified input sizes; padding logic; and decisions about which functionality belongs in the model and which in the integration layer.

The model is encapsulated in the integration layer, which consists of components capable of converting an input utterance into a mel-spectrogram. Because the model normally runs in two stages, preprocessing data and creating data on demand, the team used the following two components:

  1. A SequenceBlock that receives a tensor as input and outputs a modified tensor.
  2. A StreamableBlock that creates data on demand. It accepts the output of another StreamableBlock and/or data generated by a SequenceBlock as input.
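In Python terms, the two abstractions might look like the following. The interfaces and the toy subclasses are assumptions for illustration; the article does not show the actual class definitions.

```python
class SequenceBlock:
    """Preprocessing component: takes a tensor, returns a modified tensor."""
    def process(self, tensor):
        raise NotImplementedError

class StreamableBlock:
    """On-demand component: yields output incrementally, consuming the
    output of another StreamableBlock and/or a SequenceBlock."""
    def stream(self, inputs):
        raise NotImplementedError

class Normalize(SequenceBlock):
    """Toy SequenceBlock: scales every value into [0, 1]."""
    def process(self, tensor):
        peak = max(tensor)
        return [x / peak for x in tensor]

class FrameEmitter(StreamableBlock):
    """Toy StreamableBlock: yields one value at a time, on demand."""
    def stream(self, inputs):
        for value in inputs:
            yield value

normalized = Normalize().process([2.0, 4.0, 8.0])   # [0.25, 0.5, 1.0]
first = next(FrameEmitter().stream(normalized))      # 0.25
```

The split mirrors the two run stages: SequenceBlocks do their work once, up front, while StreamableBlocks produce results only as they are consumed.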

The acoustic model is made up of:

  • two encoders (SequenceBlocks), one for the text and one for the predicted durations, that turn the input text embedding into one-dimensional representation tensors
  • an upsampler that constructs intermediate, speech-length sequences from the encoders’ outputs
  • a decoder (a StreamableBlock) that generates the mel-spectrogram frames

StreamablePipeline is a specialized StreamableBlock that comprises exactly one SequenceBlock and one StreamableBlock.
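A sketch of what such a pairing could look like, under the assumption that SequenceBlocks expose a `process` method and StreamableBlocks a `stream` method (the duck-typed helper classes here are invented for the example):

```python
class StreamablePipeline:
    """Pairs exactly one SequenceBlock with one StreamableBlock: the
    sequence block runs once up front, then the streamable block
    generates output on demand from its result."""
    def __init__(self, sequence_block, streamable_block):
        self.sequence_block = sequence_block
        self.streamable_block = streamable_block

    def stream(self, tensor):
        processed = self.sequence_block.process(tensor)
        yield from self.streamable_block.stream(processed)

class Doubler:
    """Plays the SequenceBlock role for this example."""
    def process(self, tensor):
        return [x * 2 for x in tensor]

class Emitter:
    """Plays the StreamableBlock role for this example."""
    def stream(self, tensor):
        yield from tensor

pipe = StreamablePipeline(Doubler(), Emitter())
out = list(pipe.stream([1, 2, 3]))   # [2, 4, 6]
```

Since a StreamablePipeline is itself a StreamableBlock, pipelines can be nested inside larger stacks of components.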

The acoustic model is made available as a plugin, known as an "addon." An addon consists of exported neural networks together with configuration data. The "stack" configuration attribute specifies how integration components should be connected to form a working integration layer.
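As an illustration, such a "stack" configuration might look like the JSON below. The component names and the schema are assumptions, since the article does not show the actual addon format; only the idea that the stack lists components and their wiring comes from the source.

```json
{
  "stack": [
    {"type": "SequenceBlock",   "name": "text_encoder"},
    {"type": "SequenceBlock",   "name": "duration_encoder"},
    {"type": "StreamableBlock", "name": "upsampler",
     "inputs": ["text_encoder", "duration_encoder"]},
    {"type": "StreamableBlock", "name": "decoder",
     "inputs": ["upsampler"]}
  ]
}
```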

The JSON format makes simple changes easy, and experiments with components that add extra diagnostics or digital-signal-processing effects are also straightforward to carry out. The framework architecture even permits substituting a single StreamableBlock with an entire hierarchical sequence-to-sequence stack.

Reference: https://www.amazon.science/blog/text-to-speech-models-coexist-thanks-to-scalable-framework