Amazon Introduces On-Device Speech Processing, Bringing the Benefits of “Processing on the Edge” to Its Customers

This article is based on the research article 'On-device speech processing makes Alexa faster, lower-bandwidth' by Amazon

Inventing new technology to improve the customer experience remains a top priority for technology companies, and Amazon is no exception. On-device speech processing is one of the technologies its research team is developing for Alexa. It offers several advantages in latency, response time, and bandwidth utilization. In applications where internet connectivity is intermittent, these benefits give the service an extra edge. It may also make it possible to combine voice signals with other available modalities, such as Natural Turn-Taking.

Storage and computational capacity are virtually limitless in the cloud. As a result, cloud models can be huge and computationally intensive in order to maintain accuracy. Executing the same functions on-device means the models must take up less than 1% of the space of their cloud counterparts, with negligible accuracy loss. Amazon has released a new setting that allows customers to have the audio of their Alexa voice requests processed locally rather than sent to the cloud.


The on-device speech recognition model receives an acoustic speech signal and generates a list of hypotheses about what the speaker said, ranked by likelihood. These hypotheses are represented as a lattice: a graph whose edges represent recognized words and the probability that a given word follows the previous one. In the cloud-based approach, encrypted audio is streamed to the cloud in short snippets called “frames.” The lattice cannot be finalized until the customer has finished speaking, because words spoken later in the sequence can drastically alter a hypothesis’ total probability.
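To make the lattice idea concrete, here is a minimal, illustrative sketch (not Amazon's implementation) of a word lattice in which each edge carries a word and a log-probability, and full paths through the graph are ranked hypotheses. The node numbering and example phrases are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Edge:
    word: str          # recognized word on this edge
    log_prob: float    # log-probability of this word given the previous one
    next_node: int     # node this edge leads to

@dataclass
class Lattice:
    # adjacency list: node id -> outgoing edges
    edges: dict = field(default_factory=dict)

    def add_edge(self, src, word, log_prob, dst):
        self.edges.setdefault(src, []).append(Edge(word, log_prob, dst))

    def best_paths(self, start=0):
        """Enumerate complete paths, ranked by total log-probability."""
        results = []
        def walk(node, words, score):
            outs = self.edges.get(node, [])
            if not outs:                 # reached a final node
                results.append((score, " ".join(words)))
                return
            for e in outs:
                walk(e.next_node, words + [e.word], score + e.log_prob)
        walk(start, [], 0.0)
        return sorted(results, reverse=True)

# Two hypotheses sharing a prefix: "play hunger games" vs. "play hungry games"
lat = Lattice()
lat.add_edge(0, "play", -0.1, 1)
lat.add_edge(1, "hunger", -0.5, 2)
lat.add_edge(1, "hungry", -1.2, 3)
lat.add_edge(2, "games", -0.2, 4)
lat.add_edge(3, "games", -0.9, 4)
print(lat.best_paths()[0])  # highest-scoring hypothesis
```

Note how an edge later in the graph (the second "games", at -0.9) changes a full hypothesis' total score, which is why the lattice cannot be finalized mid-utterance.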

A model known as an end-pointer determines when the customer has finished speaking. An aggressive end-pointer initiates speech processing sooner, but it may cut the speaker off prematurely, resulting in a poor user experience.

Two end-pointers are run simultaneously:

  • Speculative end-pointer: tuned to initiate downstream processing earlier, reducing user-perceived latency, at the cost of a minor trade-off in accuracy.
  • Final end-pointer: slower but more reliable, it confirms (or corrects) the point at which the customer actually finished speaking.
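One simple way to picture the two end-pointers (a hedged sketch, not the actual models) is as two thresholds applied to a per-frame end-of-speech confidence score: the speculative end-pointer fires at a lower threshold so downstream work can start early, while the final end-pointer waits for a higher-confidence signal. The scores and thresholds below are invented for illustration.

```python
def endpoint_frame(eos_scores, threshold):
    """Return the first frame index where the end-of-speech score
    crosses the threshold, or None if it never does."""
    for i, score in enumerate(eos_scores):
        if score >= threshold:
            return i
    return None

# Hypothetical per-frame end-of-speech confidence from an acoustic model.
scores = [0.1, 0.2, 0.4, 0.55, 0.7, 0.9, 0.95]

speculative = endpoint_frame(scores, threshold=0.5)  # fires early: start downstream work
final = endpoint_frame(scores, threshold=0.9)        # fires later: confirm the endpoint

print(speculative, final)
```

The gap between the two trigger frames is exactly the latency the speculative end-pointer saves when its early decision turns out to be correct.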

Another aspect of ASR is ‘context awareness.’ Since the lattice, while recording several hypotheses, falls short of encoding all possible hypotheses, context awareness cannot be deferred to the cloud. The ASR system has to prune many low-probability hypotheses when building the lattice. Names of contacts or connected skills may be pruned away if context awareness isn’t built into the on-device model.

Initially, a shallow-fusion model was used to add context: as the algorithm constructs the lattice, it boosts the probabilities of contextually relevant words, such as contact or appliance names.


The researchers developed an end-to-end recurrent neural network-transducer (RNN-T) model that directly maps the input voice signal to an output word sequence. Using a single neural network significantly decreases the memory footprint. To attain the levels of accuracy and compression that would allow this system to process utterances on-device, the researchers had to design novel techniques for both inference and training.

The team created techniques that allow the neural network to learn and exploit audio context within a stream, improving the accuracy of on-device RNN-T ASR. They also implemented a novel discriminative loss and training algorithm to directly reduce the word error rate (WER) of RNN-T ASR.

However, for the RNN-T to run efficiently on-device, additional compression techniques had to be developed on top of these improvements. A neural network consists of simple, connected processing nodes. Each connection carries a weight that determines how much one node’s output contributes to the computation performed by the next.

One way to shrink a network’s memory footprint is to quantize its weights: divide the full range of weight values into a small number of intervals, each represented by a single value. It takes fewer bits to specify an interval than to specify many distinct floating-point values.

However, quantizing the weights only after the network has been trained can degrade accuracy. The research team therefore devised a quantization-aware training method that imposes a probability distribution on the network weights during training, allowing them to be quantized afterward with minimal performance impact. Unlike earlier quantization-aware training approaches, which largely account for quantization in the forward pass, theirs accounts for it in the backward direction as well, during weight updates, via network-loss regularization.
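One way such loss regularization could look (a hedged sketch of the general idea, not Amazon's actual method) is a penalty term that grows with each weight's distance to its nearest quantization level; the gradient of this penalty, felt during weight updates, nudges the weights toward quantizable values over the course of training.

```python
import numpy as np

def quantization_penalty(weights, num_levels=16):
    """Regularization term added to the training loss: the squared
    distance of each weight to its nearest quantization level. Its
    gradient (applied in the backward pass) pulls weights toward
    the quantization grid."""
    lo, hi = float(weights.min()), float(weights.max())
    step = (hi - lo) / (num_levels - 1)
    nearest = lo + np.round((weights - lo) / step) * step
    return float(np.sum((weights - nearest) ** 2))

rng = np.random.default_rng(0)
w = rng.standard_normal(1000)
# Weights already sitting on the quantization grid incur ~zero penalty.
lo, hi = w.min(), w.max()
step = (hi - lo) / 15
w_on_grid = lo + np.round((w - lo) / step) * step
print(quantization_penalty(w), quantization_penalty(w_on_grid))
```

A network trained with this penalty ends up with weights clustered near the grid points, so the post-training rounding step changes them very little.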

Another way to make neural networks more efficient is to reduce low weights to zero. Once again, doing this only after training hurts accuracy. The team therefore devised a sparsification strategy that gradually reduces low-value weights during training, yielding a model amenable to weight pruning.
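Gradual sparsification can be sketched as follows (an illustrative magnitude-pruning schedule with an invented ramp, not the paper's exact algorithm): at each training step, the smallest-magnitude weights are zeroed, with the fraction ramping from 0 to the target sparsity over training.

```python
import numpy as np

def prune_step(weights, target_sparsity, step, total_steps):
    """Zero out the smallest-magnitude weights, ramping the sparsity
    level linearly from 0 to `target_sparsity` over training."""
    current = target_sparsity * min(1.0, step / total_steps)
    k = int(current * weights.size)
    if k == 0:
        return weights
    threshold = np.sort(np.abs(weights).ravel())[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

rng = np.random.default_rng(0)
w = rng.standard_normal(100)
w_mid = prune_step(w, target_sparsity=0.5, step=50, total_steps=100)   # 25% zeroed
w_end = prune_step(w, target_sparsity=0.5, step=100, total_steps=100)  # 50% zeroed
print(np.mean(w_mid == 0), np.mean(w_end == 0))
```

Because the pruning pressure increases gradually, the remaining weights have time to adapt at each step, which is what makes the final pruned model accurate where one-shot post-training pruning would not be.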

Improved on-device efficiency is also becoming more critical. With this goal in mind, the researchers created a branching encoder network that converts audio inputs into numeric representations suited to speech classification using two separate neural networks: one complex, the other simple. The ASR model decides on the fly whether it can save computation by forwarding an input frame to the simple network.
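The routing idea can be sketched like this (the two stand-in "encoders" are just single matrix multiplies, and the difficulty score is assumed to come from some arbitrator model that is not shown; none of this is the actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the two encoders: a "complex" (wide) and a "simple" (narrow) net.
W_complex = rng.standard_normal((128, 40)) * 0.1
W_simple = rng.standard_normal((16, 40)) * 0.1

def encode_frame(frame, difficulty_score, threshold=0.5):
    """Route an input frame to the cheap encoder when a (hypothetical)
    difficulty estimate is low, saving computation; otherwise use the
    full encoder."""
    if difficulty_score < threshold:
        return np.tanh(W_simple @ frame), "simple"
    return np.tanh(W_complex @ frame), "complex"

frame = rng.standard_normal(40)
_, branch_easy = encode_frame(frame, difficulty_score=0.2)
_, branch_hard = encode_frame(frame, difficulty_score=0.8)
print(branch_easy, branch_hard)
```

The savings come from the fact that most audio frames (silence, steady background noise) are easy, so the expensive branch runs only on the minority of frames that need it.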

Co-designing software and hardware

If the underlying hardware can’t take advantage of quantization and sparsification, they make no difference to performance. The design of Amazon’s AZ series of neural edge processors, which are specialized for this particular approach to compression, was another important part of getting ASR to work on-device.

On computer chips, transferring data takes significantly longer than performing computations. The weights of a neural network are commonly represented as a matrix, a large grid of numbers. A matrix in which half the values are zero can be compressed to roughly half the size of one in which all values are nonzero.

The compressed matrix is reconstructed in the neural processor’s memory, with the zeroes filled back in. The processor’s hardware, however, is designed to detect zero values and skip computations involving them. As a result, the time savings of sparsification are realized in the hardware itself.
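A software analogue of this hardware behavior (a simplified coordinate-format sketch, not the processor's actual compression scheme): store only the nonzero weights with their positions, then multiply without ever touching the zeros.

```python
import numpy as np

def compress(matrix):
    """Store only the nonzero values and their positions (a simple
    coordinate format), roughly halving storage when half the
    weights are zero."""
    rows, cols = np.nonzero(matrix)
    return matrix.shape, rows, cols, matrix[rows, cols]

def matvec_skipping_zeros(compressed, x):
    """Multiply without ever touching the zero entries, mirroring
    hardware that detects and skips zero weights."""
    shape, rows, cols, vals = compressed
    y = np.zeros(shape[0])
    for r, c, v in zip(rows, cols, vals):
        y[r] += v * x[c]     # only nonzero weights contribute
    return y

W = np.array([[0.0, 2.0, 0.0],
              [1.0, 0.0, 0.0]])
x = np.array([1.0, 2.0, 3.0])
print(matvec_skipping_zeros(compress(W), x))  # same result as W @ x
```

Here two-thirds of the multiply-accumulate operations are skipped outright, which is the same saving the neural processor realizes in silicon.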

The introduction of on-device speech processing is a significant step toward bringing the benefits of “processing on the edge” to customers, and the researchers plan to continue investing in this area.