Researchers at Google AI recently introduced Lyra, a high-quality, very low-bitrate speech codec that makes voice communication available even on the slowest networks. The researchers combined traditional codec techniques with machine learning (ML) models trained on thousands of hours of data to create a new method for compressing and transmitting voice signals.
Voice and video calls are an essential part of everyday life. The real-time communication frameworks that make them possible depend on compression techniques called codecs, which encode and decode signals for transmission and storage. Codecs allow bandwidth-hungry applications to transmit data efficiently.
Persistent challenges in developing codecs, for both video and audio, are increasing quality, using less data, and minimizing latency for real-time communication. Today, video codecs can reach lower bitrates than some of the high-quality speech codecs in use.
Combining low-bitrate video and speech codecs can deliver a quality video call experience even on low-bandwidth networks. However, the lower the bitrate of an audio codec, the more robotic the voice signal tends to sound. Additionally, consistently high-quality, high-speed networks are not accessible to everyone, and even people in well-connected regions sometimes experience poor-quality, low-bandwidth, and congested network connections.
The basic architecture of the Lyra codec
Features (distinctive speech attributes) are extracted from speech every 40 ms and then compressed for transmission. The features are log mel spectrograms, a set of numbers representing the speech energy in different frequency bands. They have traditionally been used for their perceptual relevance, because they are modeled after the human auditory response. At the receiving end, a generative model uses these features to recreate the speech signal. In this respect, Lyra is very similar to traditional parametric codecs.
Traditional parametric codecs extract critical parameters from speech that can then be used to recreate the signal at the receiving end. They achieve low bitrates but often sound robotic and unnatural. A new generation of high-quality audio generative models (such as WaveNet and WaveNetEQ) has revolutionized the field by being able not only to distinguish between signals but also to generate completely new ones.
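To make the feature step concrete, here is a minimal sketch of computing log mel energies for a single 40 ms frame, using only the Python standard library. It is not Lyra's implementation: the 16 kHz sample rate, the 8 mel bands, the mel-scale formula, the naive DFT, and all function names are assumptions chosen for illustration.

```python
import cmath
import math

SAMPLE_RATE = 16000  # assumed sample rate in Hz (not stated in the article)
FRAME_MS = 40        # frame interval mentioned in the article
NUM_BANDS = 8        # assumed number of mel bands, for illustration only


def hz_to_mel(hz):
    # One common mel-scale formula; other variants exist.
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (slow but dependency-free)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def mel_filterbank(num_bands, n_fft, sample_rate):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    edges_hz = [mel_to_hz(top * i / (num_bands + 1))
                for i in range(num_bands + 2)]
    bin_hz = sample_rate / n_fft
    filters = []
    for b in range(num_bands):
        lo, mid, hi = edges_hz[b], edges_hz[b + 1], edges_hz[b + 2]
        weights = []
        for k in range(n_fft // 2 + 1):
            f = k * bin_hz
            if lo < f <= mid:
                weights.append((f - lo) / (mid - lo))  # rising slope
            elif mid < f < hi:
                weights.append((hi - f) / (hi - mid))  # falling slope
            else:
                weights.append(0.0)
        filters.append(weights)
    return filters


def log_mel_features(frame, sample_rate=SAMPLE_RATE, num_bands=NUM_BANDS):
    """Return one log energy value per mel band for a single frame."""
    spec = power_spectrum(frame)
    bank = mel_filterbank(num_bands, len(frame), sample_rate)
    # A small constant keeps the logarithm finite for empty bands.
    return [math.log(sum(w * p for w, p in zip(weights, spec)) + 1e-10)
            for weights in bank]


# One 40 ms frame of a 500 Hz test tone.
frame_len = SAMPLE_RATE * FRAME_MS // 1000
tone = [math.sin(2 * math.pi * 500.0 * t / SAMPLE_RATE)
        for t in range(frame_len)]
features = log_mel_features(tone)
```

Under these assumptions, each 40 ms frame of 640 samples is reduced to just 8 numbers before any further compression, which illustrates why a feature-based codec can send far less data than one that transmits the waveform itself.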
A novel approach to compression with Lyra
Using these models as a baseline, the team has developed a new model capable of reconstructing speech using minimal data. Lyra harnesses the power of these natural-sounding generative models to maintain the low bitrate of parametric codecs while achieving quality comparable to that of waveform codecs. Waveform codecs achieve their high quality by compressing and sending the signal sample by sample, which requires a higher bitrate and, in most instances, is not essential to attain natural-sounding speech.
Generative models are computationally complex, however. To address this, Lyra uses a cheaper recurrent generative model, a WaveRNN variation, that works at a lower rate but generates multiple signals in parallel in different frequency ranges. It later combines these into a single output signal at the desired sample rate.
This approach allows Lyra to run not only on cloud servers but also on-device on mid-range phones in real time. The generative model is trained on thousands of hours of speech data and optimized to recreate the input audio accurately.
The team trained Lyra on thousands of hours of audio from speakers of over 70 languages, using open-source audio libraries, and verified the audio quality with expert and crowdsourced listeners. The team states that one of Lyra’s design goals is to ensure universally accessible, high-quality audio experiences; training on a broad dataset helps make the codec robust to any situation it might encounter.
Currently, Lyra is designed to operate at 3 kbps. Listening tests show that Lyra outperforms any other codec at that bitrate and compares favorably to Opus at 8 kbps, a bandwidth reduction of more than 60%. Lyra can be used wherever bandwidth is insufficient for higher-bitrate codecs and existing low-bitrate codecs do not provide adequate quality. Users in such conditions now have an efficient low-bitrate codec that offers higher-quality audio than was previously available.
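The numbers above can be checked with a little arithmetic. The sketch below combines the 3 kbps bitrate with the 40 ms feature interval mentioned earlier to get the bit budget per frame, and computes the bandwidth saving against a reference codec. The 8 kbps reference is an assumption chosen for illustration, and the function name is hypothetical.

```python
BITRATE_BPS = 3000  # 3 kbps, from the article
FRAME_MS = 40       # features are extracted every 40 ms, from the article

# Bit budget available to encode one frame of features.
bits_per_frame = BITRATE_BPS * FRAME_MS // 1000
frames_per_second = 1000 // FRAME_MS


def bandwidth_reduction(reference_bps, codec_bps=BITRATE_BPS):
    """Fraction of bandwidth saved relative to a reference bitrate."""
    return 1 - codec_bps / reference_bps


# Assumed reference bitrate of 8 kbps (illustrative).
saving = bandwidth_reduction(8000)
print(f"{bits_per_frame} bits per frame, {frames_per_second} frames per second")
print(f"saving vs 8 kbps reference: {saving:.1%}")
```

With these inputs, each frame's features must fit in 120 bits, and the saving against an 8 kbps reference works out to 62.5%, consistent with the "more than 60%" figure.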
The team plans to continue optimizing Lyra’s performance and quality to make the technology as widely available as possible, including investigating acceleration via GPUs and TPUs. They also aim to research how these technologies can lead to a low-bitrate general-purpose audio codec that covers music and other non-speech use cases.