Google has upgraded its speech-to-text service with the help of new deep learning models

Photo credit: Unsplash

About a month ago, Google announced breakthroughs in both its text-to-speech and its speech-to-text technology. The company has been working on improving this technology since the Magenta project, and because the updated Speech-to-Text API is delivered as a cloud service, Google can roll out better results on Android phones for Google search, the Assistant and more.

The updated service uses brand-new models, each tuned for a particular kind of audio: phone calls, video, and short voice commands and search queries. It also supports more than 120 languages and variants, thanks to the wider availability of the new models and their features.
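As a rough illustration of how a developer might pick one of these models, the sketch below builds the `config` part of a request body for the Cloud Speech-to-Text v1 REST API. The field names `languageCode` and `model` and the model identifiers are taken from that API; the use-case labels and the `build_config` helper are assumptions made up for this example.

```python
# Model identifiers as used by the Cloud Speech-to-Text v1 REST API.
# The use-case labels on the left are this sketch's own invention.
MODEL_FOR_USE_CASE = {
    "phone_call": "phone_call",
    "video": "video",
    "voice_command": "command_and_search",
    "search": "command_and_search",
}


def build_config(use_case, language_code="en-US"):
    """Return the `config` section of a speech:recognize request body."""
    # Anything we do not recognize falls back to the general-purpose model.
    model = MODEL_FOR_USE_CASE.get(use_case, "default")
    return {"languageCode": language_code, "model": model}


if __name__ == "__main__":
    print(build_config("video"))
    print(build_config("voice_command", language_code="de-DE"))
```

Keeping the mapping in one dictionary means a new use case only needs one new entry, and unknown cases degrade to the default model rather than failing.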

The upgraded service also delivers some important new features for businesses. It should soon be easier to handle phone meetings, video transcription and call-center support, all with substantially improved accuracy. Capturing audio or taking commands from multiple speakers also works much more effectively thanks to the changes in the technology. Speech-to-text now filters out more background noise than before, which further improves accuracy.


Google has also improved the way it handles audio from different sources. Because media types differ in bandwidth and signal-to-noise ratio, a model optimized for each media type can give better results. Phone audio is typically sampled at 8 kHz, while audio from video is typically sampled at 16 kHz. Models tuned for these lower-quality sources can produce better results without any hardware upgrade.
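To see why the sampling rate matters, recall the Nyquist limit: a signal sampled at a given rate can only represent frequencies up to half that rate. The short sketch below applies that rule to the two rates mentioned above; the function name is an assumption chosen for this example.

```python
def max_captured_frequency_hz(sample_rate_hz):
    """Nyquist limit: the highest frequency a given sampling rate can represent."""
    return sample_rate_hz / 2


if __name__ == "__main__":
    for label, rate in (("phone call", 8000), ("video", 16000)):
        limit = max_captured_frequency_hz(rate)
        print(f"{label}: sampled at {rate} Hz, keeps content up to {limit:.0f} Hz")
```

Phone audio therefore loses everything above 4 kHz, which includes part of the energy in consonants such as "s" and "f". That is one reason a model trained specifically on phone audio can help.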


Google also used real-world audio samples to improve its models. An opt-in program called data logging gives customers the chance to share audio, including audio from phone calls, with Google to help improve the speech-to-text models. Combining deep learning with the logged data reduced errors: according to Google, the enhanced phone call model makes 54% fewer word errors, and the video model makes 64% fewer errors, than before.
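Figures like these are usually relative reductions in word error rate (WER), the number of word-level edits needed to turn the transcript into the reference, divided by the reference length. The article does not say exactly how Google measured its numbers, so the following is only a minimal sketch of the standard calculation, with made-up sentences and rates.

```python
def word_errors(reference, hypothesis):
    """Word-level edit distance (substitutions, insertions, deletions)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def word_error_rate(reference, hypothesis):
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference must contain at least one word")
    return word_errors(reference, hypothesis) / ref_len


def relative_reduction(old_wer, new_wer):
    """Fraction of errors removed when moving from old_wer to new_wer."""
    return (old_wer - new_wer) / old_wer


if __name__ == "__main__":
    # Illustrative sentences, not real transcripts.
    print(word_error_rate("please call the front desk", "please call front desk"))
    # Illustrative rates, not Google's measurements.
    print(relative_reduction(0.2, 0.1))
```

Note that "54% fewer errors" in this sense does not mean 54 percentage points: a WER falling from 20% to about 9% is already a reduction of roughly that size.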

In the future, overall speech-to-text quality may improve further as audio is supplied in lossless codecs like FLAC, sampled at around 16 kHz. Cleaner source audio of this kind leaves less work for noise reduction and automatic gain control, which can improve transcription over time.

Google's speech-to-text is also now strongly focused on punctuation prediction. This is one of the most challenging aspects of transcription, and the Speech-to-Text API now offers automatic punctuation in the transcribed text, even for long audio sequences. This had not been possible before the use of deep learning.
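In the Cloud Speech-to-Text v1 REST API, automatic punctuation is switched on with the `enableAutomaticPunctuation` field of the request config. The sketch below shows such a request body and a small helper that joins the transcripts in a response. The bucket path, the helper name and the sample response text are all hypothetical and only illustrate the shape of the data.

```python
import json

# Request body for speech:recognize. The gs:// path is a made-up placeholder.
request_body = {
    "config": {
        "encoding": "FLAC",
        "sampleRateHertz": 16000,
        "languageCode": "en-US",
        "enableAutomaticPunctuation": True,
    },
    "audio": {"uri": "gs://example-bucket/meeting.flac"},
}


def full_transcript(response):
    """Join the top alternative of every result into one string."""
    parts = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if alternatives:
            parts.append(alternatives[0]["transcript"].strip())
    return " ".join(parts)


# Hand-written response in the API's documented shape, not real output.
sample_response = {
    "results": [
        {"alternatives": [{"transcript": "Hello, how are you?"}]},
        {"alternatives": [{"transcript": " I'm fine."}]},
    ]
}

if __name__ == "__main__":
    print(json.dumps(request_body, indent=2))
    print(full_transcript(sample_response))
```

Long recordings come back as several results, so joining the top alternative of each one is a simple way to rebuild the punctuated text in order.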

As Google continues to develop this technology, it may be only a matter of time before we can go without a keyboard.


