Researchers From Columbia University Propose ‘Neural Voice Camouflage’: An Adversarial Attack-Based Approach That Disrupts Automatic Speech Recognition Systems In Real-Time

This Article is written as a summay by Marktechpost Staff based on the Research Paper 'REAL-TIME NEURAL VOICE CAMOUFLAGE'. All Credit For This Research Goes To The Researchers of This Project. Check out the paper and blog post.

Please Don't Forget To Join Our ML Subreddit

Have you ever had the uneasy sense that someone is listening in on your every word? This is because it may be true. Companies have been employing “bossware” to listen to their employees while they are near their computers since the dawn of time. Several “spyware” apps are available that can record phone calls. Automatic Speech Recognition models like Amazon’s Echo and Apple’s Siri may record your daily conversation based on the voice commands. To address this critical problem, a group of researchers from Columbia University has devised a new method called Neural Voice Camouflage. The crux behind the technology is that it creates bespoke audio noise in the background as a person speaks, which confuses the artificial intelligence model that transcribes the recorded sounds. The new system utilizes an “adversarial attack” method, in which machine learning is used to change sounds in such a manner that other AI models misinterpret them as something else. In some ways, it uses a machine learning model to deceive another. This procedure, however, is not as simple as it may appear because the model must first process the entire sound clip before knowing how to change it, rendering it non-functional in real-time. Several research groups have attempted to construct robust models that can break neural networks by operating in real-time throughout the previous decade. However, they have failed to achieve both prerequisites.

As a result of their latest study, the team has successfully trained a neural network system inspired by the brain to predict the future. Over several hours of recorded speech, it was honed to process 2-second audio samples on the fly and conceal what is about to be said next. The algorithm considers what was just stated and the features of the speaker’s voice to generate sounds that interrupt a variety of conceivable words. Humans do not have any trouble recognizing the spoken words because the audio disguise sounds like background noise. Machines, on the other hand, are not the same. The technology improved the word error rate of the ASR program from 11.3 percent to 80.2 percent. Speech disguised by white noise and a competitive adversarial approach had mistake rates of only 12.8 and 20.5 percent, respectively. Even after being trained to transcribe speech affected by Neural Voice Camouflage, the ASR system’s mistake rate remained 52.5 percent. Short words were the most difficult to disrupt, as they are the minor revealing aspects of a conversation.

Source: https://openreview.net/pdf?id=qj1IZ-6TInc

As a part of quantitative studies, the researchers tested the approach in the real world by playing a speech recording mixed with the camouflage through speakers in the same room as a microphone. The strategy worked well. Because many ASR models use language models to predict outcomes, the system was also tested in that context. Compared to an ASR system with a defense mechanism, the system’s attack outperforms, making it incredibly successful at removing white noise. The team’s work was also recently presented in a paper at the coveted International Conference on Learning Representations.

According to the leading research scientist, this experiment is the first step toward ensuring privacy in the face of AI. The ultimate goal is to create technologies that protect user privacy and give people control over their voice data. Other applications that require real-time processing, such as driverless vehicles, can benefit from the concept. It is one step closer to accurately simulating how the brain works. Combining a classic machine problem of future prediction with the challenge of adversarial machine learning has led to the discovery of new study domains in the field. It may be argued that audio camouflage is desperately needed, as practically everyone today is vulnerable to security algorithms misinterpreting their speech.