Researchers at Google recently introduced LEAF (LEarnable Audio Frontend), a neural network that can be initialized to approximate mel filterbanks. It offers an alternative way to build learnable spectrograms for audio understanding tasks, and it can be trained jointly with any audio classifier to adapt to the given task while adding only a few parameters to the overall model.
Machine learning models for audio understanding have made tremendous progress in recent years. By learning parameters from data, the field has progressively shifted from composite, handcrafted systems to today's deep neural classifiers, which are used for speech recognition, music understanding, and classifying animal vocalizations such as bird calls.
Even so, neural networks for audio classification are rarely trained on raw audio waveforms. They typically depend on data pre-processed into mel filterbanks: handcrafted, mel-scaled spectrograms designed to replicate some aspects of the human auditory response.
Using mel filterbanks as input for ML tasks is limited by the inherent biases of fixed features. Although matching human perception provides good inductive biases for some application domains (such as speech recognition and music understanding), these biases may be detrimental in fields where imitating the human ear is not essential. To perform well on such tasks, mel filterbanks must be tailored to the task of interest, which requires expert domain knowledge.
Traditionally, creating a mel filterbank starts with windowing, which captures the sound's time-variability by cutting the signal into short segments of fixed duration. The next step is filtering: the windowed segments are passed through a bank of fixed frequency filters that replicate the human logarithmic sensitivity to pitch. Mel filterbanks devote more resolution to the low-frequency range because humans are more sensitive to variations at low frequencies than at high frequencies. Lastly, the signal is compressed to mimic the ear's logarithmic sensitivity to loudness.
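The three fixed steps above (windowing, filtering, compression) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used by the researchers: the Hann window, the HTK-style mel formula, the triangular filter shape, the function names, and the values of `frame_len`, `hop`, and `n_filters` are all choices made here for the example. It uses only the standard library and a naive DFT to stay dependency-free; a real system would use an FFT from a numerical library.

```python
import cmath
import math


def hz_to_mel(hz):
    # One common mel formula (HTK style); other variants exist.
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def frame_signal(signal, frame_len, hop):
    # Step 1, windowing: cut the signal into short, fixed-length segments,
    # each weighted by a Hann window (an assumed, typical choice).
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append([signal[start + n] * window[n] for n in range(frame_len)])
    return frames


def power_spectrum(frame):
    # Naive O(N^2) DFT, kept for clarity; returns power for non-negative bins.
    n_fft = len(frame)
    spectrum = []
    for k in range(n_fft // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * n / n_fft)
                  for n, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def mel_filterbank(n_filters, n_fft, sample_rate):
    # Step 2, filtering: triangular filters whose edges are equally spaced
    # on the mel scale, so they are narrow at low and wide at high frequencies.
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(mel_max * i / (n_filters + 1))
                for i in range(n_filters + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (mid - lo)
            falling = (hi - f) / (hi - mid)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank, edges_hz


def log_mel_spectrogram(signal, sample_rate, frame_len=64, hop=32, n_filters=8):
    bank, _ = mel_filterbank(n_filters, frame_len, sample_rate)
    features = []
    for frame in frame_signal(signal, frame_len, hop):
        power = power_spectrum(frame)
        energies = [sum(w * p for w, p in zip(row, power)) for row in bank]
        # Step 3, compression: the log mimics logarithmic loudness perception.
        features.append([math.log(e + 1e-6) for e in energies])
    return features


# Usage: a 440 Hz tone sampled at 8 kHz (illustrative values).
sample_rate = 8000
tone = [math.sin(2.0 * math.pi * 440.0 * n / sample_rate) for n in range(512)]
features = log_mel_spectrogram(tone, sample_rate)
print(len(features), len(features[0]))  # number of frames, number of mel bands
```

Every quantity in this sketch (the window, the filter edges, the log) is fixed in advance, which is exactly what a learnable frontend sets out to change.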
A Parameter-Efficient Alternative to Fixed Features
LEAF loosely follows the traditional approach to generating mel filterbanks. However, it replaces each of the fixed operations (the windowing layer, the filtering layer, and the compression function) with a learned counterpart. LEAF's output is a time-frequency representation (a spectrogram) similar to mel filterbanks but entirely learnable, so LEAF can learn the scale best suited to the task of interest. Any model that can be trained on mel filterbanks as input features can also be trained on LEAF spectrograms. LEAF can be initialized randomly or to approximate mel filterbanks, which are a better starting point. It can then be trained jointly with any classifier to adapt to the task.
Replacing fixed features, which involve no learnable parameters, with a trainable system can significantly increase the number of parameters to optimize. To avoid this problem, LEAF uses Gabor convolution layers, which have only two parameters per filter, whereas a filter in a standard convolution layer has more than 400. As a result, LEAF accounts for only 0.01% of the total parameters even when paired with a small classifier.
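To make the parameter saving concrete, here is a minimal sketch of a Gabor filter in Python. It assumes the two parameters per filter are a center frequency and a bandwidth (expressed here as the width `sigma` of a Gaussian envelope), as in a standard Gabor filter. The function names, the kernel size of 401 taps, and the count of 40 filters are illustrative assumptions, not values taken from the article.

```python
import cmath
import math


def gabor_kernel(center_freq, sigma, kernel_size=401):
    """Complex Gabor filter: a Gaussian envelope times a complex sinusoid.

    center_freq: center frequency in cycles per sample (0 to 0.5).
    sigma: width of the Gaussian envelope in samples (larger = narrower band).
    kernel_size: number of taps, assumed odd; 401 is an assumed value.
    """
    half = kernel_size // 2
    kernel = []
    for t in range(-half, half + 1):
        envelope = (math.exp(-(t * t) / (2.0 * sigma * sigma))
                    / (math.sqrt(2.0 * math.pi) * sigma))
        carrier = cmath.exp(2j * math.pi * center_freq * t)
        kernel.append(envelope * carrier)
    return kernel


def response(kernel, freq):
    """Magnitude of the filter's frequency response at freq (cycles per sample)."""
    half = len(kernel) // 2
    total = sum(c * cmath.exp(-2j * math.pi * freq * (i - half))
                for i, c in enumerate(kernel))
    return abs(total)


# Parameter count: every tap of the Gabor kernel is computed from just two
# numbers, whereas a standard convolution learns each tap independently.
n_filters = 40      # assumed number of filters
kernel_size = 401   # assumed number of taps per filter
gabor_params = n_filters * 2
standard_conv_params = n_filters * kernel_size
print(gabor_params, standard_conv_params)  # prints: 80 16040

# The filter passes frequencies near its center and rejects distant ones.
kernel = gabor_kernel(center_freq=0.1, sigma=20.0, kernel_size=kernel_size)
print(response(kernel, 0.1), response(kernel, 0.2))
```

In a trainable frontend, `center_freq` and `sigma` would be the quantities updated by gradient descent, and the full kernel would be regenerated from them at each step. Constraining the filter shape this way is what keeps the parameter count small.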
The team applied LEAF to diverse audio classification tasks and observed that it outperformed both mel filterbanks and previous learnable frontends, such as Time-Domain Filterbanks and Wavegram.
Mel filterbanks achieved an average accuracy of 73.9% across the different tasks, while LEAF attained 76.9%. LEAF can also be trained in a multi-task setting, so that a single LEAF parametrization works well across all the tasks. Combined with a sizeable audio classifier, LEAF achieved state-of-the-art performance on the challenging AudioSet benchmark, with a d-prime score of 2.74.
The researchers note that the scope of audio understanding tasks keeps growing, and that adapting mel filterbanks to every new task would require a notable amount of hand-tuning and experimentation. LEAF instead provides a drop-in replacement for these fixed features with minimal task-specific adjustments. The team hopes that LEAF will be helpful in the development of models for new audio understanding tasks.