Google recently launched Hum to Search, a new machine-learned system within Google Search that helps users find a song by humming it. The approach produces an embedding of a melody directly from a song's spectrogram, without creating an intermediate representation. This allows the model to match a hummed tune against the original polyphonic recordings without needing a MIDI (Musical Instrument Digital Interface) version of each track or any other complex hand-engineered logic to extract the melody.
One of the significant challenges in recognizing a hummed melody is that a hummed tune often contains relatively little information, as in a hummed rendition of Bella Ciao. The difference between the hummed version and the original studio recording can be visualized using spectrograms, as shown below:
|Visualization of a hummed clip and a matching studio recording.|
Given the image on the left, the model needs to locate the audio corresponding to the right-hand image. To do this, it must learn to focus on the dominant melody and ignore background vocals, instruments, voice timbre, and other noise. To spot the dominant melody shared by the two spectrograms, one can look for similarities in the lines towards the bottom of the images.
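As an illustration of the kind of input representation involved, the sketch below computes a log-magnitude spectrogram of a toy gliding tone using SciPy. This is not Google's pipeline; the sample rate and window parameters are assumptions chosen for the example.

```python
# Illustrative only: compute a log-magnitude spectrogram, the kind of
# time-frequency representation the melody-matching model consumes.
import numpy as np
from scipy.signal import spectrogram

sr = 16000  # assumed sample rate
t = np.linspace(0, 2.0, 2 * sr, endpoint=False)
# A toy "melody": a tone gliding upward from 440 Hz.
tone = np.sin(2 * np.pi * (440 + 110 * t) * t)

freqs, times, sxx = spectrogram(tone, fs=sr, nperseg=512, noverlap=256)
log_sxx = 10 * np.log10(sxx + 1e-10)  # dB scale, as in typical spectrogram plots

print(log_sxx.shape)  # (frequency bins, time frames)
```

In a plot of `log_sxx`, the rising melody appears as a bright line, which is exactly the structure the model must learn to attend to while ignoring everything else.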
Machine Learning Behind the Feature
The initial step in developing Hum to Search was modifying the music-recognition models used in Now Playing and Sound Search to work with hummed recordings. To this end, a neural network is trained on input pairs (hummed or sung audio paired with the corresponding recorded audio) to produce an embedding for each input, which is later used for matching.
|Training setup for the neural network|
To recognize humming, the network must produce embeddings in which pairs of audio containing the same melody end up close to each other, even when they differ in instrumental accompaniment and singing voice. The trained model can then generate an embedding for a hummed tune that lies close to the embedding of the reference recording.
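One common way to train such pairwise-similarity embeddings is a triplet loss, sketched below in NumPy. The loss function, margin value, and toy embeddings are assumptions for illustration, not Google's actual training code.

```python
# Hedged sketch of a pairwise-embedding objective: a triplet loss pulls
# embeddings of the same melody together and pushes different melodies apart.
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.3):
    """anchor/positive share a melody; negative is a different song."""
    d_pos = np.linalg.norm(anchor - positive)  # same-melody distance
    d_neg = np.linalg.norm(anchor - negative)  # different-melody distance
    return max(0.0, d_pos - d_neg + margin)

# Toy embeddings (in practice, produced by the network from spectrograms).
hummed = np.array([0.9, 0.1, 0.0])
studio = np.array([1.0, 0.0, 0.0])   # same melody, different timbre
other  = np.array([0.0, 1.0, 0.0])   # unrelated song

print(triplet_loss(hummed, studio, other))  # 0.0: the pair is already well separated
```

The loss is zero once the same-melody pair is closer than the different-melody pair by at least the margin, which is precisely the geometry the matching step relies on.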
Training of the model
- The first challenge in training the model was obtaining training data. To address it, Google augmented the audio during training, for example by randomly varying the pitch or tempo of the sung input. The resulting model worked well enough for sung melodies, but not for hummed or whistled ones.
- To improve performance on humming and whistling, the team used SPICE, a pitch-extraction model, to produce a melody consisting of discrete audio tones, generating additional training data of simulated hummed melodies from the existing audio dataset.
- Later, the simple tone generator was replaced with a neural network that produces audio resembling an actual hummed or whistled tune; for example, a sung input clip can be transformed into a humming clip or a whistling clip.
- Finally, the training data was expanded by mixing and matching the audio samples. For example, when similar clips from two different singers were available, those two clips were aligned using the preliminary models, giving the model an additional pair of audio clips representing the same melody.
|Generating hummed audio from sung audio|
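To make the SPICE-based data-generation step concrete, the sketch below renders a per-frame pitch track (the kind of output a pitch-extraction model produces) into a plain tone that simulates a hummed version of the melody. The sine oscillator and all parameter values are assumptions; the production system later replaced such a simple generator with a neural network.

```python
# Hedged sketch: turn a per-frame pitch track (Hz) into a sine-tone "hum",
# simulating hummed training data from an extracted melody.
import numpy as np

def render_hum(pitch_hz, frame_rate=100, sr=16000):
    """Render a pitch contour as audio by integrating frequency into phase."""
    hop = sr // frame_rate
    f = np.repeat(pitch_hz, hop)           # expand pitch track to sample rate
    phase = 2 * np.pi * np.cumsum(f) / sr  # integrate frequency to get phase
    return np.sin(phase)

# Toy pitch track: one second gliding from A4 (440 Hz) to E5 (~659 Hz).
pitch = np.linspace(440.0, 659.3, 100)
hum = render_hum(pitch)
print(hum.shape)  # one second of audio at 16 kHz
```

Feeding such synthetic "hums" as the query side of training pairs lets the network see hummed-style timbre without requiring large corpora of real hummed recordings.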
This model still required further refinement. With those changes applied, the current system achieves good accuracy on a song database of over half a million songs, which is updated continuously.
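With embeddings precomputed for every song in the database, matching a hummed query reduces to a nearest-neighbor search in embedding space. The brute-force cosine-similarity search below is an illustrative assumption; at half a million songs, a production system would use an approximate nearest-neighbor index instead.

```python
# Hedged sketch of the matching step: find the database song whose embedding
# is most similar (by cosine similarity) to the hummed query's embedding.
import numpy as np

def best_match(query, db_embeddings, db_titles):
    q = query / np.linalg.norm(query)
    db = db_embeddings / np.linalg.norm(db_embeddings, axis=1, keepdims=True)
    scores = db @ q                        # cosine similarity per song
    return db_titles[int(np.argmax(scores))]

# Toy 2-D embeddings; real ones are high-dimensional network outputs.
db = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
titles = ["Bella Ciao", "Song B", "Song C"]
print(best_match(np.array([0.9, 0.1]), db, titles))  # → Bella Ciao
```

Because the embeddings were trained so that the same melody lands close together regardless of timbre, a noisy hum can still retrieve the original studio recording.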
|Hum to Search in the Google App|
To try this feature,
- Open the latest version of the Google app.
- Tap the mic icon and ask, “what’s this song?”, or tap the “Search a song” button.
- Hum, sing, or whistle the tune.
- Hum to Search then finds and plays back the song without you having to type its name.