Get Ready for a Sound Revolution in AI: 2023 is the Year of Generative Sound Waves

The previous year saw a surge of work in Computer Vision (CV) and Natural Language Processing (NLP). Building on that momentum, academics worldwide are now exploring the benefits that deep learning and large language models (LLMs) might bring to audio generation. In the last few weeks alone, several new papers have been published, each introducing a model or dataset that could make further research in this area much easier.


The first model is MusicLM, developed by researchers at Google and IRCAM – Sorbonne Université. MusicLM produces high-quality music from text descriptions like “a soothing violin melody supported by a distorted guitar riff.” It casts conditional music generation as a hierarchical sequence-to-sequence modeling task and produces 24 kHz music that remains consistent over several minutes. MusicLM can also be conditioned on both text and melody, allowing it to transform a hummed or whistled tune so that its pitch and tempo match the style described in a text caption. Alongside the model, the team released MusicCaps, a publicly available dataset of 5.5k music-text pairs annotated with detailed human-written descriptions.

MusicLM builds on three pre-trained modules: SoundStream, w2v-BERT, and MuLan. Of the three, the CLIP-like model MuLan is particularly intriguing because it learns to encode paired audio and text close to each other in a shared embedding space. As the paper “MusicLM: Generating Music From Text” notes, MuLan lets MusicLM overcome the shortage of paired data and learn from a large audio-only corpus.
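The core idea behind a CLIP-like model such as MuLan is a contrastive objective: an audio tower and a text tower are trained so that embeddings of matching (audio, caption) pairs land close together while mismatched pairs are pushed apart. The sketch below illustrates that symmetric contrastive loss in plain NumPy; the encoders are replaced by random vectors and all names (`clip_style_loss`, the temperature value) are illustrative, not MuLan's actual implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def clip_style_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired (audio, text) embeddings.
    Matching pairs sit on the diagonal of the similarity matrix."""
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = a @ t.T / temperature          # (batch, batch) cosine similarities
    labels = np.arange(len(logits))
    def xent(lg):
        # numerically stable cross-entropy toward the diagonal targets
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()
    # average the audio->text and text->audio directions
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 128))               # stand-in audio-tower outputs
text = audio + 0.1 * rng.normal(size=(4, 128))  # roughly aligned text-tower outputs
loss_aligned = clip_style_loss(audio, text)
loss_shuffled = clip_style_loss(audio, np.roll(text, 1, axis=0))
```

Aligned pairs yield a much lower loss than shuffled ones, which is exactly the pressure that places coupled audio and text near each other in the embedding space.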



Another Google research project proposes SingSong, a system that generates instrumental music to accompany input vocal audio. In other words, the generated instrumental can be mixed directly with the input vocals to create coherent music that includes them.

SingSong takes advantage of developments in two important areas of music technology: source separation and generative audio modeling. Using an off-the-shelf source separation technique from prior work, the team split a massive musical dataset of one million tracks into aligned pairs of vocal and instrumental sources, which served as parallel training data. They then repurposed AudioLM for conditional “audio-to-audio” generation of instrumentals given vocals by training it in a supervised fashion on the source-separated data. AudioLM is a generative audio model built on a hierarchy of intermediate representations, originally designed for unconditional audio generation.

In their paper “SingSong: Generating musical accompaniments from singing,” the team suggests two featurization strategies for the input vocals to enhance generalization: 

  1. Adding noise to vocal inputs to hide artifacts
  2. Only using the coarsest intermediate representations from AudioLM as conditioning input. 
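The first strategy is straightforward to picture: source-separated vocals contain faint residue of the original instrumental, and a model could cheat by reading the accompaniment out of those artifacts instead of learning from the voice itself. Adding white noise at a controlled signal-to-noise ratio masks that residue. Below is a minimal sketch of such a corruption step; the function name and the 20 dB setting are illustrative assumptions, not SingSong's exact recipe.

```python
import numpy as np

def add_noise_at_snr(vocals, snr_db, rng):
    """Corrupt source-separated vocals with white noise at a target SNR (dB),
    masking separation artifacts the accompaniment model could latch onto."""
    signal_power = np.mean(vocals ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(scale=np.sqrt(noise_power), size=vocals.shape)
    return vocals + noise

rng = np.random.default_rng(0)
# 1 second of a 440 Hz tone at 16 kHz as a stand-in vocal signal
vocals = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noisy = add_noise_at_snr(vocals, snr_db=20.0, rng=rng)
```

The second strategy works on the model side rather than the data side: conditioning only on AudioLM's coarsest representations discards the fine acoustic detail where such artifacts live.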

Together, these enhancements improve performance on isolated vocals by 55 percent relative to the standard AudioLM setup. Listeners chose SingSong instrumentals 66% of the time over instrumentals produced by a retrieval baseline. More strikingly, listeners favored SingSong instrumentals over the ground truth 34% of the time.


A collaborative study by researchers at ETH Zürich and the Max Planck Institute for Intelligent Systems introduces Moûsai, a text-conditional cascading diffusion model that can generate long-context 48 kHz stereo music, extending beyond the minute mark, across a wide range of styles.

As described in their paper “Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion,” the researchers employ a two-stage cascading diffusion design in the Moûsai model.

  • The first stage employs a novel diffusion autoencoder to compress the audio waveform by a factor of 64 while maintaining a moderately high level of quality. 
  • The second stage learns to generate the reduced latent representations conditioned on the text embedding generated by a pretrained language model. 

Both stages use an optimized version of an efficient U-Net. The findings show that inference is fast enough to make the model practical in the real world. Likewise, the entire system can be trained and run on modest resources, such as those available at most universities, with each stage taking around a week to train on a single A100 GPU.
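The key number in the two-stage design is the 64× compression: one minute of 48 kHz stereo audio has 2,880,000 samples per channel, but the second-stage diffusion model only ever works in a latent space 64 times smaller. The shape bookkeeping below makes that concrete; the average-pooling "encoder" is a toy stand-in for Moûsai's diffusion autoencoder and tracks only tensor sizes, not audio quality.

```python
import numpy as np

SR = 48_000        # sample rate reported for Moûsai
COMPRESSION = 64   # stage-1 compression factor reported in the paper

def encode(waveform):
    """Toy stage-1 stand-in: average-pool the waveform by the compression
    factor. (The real model is a diffusion autoencoder; this only shows
    how much shorter the sequence the stage-2 model must handle becomes.)"""
    n = waveform.shape[-1] // COMPRESSION * COMPRESSION
    pooled = waveform[..., :n].reshape(*waveform.shape[:-1], -1, COMPRESSION)
    return pooled.mean(-1)

minute = np.zeros((2, SR * 60))   # 1 minute of stereo audio: (2, 2_880_000)
latent = encode(minute)           # stage 2 diffuses in this reduced space
```

Generating directly over 2.88 million samples per channel is what makes minute-scale raw-waveform modeling so expensive; cutting the sequence to 45,000 latent steps is what lets the whole system train on a single A100 per stage.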


The University of Surrey, in collaboration with Imperial College London, introduced AudioLDM, a text-to-audio (TTA) system built on continuous latent diffusion models (LDMs) that achieves state-of-the-art generation quality along with computational efficiency and text-conditioned audio manipulation. Their paper “AudioLDM: Text-to-Audio Generation with Latent Diffusion Models” demonstrates that, with the help of a mel-spectrogram-based variational autoencoder (VAE), AudioLDM can learn to construct the audio prior in a latent space.

Rather than relying on language-audio pairs to train the LDMs, the researchers use CLAP latents to enable TTA generation. Their experiments show that a high-quality, computationally economical TTA system can be trained using only audio in LDM training, more effectively than using audio-text data pairs.
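The trick that makes audio-only training possible is that CLAP, like MuLan, maps audio and text into one shared embedding space. During training the diffusion model is conditioned on the CLAP *audio* embedding of each clip (no caption needed); at sampling time the CLAP *text* embedding of a prompt drops into the same conditioning slot. The sketch below shows only that interchangeability; the linear "towers" and the `denoise_step` update are toy stand-ins, not CLAP or AudioLDM internals.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 512                               # shared embedding dimension (illustrative)
W_audio = rng.normal(size=(64, D))    # stand-in for CLAP's audio tower
W_text = rng.normal(size=(32, D))     # stand-in for CLAP's text tower

def embed(x, W):
    e = x @ W
    return e / np.linalg.norm(e)      # both towers land in the same unit sphere

def denoise_step(latent, cond):
    # Toy stand-in for one conditioned LDM denoising update.
    return latent - 0.1 * latent + 0.01 * cond[: latent.shape[0]]

def train_step(audio_latent, audio_feat):
    # Training never sees a caption: the condition is the *audio* embedding.
    return denoise_step(audio_latent, embed(audio_feat, W_audio))

def generate(text_feat, steps=10):
    # At sampling time, the *text* embedding fills the same conditioning slot.
    cond = embed(text_feat, W_text)
    latent = rng.normal(size=(8,))
    for _ in range(steps):
        latent = denoise_step(latent, cond)
    return latent

out = generate(rng.normal(size=(32,)))
```

Because both embeddings live in the same space, the denoiser trained on audio conditions transfers to text conditions without ever seeing a paired caption.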

When tested on the AudioCaps dataset, the proposed AudioLDM outperforms the DiffSound baseline by a wide margin, achieving state-of-the-art TTA performance with a Fréchet distance (FD) of 23.31. The technique also permits zero-shot audio manipulations during sampling.
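Fréchet-style metrics like the FD reported here score a generator by fitting Gaussians to embeddings of real and generated audio and measuring the distance between the two distributions: lower means the generated set statistically resembles the real one. The snippet below computes that distance in the simplified diagonal-covariance case (the general formula needs a matrix square root); it is a sketch of the metric's form, not the exact embedding network or evaluation pipeline the paper used.

```python
import numpy as np

def frechet_distance_diag(mu1, var1, mu2, var2):
    """Fréchet distance between two Gaussians with diagonal covariances:
    ||mu1 - mu2||^2 + sum(var1 + var2 - 2 * sqrt(var1 * var2))."""
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2)))

# identical distributions score 0; shifting one mean by 1 unit scores 1
same = frechet_distance_diag(np.zeros(2), np.ones(2), np.zeros(2), np.ones(2))
shifted = frechet_distance_diag(np.zeros(2), np.ones(2),
                                np.array([1.0, 0.0]), np.ones(2))
```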


Lastly, the University of Oxford and the University of Bristol used audio from EPIC-KITCHENS-100 to create EPIC-SOUNDS, a large-scale dataset of everyday sounds. EPIC-SOUNDS comprises 100 hours of audio drawn from 700 videos recorded in 45 residential kitchens, with a total of 117,553 sound events: 78,366 categorized events across 44 classes and 39,187 non-categorized events. Sound classes are defined purely from auditory descriptions, making the dataset well suited to acoustic tasks such as audio recognition and sound event detection.

Music-generating technology may profoundly transform music culture and redefine economic relationships among stakeholders. Many researchers have raised concerns that these models pose deep hazards alongside benefits such as broader access to creative engagement in music. In particular, the human voice as a singing instrument carries perhaps the strongest connotations of personal identity of any musical instrument.

To avoid the drawbacks of systems that generate music from scratch or mimic artists' identities, many researchers believe these models should rely on user input (such as singing) to produce music, keeping individuals' identities intact in the output.

Researchers also believe these recent studies will change the industry and make music creators more productive, allowing them to generate musical ideas and concepts faster, experiment with new sounds and styles, and automate repetitive tasks. At the same time, human musicians bring a level of artistry and nuance to music that a machine cannot replicate.

Check out the papers on MusicLM, SingSong, Moûsai, AudioLDM, and EPIC-SOUNDS. All credit for this research goes to the researchers on these projects. Also, don’t forget to join our 13k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Bhubaneswar. She is a Data Science enthusiast and has a keen interest in the scope of application of artificial intelligence in various fields. She is passionate about exploring new advancements in technologies and their real-life applications.