
Tango 2: The New Frontier in Text-to-Audio Synthesis and Its Superior Performance Metrics


With the introduction of brilliant generative Artificial Intelligence models such as ChatGPT, Gemini, and Bard, the demand for AI-generated content is rising across a number of industries, especially multimedia. Meeting this demand requires effective text-to-audio, text-to-image, and text-to-video models that can produce high-quality material or prototypes quickly. It is equally important to improve how faithfully these models follow their input prompts.

To align Large Language Model (LLM) responses with human preferences, supervised fine-tuning-based Direct Preference Optimization (DPO) has recently emerged as a viable and reliable substitute for Reinforcement Learning from Human Feedback (RLHF). This method has since been adapted to diffusion models in order to match denoised outputs to human preferences.
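At its core, the DPO objective scores a preferred ("winner") output against an undesirable ("loser") one via log-probability ratios between the model being tuned and a frozen reference model. Below is a minimal pure-Python sketch of the standard DPO loss; the variable names are illustrative, and the paper adapts this idea to diffusion denoising losses rather than raw log-probabilities:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * margin), where the margin
    measures how much more the policy prefers the winner over the loser
    than the frozen reference model does.

    logp_w / logp_l: log-probabilities of the preferred / undesirable
    output under the model being fine-tuned; ref_* are the same
    quantities under the reference model.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# The loss shrinks when the policy favors the winner more than the
# reference does, and grows when it favors the loser instead.
low = dpo_loss(logp_w=-10.0, logp_l=-12.0, ref_logp_w=-11.0, ref_logp_l=-11.0)
high = dpo_loss(logp_w=-12.0, logp_l=-10.0, ref_logp_w=-11.0, ref_logp_l=-11.0)
assert low < high
```

Minimizing this loss pushes the fine-tuned model toward the preferred outputs without drifting too far from the reference, which is what makes it a practical alternative to full RLHF.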

A team of researchers has applied the DPO-diffusion approach in a recent study to improve the semantic alignment between a text-to-audio model’s output audio and its input prompts. They used a DPO-diffusion loss to optimize Tango, a publicly available text-to-audio latent diffusion model, on a synthesized preference dataset. This dataset, called Audio-Alpaca, pairs a variety of audio prompts with preferred and undesirable audio outputs.

While the preferred audios faithfully capture their corresponding text descriptions, the undesirable audios exhibit defects such as missing concepts, incorrect temporal order, or excessive noise. Techniques for producing undesirable audios include perturbing the descriptions and using adversarial filtering to identify generations that align poorly with the prompt, as measured by CLAP score.
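The prompt-perturbation idea can be illustrated with simple text edits. The two strategies below (dropping a concept, swapping temporal order) are hypothetical simplifications of the kinds of defects the paper describes, not its exact implementation:

```python
def drop_concept(prompt, sep=" and "):
    """Remove one concept from a multi-concept prompt (missing-concept defect)."""
    parts = prompt.split(sep)
    return sep.join(parts[:-1]) if len(parts) > 1 else prompt

def swap_order(prompt, marker=" followed by "):
    """Reverse the temporal order of two events (wrong-order defect)."""
    parts = prompt.split(marker)
    return marker.join(reversed(parts)) if len(parts) == 2 else prompt

prompt = "a dog barking followed by a car engine starting"
print(swap_order(prompt))   # "a car engine starting followed by a dog barking"
print(drop_concept("rain and thunder"))  # "rain"
```

Audio generated from such perturbed descriptions deliberately mismatches the original prompt, yielding the "loser" side of each preference pair.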

To handle the noisy preference pairs that automatic synthesis inevitably produces, the team selected a subset of the data for DPO fine-tuning based on criteria derived from CLAP-score differentials. This guarantees a minimum separation between the audios in each preference pair and a minimum alignment between the preferred audio and the input prompt.
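This selection step amounts to a filter over candidate pairs: keep a pair only if the preferred audio scores high enough against the prompt and beats the undesirable audio by a sufficient margin. The thresholds and field names below are illustrative, not the paper's actual values:

```python
def select_pairs(pairs, min_winner_score=0.45, min_gap=0.05):
    """Keep (winner, loser) pairs whose winner aligns well with the prompt
    and clearly beats the loser, according to CLAP scores in [0, 1].
    """
    kept = []
    for p in pairs:
        good_alignment = p["winner_clap"] >= min_winner_score
        clear_margin = p["winner_clap"] - p["loser_clap"] >= min_gap
        if good_alignment and clear_margin:
            kept.append(p)
    return kept

candidates = [
    {"winner_clap": 0.60, "loser_clap": 0.40},  # kept: aligned, well separated
    {"winner_clap": 0.30, "loser_clap": 0.10},  # dropped: winner poorly aligned
    {"winner_clap": 0.50, "loser_clap": 0.49},  # dropped: pair too close to call
]
print(len(select_pairs(candidates)))  # 1
```

Filtering out ambiguous pairs in this way keeps the DPO signal clean: the model only learns from contrasts it can trust.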

The team reports that fine-tuning Tango on the trimmed Audio-Alpaca dataset produces Tango 2, which outperforms both Tango and AudioLDM2 in objective and human evaluations. Exposure to the contrast between good and bad audio outputs during DPO fine-tuning helps Tango 2 map input-prompt semantics into the audio space more accurately. Even though Tango 2 builds its synthetic preference data from the same underlying dataset as Tango, it achieves notable improvements, demonstrating the effectiveness of the approach.

The team has summarized their primary contributions as follows:

  1. The study presents a low-cost, semi-automatic technique for producing a preference dataset for text-to-audio generation. The method yields a dataset in which each prompt is linked to multiple preferred and undesirable audio outputs, which aids model training.
  2. The resulting preference dataset, Audio-Alpaca, has been released to the research community. It can serve as a benchmark for future research as text-to-audio generation methods develop.
  3. Tango 2 outperforms both Tango and AudioLDM2 on objective and subjective measures, even though it sources no additional out-of-distribution text-audio pairs beyond Tango’s dataset. This demonstrates how well the proposed methodology improves model performance.
  4. Tango 2’s performance demonstrates the applicability of Diffusion-DPO, highlighting the technique’s potential for enhancing text-to-audio models and its usefulness in audio generation tasks.

Check out the Paper and Project. All credit for this research goes to the researchers of this project.


Tanya Malhotra is a final-year undergraduate at the University of Petroleum & Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with strong analytical and critical-thinking skills, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.
