Alibaba Researchers Introduce Qwen-Audio Series: A Set of Large-Scale Audio-Language Models with Universal Audio Understanding Abilities

Researchers from Alibaba Group introduced Qwen-Audio, which addresses the shortage of pre-trained audio models that can handle diverse tasks. A hierarchical tag-based multi-task framework is designed to avoid interference when heterogeneous datasets are co-trained. Qwen-Audio achieves impressive performance across benchmark tasks without task-specific fine-tuning. Qwen-Audio-Chat, built upon Qwen-Audio, supports multi-turn dialogues and diverse audio-centric scenarios, demonstrating its universal audio understanding abilities.
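To make the tag-based conditioning concrete, the sketch below shows how such a framework might prefix each training example with a sequence of special tags that tell a shared decoder which task to perform. This is a minimal illustration in Python; the tag names and function are assumptions for exposition, not the exact tokens or code used by Qwen-Audio.

```python
# Hypothetical sketch of a hierarchical tag-based multi-task prompt,
# loosely in the spirit of Whisper-style conditioning that Qwen-Audio extends.
# Tag names below are illustrative assumptions, not the model's actual tokens.

def build_task_prefix(audio_lang: str, task: str, text_lang: str,
                      timestamps: bool) -> str:
    """Compose special tags that tell the shared decoder which task to run,
    so datasets with different label formats can be co-trained without
    their text targets interfering with one another."""
    tags = [
        "<|startoftranscription|>",                       # start-of-output marker
        f"<|{audio_lang}|>",                              # language spoken in the audio
        f"<|{task}|>",                                    # e.g. "transcribe", "caption", "aqa"
        f"<|{text_lang}|>",                               # language of the target text
        "<|timestamps|>" if timestamps else "<|notimestamps|>",
    ]
    return "".join(tags)

# Example: English ASR with word-level timestamps vs. captioning natural sound.
print(build_task_prefix("en", "transcribe", "en", timestamps=True))
print(build_task_prefix("unknown", "caption", "en", timestamps=False))
```

The key design point is that coarse tags (task, language) are shared across datasets so knowledge transfers, while finer tags keep incompatible output formats from colliding.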

Qwen-Audio overcomes the limitations of previous audio-language models by handling diverse audio types and tasks. Unlike prior work that focuses on speech alone, Qwen-Audio incorporates human speech, natural sounds, music, and songs, allowing co-training on datasets with varying annotation granularities. The model excels in speech perception and recognition tasks without task-specific modifications. Qwen-Audio-Chat extends these capabilities to align with human intent, supporting multilingual, multi-turn dialogues from audio and text inputs, showcasing robust and comprehensive audio understanding.

Large language models excel at general-purpose reasoning but lack native audio comprehension. Qwen-Audio addresses this by scaling pre-training to cover 30 tasks and diverse audio types. A multi-task framework mitigates interference between datasets, enabling knowledge sharing across tasks. Qwen-Audio performs impressively across benchmarks without task-specific fine-tuning. Qwen-Audio-Chat, an extension, supports multi-turn dialogues and diverse audio-centric scenarios, showcasing comprehensive audio interaction capabilities in LLMs.

Qwen-Audio and Qwen-Audio-Chat are models for universal audio understanding and flexible human interaction. Qwen-Audio adopts a multi-task pre-training approach, optimizing the audio encoder while freezing the language model weights. In contrast, Qwen-Audio-Chat employs supervised fine-tuning, optimizing the language model while fixing the audio encoder weights. Training therefore proceeds in two stages: multi-task pre-training followed by supervised fine-tuning. Qwen-Audio-Chat enables versatile human interaction, supporting multilingual, multi-turn dialogues from audio and text inputs, showcasing its adaptability and comprehensive audio understanding.
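The two-stage recipe above can be pictured as toggling which half of the model receives gradients. The PyTorch-style sketch below is a hedged illustration of that idea; the class and module names (AudioLanguageModel, audio_encoder, llm) are assumptions for exposition and do not reflect the actual Qwen-Audio code layout.

```python
# Minimal PyTorch sketch of the two-stage training recipe described above.
# Module names are illustrative assumptions, not the real Qwen-Audio classes.

import torch
import torch.nn as nn


class AudioLanguageModel(nn.Module):
    def __init__(self, audio_encoder: nn.Module, llm: nn.Module):
        super().__init__()
        self.audio_encoder = audio_encoder  # e.g. a Whisper-style audio encoder
        self.llm = llm                      # the pretrained language model


def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable


def configure_stage(model: AudioLanguageModel, stage: str) -> list:
    """Stage 1 (multi-task pre-training): train the audio encoder, freeze the LLM.
    Stage 2 (supervised fine-tuning): freeze the audio encoder, train the LLM."""
    if stage == "pretrain":
        set_trainable(model.audio_encoder, True)
        set_trainable(model.llm, False)
    elif stage == "sft":
        set_trainable(model.audio_encoder, False)
        set_trainable(model.llm, True)
    else:
        raise ValueError(f"unknown stage: {stage}")
    return [p for p in model.parameters() if p.requires_grad]


# Usage: pass only the currently trainable parameters to the optimizer, e.g.
# optimizer = torch.optim.AdamW(configure_stage(model, "pretrain"), lr=1e-4)
```

Freezing the language model during pre-training preserves its text capabilities while the encoder learns audio representations; the roles are then swapped so instruction tuning can align outputs with human intent.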

Qwen-Audio demonstrates remarkable performance across diverse benchmark tasks, surpassing counterparts without task-specific fine-tuning. It consistently outperforms baselines by a substantial margin on tasks such as AAC, SRWT, ASC, SER, AQA, VSC, and MNA. The model establishes state-of-the-art results on CochlScene, ClothoAQA, and VocalSound, showcasing robust audio understanding capabilities. Qwen-Audio’s superior performance across various analyses highlights its effectiveness and competence in achieving state-of-the-art results in challenging audio tasks.

The Qwen-Audio series introduces large-scale audio-language models with universal understanding across diverse audio types and tasks. Developed through a multi-task training framework, these models facilitate knowledge sharing and overcome interference from varying textual labels in different datasets. Achieving impressive performance across benchmarks without task-specific fine-tuning, Qwen-Audio surpasses prior works. Qwen-Audio-Chat extends these capabilities, enabling multi-turn dialogues and supporting diverse audio scenarios, showcasing robust alignment with human intent and facilitating multilingual interactions.

Qwen-Audio’s future exploration includes expanding capabilities for different audio types, languages, and specific tasks. Refining the multi-task framework or exploring alternative knowledge-sharing approaches could address interference issues in co-training. Investigating task-specific fine-tuning can enhance performance. Continuous updates based on new benchmarks, datasets, and user feedback aim to improve universal audio understanding. Qwen-Audio-Chat is refined to align with human intent, support multilingual interactions, and enable dynamic multi-turn dialogues.


Check out the Paper and Github. All credit for this research goes to the researchers of this project.
