
Qwen-Audio: A Unified Multi-Task Audio-Language Model

Qwen-Audio, a universal audio understanding model, is here! The model supports a variety of tasks, languages, and audio types. It was presented by researchers Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou from Alibaba Group.

This newly proposed model accepts diverse audio, such as human speech, natural sounds, music, and songs, along with text as input, and generates results in text form. Output tasks include translation, rephrasing, or a detailed explanation of the input.

Qwen-Audio abstract

The considerable progress of large language models has significantly driven forward developments in the realm of artificial general intelligence (AGI), owing to their robust capacity for retaining knowledge, intricate reasoning, and problem-solving. Unifying all audio processing tasks has been a great challenge because their formats differ from one another. SpeechNet and SpeechT5 are examples of existing models, but they fall short of handling multiple tasks within a single model.

For the audio modality, researchers have tried to leverage well-trained audio models during training, as in AudioGPT and HuggingGPT. Nevertheless, these approaches often omit vital elements such as prosody and sentiment found in human speech, and they occasionally struggle to handle non-textual audio like ambient sounds.

Let's Discover Qwen-Audio

To address the above-mentioned challenges, Qwen-Audio was introduced: a core multi-task audio-language model that supports numerous languages, tasks, and audio types. Qwen-Audio-Chat was also introduced through instruction fine-tuning, and both models are open-source. In Qwen-Audio, the issue of diverse textual labels across various datasets is tackled with a multi-task training framework. This setup facilitates knowledge sharing and prevents interference arising from one-to-many mappings.

The model integrates over 30 tasks, and comprehensive experiments demonstrate strong performance. A practical example of one of these tasks is walked through below.

A sample music clip was provided to the model, and the user's follow-up conversation with the model is shown below.
Audio results

The above example shows that the Qwen-Audio model handles both audio and text inputs efficiently.

The research paper for this model is available on arXiv, and its code and dataset are open-source and available on GitHub. The researchers also provide a live demonstration for interested readers.
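For readers who want to try the released checkpoints locally, below is a minimal sketch of a two-turn conversation with Qwen-Audio-Chat. It assumes the public Hugging Face release (model ID Qwen/Qwen-Audio-Chat) and its trust_remote_code chat interface; the audio file path and the questions are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.manual_seed(1234)

# Assumed Hugging Face model ID for the instruction-tuned chat variant.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-Audio-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-Audio-Chat", device_map="auto", trust_remote_code=True
).eval()

# Build a mixed audio + text query; the audio path is a placeholder for a local music clip.
query = tokenizer.from_list_format([
    {"audio": "samples/music_clip.wav"},
    {"text": "What instruments are playing, and what is the mood of this piece?"},
])

# First dialogue turn: the model answers in text.
response, history = model.chat(tokenizer, query=query, history=None)
print(response)

# Second turn reuses the conversation history for a follow-up question.
response, history = model.chat(
    tokenizer, query="Suggest a short title for this piece.", history=history
)
print(response)
```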

Functioning of Qwen-Audio

Let's explore this latest model in more detail.

Architecture of Qwen-Audio

Qwen-Audio consists of an audio encoder and a large language model. The audio encoder processes numerous types of audio and is initialized from the Whisper-large-v2 model, which was trained for speech recognition and translation; it comprises roughly 640M parameters. The large language model is the other basic component: Qwen-7B, a 32-layer Transformer decoder. Because of the diversity of tasks, the various training datasets show considerable variation in their textual labels.
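As a rough illustration of this two-component design, here is a minimal, hypothetical sketch of how a Whisper-style audio encoder could be wired to a decoder-only language model. The class name, the projection layer, and the dimensions are assumptions for illustration only, not the official implementation.

```python
import torch
import torch.nn as nn

class AudioLanguageModelSketch(nn.Module):
    """Hypothetical wiring of an audio encoder to a decoder-only LLM."""

    def __init__(self, audio_encoder: nn.Module, llm: nn.Module,
                 d_audio: int = 1280, d_model: int = 4096):
        super().__init__()
        self.audio_encoder = audio_encoder  # e.g. initialized from Whisper-large-v2 (~640M params)
        self.llm = llm                      # e.g. Qwen-7B, a 32-layer Transformer decoder
        # Assumed projection mapping encoder features into the LLM embedding space.
        self.proj = nn.Linear(d_audio, d_model)

    def forward(self, mel: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        audio_feats = self.audio_encoder(mel)          # (batch, frames, d_audio)
        prefix = self.proj(audio_feats)                # (batch, frames, d_model)
        inputs = torch.cat([prefix, text_embeds], 1)   # audio features prepended to text embeddings
        return self.llm(inputs)                        # decoder predicts the output text
```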

A multi-task training format framework was introduced to deal with different kinds of audio. This format uses a Transcription Tag, an Audio Language Tag, a Task Tag, a Text Language Tag, a Timestamps Tag, and Output Instructions. Conditioning the model on these tags keeps the output formats of different tasks separate, which avoids the problem of one-to-many mapping.
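To make the tag-based format concrete, here is a small illustrative helper that composes such a task prefix. The exact tag strings are assumptions for illustration; the paper defines the actual special tokens.

```python
def build_task_prefix(audio_lang: str, task: str, text_lang: str,
                      with_timestamps: bool = False) -> str:
    """Compose an illustrative multi-task prefix from the tags described above."""
    tags = [
        "<|startoftranscript|>",      # transcription tag (assumed name)
        f"<|{audio_lang}|>",          # audio language tag, e.g. <|en|>
        f"<|{task}|>",                # task tag, e.g. <|transcribe|> or <|translate|>
        f"<|{text_lang}|>",           # text language tag of the desired output
        "<|timestamps|>" if with_timestamps else "<|notimestamps|>",  # timestamps tag
    ]
    return "".join(tags)

# Example: English speech recognition with word-level timestamps.
print(build_task_prefix("en", "transcribe", "en", with_timestamps=True))
# -> <|startoftranscript|><|en|><|transcribe|><|en|><|timestamps|>
```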

Performance Comparison

Qwen-Audio's performance was assessed across different tasks without fine-tuning for any specific task, and it showed superior performance compared to other models. The evaluation spans diverse audio analysis tasks, including automatic audio captioning (AAC), speech recognition with word-level timestamps (SRWT), acoustic scene classification (ASC), automatic speech recognition (ASR), speech emotion recognition (SER), audio question answering (AQA), vocal sound classification (VSC), and music note analysis (MNA).

Comparisons with other models

The results show that Qwen-Audio outperforms the compared models by a significant margin.

Finishing Thoughts

Qwen-Audio is a collection of expansive audio-language models possessing universal comprehension of audio. To integrate various audio types for co-training, researchers introduced a unified multi-task learning structure that fosters knowledge exchange among similar tasks and mitigates the challenges of one-to-many mapping due to distinct text formats.

The results show that these models surpass previous benchmarks across varied assessments, showcasing their universal grasp of audio without necessitating task-specific fine-tuning.
