MLNews

SALMONN: ByteDance Presents a Model with Generic Hearing Abilities for LLMs

Imagine a world in which AI-based models can hear, understand, and feel the emotions in voices. SALMONN is a cutting-edge AI model that can recognize speech and, beyond that, understand conversation, music, and the emotion carried by sound. SALMONN doesn't just perform the tasks it was explicitly taught; it also generalizes to tasks it never saw during training, such as translating speech into untrained languages, answering spoken questions, and creating stories from audio.

Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang from the Department of Electronic Engineering, Tsinghua University, and ByteDance carried out this research study.

They have developed a model known as SALMONN. It behaves like a smart assistant that understands sounds such as conversations, noises, and music. SALMONN is built on top of a large language model.

Figure: audio given as input and the corresponding output produced by SALMONN.

SALMONN can transcribe what people say, translate speech into other languages, answer questions, recognize emotions, verify who is speaking, and describe music and audio. It can also perform tasks it was never explicitly taught, such as filling in missing information from spoken input (slot filling), answering spoken questions, and telling stories based on audio.

Prior Related Work and Limitations

Earlier work aimed to give text-based conversation models an understanding of spoken language. Various strategies were devised for dealing with voice inputs, including adding speech tokens to LLMs for speech recognition. However, these approaches tend to treat audio as a fixed-size representation, much like an image, which makes processing variable-length speech difficult.

Others have attempted to use LLMs for speech recognition and audio captioning. For handling speech, audio events, and music, one solution known as AudioGPT was proposed, which relies on interacting with other specialized models in order to work.

Fundamentals of SALMONN

Text-based large language models (LLMs) have made major advances in natural language processing in recent years, achieving human-level performance in many applications. Instruction tuning, the process of structuring training data into pairs of user commands and responses, allows LLMs to follow open-ended human guidance.
Recent research has focused on connecting LLMs to other kinds of input, such as images, silent videos, and audio events. These approaches link the encoder output spaces to the LLM input space using connection modules and LLM adaptors.
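
To make the instruction-tuning idea concrete, here is a hypothetical example of how an audio task could be framed as a command-response pair (the field names and values are illustrative assumptions, not drawn from SALMONN's actual training data):

```python
# Hypothetical instruction-tuning example: an audio task framed as an
# instruction/response pair. Field names and values are illustrative only.
example = {
    "audio": "clip_0001.wav",                                    # path to the audio input
    "instruction": "Transcribe the speech in this recording.",   # user command
    "response": "Thanks for calling, how can I help you today?"  # target answer
}
```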

Figure: audio event detection by SALMONN.

SALMONN is a multimodal LLM capable of understanding three primary forms of sound: speech, audio events, and music. To perform well on both speech and non-speech audio tasks, SALMONN uses a dual-encoder structure composed of a Whisper speech encoder and a BEATs audio encoder. The connection module is a window-level query Transformer (Q-Former), which converts the encoder outputs into a variable number of audio tokens fed to the Vicuna LLM. This design achieves strong audio-text alignment.
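
A minimal PyTorch sketch of how such a dual-encoder front end could be wired up, assuming both encoders emit time-aligned frame features (the class and argument names are illustrative, not the authors' code):

```python
# Minimal sketch of a dual-encoder front end (illustrative, not the SALMONN code).
import torch
import torch.nn as nn

class DualAudioEncoder(nn.Module):
    """Concatenates features from a Whisper-style speech encoder and a
    BEATs-style audio-event encoder along the feature dimension."""

    def __init__(self, speech_encoder: nn.Module, audio_encoder: nn.Module):
        super().__init__()
        self.speech_encoder = speech_encoder  # kept frozen during SALMONN training
        self.audio_encoder = audio_encoder    # kept frozen during SALMONN training

    @torch.no_grad()  # frozen encoders: no gradients flow through them
    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        speech_feats = self.speech_encoder(waveform)  # (B, T, D_speech)
        audio_feats = self.audio_encoder(waveform)    # (B, T, D_audio)
        # Assumes both encoders emit the same frame rate; real code may resample.
        return torch.cat([speech_feats, audio_feats], dim=-1)  # (B, T, D_speech + D_audio)
```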

They apply the low-rank adaptation (LoRA) strategy to Vicuna as a cross-modal adaptor, aligning its augmented input space with its output space. The window-level Q-Former and LoRA are trained jointly on speech, audio, and music tasks. Even so, such multimodal LLMs often show little or no cross-modal emergent ability, a problem the authors call “task over-fitting.” They propose an additional few-shot activation tuning stage to address it, which enables SALMONN to regain its lost emergent abilities.

Figure: speech recognition with SALMONN.

SALMONN's hearing abilities are evaluated on a range of speech, audio-event, and music tasks. These tasks fall into three categories: tasks trained during instruction tuning, untrained speech-based NLP tasks, and untrained tasks requiring comprehension of both speech and non-speech input. The experimental results show that SALMONN performs all of these tasks effectively on standard benchmarks.

SALMONN's architecture, training, and performance lay the foundation for an interesting future: cross-modal AI models, applications across many industries, multilingual support for worldwide demand, and advances in zero-shot learning. The conclusions drawn from these results point to new innovations for tackling real-world problems in multimodal AI.

Research Data and Code Accessibility

The research paper for this study is available on arXiv. The implementation code and a demo of the model are also freely available to everyone on GitHub.

Potential Fields of SALMONN

The details provided on SALMONN's design, training stages, and performance have many potential applications and could help advance multimodal AI systems for practical use. Researchers and developers in speech recognition, audio event detection, and natural language processing can draw on them to build more robust and flexible models. This could prove useful for voice assistants, automatic audio captioning for video content, and other cross-modal tasks.

The findings also provide useful guidance for dealing with task over-fitting in large language models. The demonstrated impact of activation tuning and LoRA on model performance can help shape strategies for developing AI systems, particularly for companies and developers looking to deploy more flexible and adaptive AI models in industries such as healthcare, education, content creation, and customer interaction.

Methodologies Used in SALMONN

The researchers introduce the model's architecture, which includes two auditory encoders, a connection module called the Q-Former, and a large language model (LLM) with LoRA adaptors. The dual auditory encoders consist of a speech encoder and a non-speech audio encoder. The Q-Former was originally designed for images and has been modified here to accommodate audio inputs; it translates the output of the auditory encoders into tokens that the LLM consumes together with the text instruction.


Dual Auditory Encoders: This model combines a Whisper speech encoder with a BEATs audio encoder for non-speech audio tasks. Their complementary strengths cover both speech and general audio inputs.

Window-Level Q-Former: SALMONN uses a window-level technique for audio inputs, as opposed to a typical Q-Former. It divides the input into windows and converts each window into a small set of textual tokens using the Q-Former, so the total number of tokens scales with the length of the audio. This makes the approach better suited to variable-length audio sequences, which is particularly important for speech recognition.
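
A rough sketch of the window-level idea, under the assumption of a single learned query per window and a fixed window length; all dimensions below are guesses for illustration, not values taken from the paper:

```python
# Simplified window-level Q-Former sketch: learned queries cross-attend to each
# audio window, so the number of output tokens grows with the audio length.
import torch
import torch.nn as nn

class WindowLevelQFormer(nn.Module):
    def __init__(self, feat_dim=2048, llm_dim=5120, num_queries=1, window_len=17):
        super().__init__()
        self.window_len = window_len
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim))
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(feat_dim, llm_dim)  # project into the LLM embedding space

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, feat_dim) frame-level features from the dual encoder
        batch_size, num_frames, _ = feats.shape
        tokens = []
        for start in range(0, num_frames, self.window_len):
            window = feats[:, start:start + self.window_len]             # (B, <=window_len, D)
            queries = self.queries.unsqueeze(0).expand(batch_size, -1, -1)
            attended, _ = self.cross_attn(queries, window, window)       # (B, num_queries, D)
            tokens.append(attended)
        return self.proj(torch.cat(tokens, dim=1))  # (B, num_windows * num_queries, llm_dim)
```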

LLM and LoRA: SALMONN uses a pre-trained Vicuna LLM that has been fine-tuned to follow instructions. LoRA is used to adapt its self-attention mechanisms.

A window-level Q-Former serves as the connection module, combining the outputs of the Whisper speech encoder and the BEATs audio encoder into augmented audio tokens. The LoRA adaptor aligns the LLM's augmented input space with its output space. Text prompts pose open-ended questions about the audio inputs, and the answers are delivered in the LLM's written responses. The LLM and encoders are kept frozen, while the remaining components can be updated during training.
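
For illustration, a LoRA adaptor wrapping a frozen projection might look like the following minimal sketch (not the authors' implementation); the default rank and scaling mirror the values reported later in the article:

```python
# Minimal LoRA sketch: a frozen base projection plus a trainable low-rank update,
# scaled by a factor that can later be discounted at test time. Illustrative only.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, scaling: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # the pre-trained projection stays frozen
        self.lora_a = nn.Linear(base.in_features, r, bias=False)   # down-projection
        self.lora_b = nn.Linear(r, base.out_features, bias=False)  # up-projection
        nn.init.zeros_(self.lora_b.weight)  # the low-rank update starts as a no-op
        self.scaling = scaling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))
```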

Figure: the model architecture of SALMONN.

Training Methods of SALMONN

The model goes through a multi-stage training process. Large amounts of speech recognition and audio captioning data are used in the pre-training step, which aligns the pre-trained model components with the newly added ones, namely the Q-Former and LoRA. The model is then fine-tuned with text-based instructions covering speech, audio events, and music. While this improves performance on the trained tasks, it introduces a new difficulty called “task over-fitting,” where SALMONN fails on untrained cross-modal tasks.

An activation tuning stage is introduced to address task over-fitting. Lowering the LoRA adaptor's scaling factor allows the model to generate longer and more varied replies, which lets it succeed on untrained cross-modal tasks. In summary, SALMONN is a multimodal language model built to accommodate a wide range of audio inputs such as speech, audio events, and music, with an architecture comprising two auditory encoders, a window-level Q-Former, and an LLM with LoRA adaptors.
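
Since only the newly added components are trainable, the parameter selection before building the optimizer might look roughly like this sketch (the module name patterns and learning rate are hypothetical):

```python
# Sketch: freeze everything except the Q-Former and LoRA parameters before
# building the optimizer. The name patterns below are hypothetical.
import torch

def select_trainable(model: torch.nn.Module):
    trainable_params = []
    for name, param in model.named_parameters():
        is_trainable = ("qformer" in name.lower()) or ("lora" in name.lower())
        param.requires_grad_(is_trainable)
        if is_trainable:
            trainable_params.append(param)
    return trainable_params

# optimizer = torch.optim.AdamW(select_trainable(salmonn), lr=1e-4)  # lr is a guess
```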

Experimental Results of SALMONN

To attain its abilities, SALMONN combines multiple components, several pre-trained models, and specific settings. It uses the encoder of the Whisper-Large-v2 model for speech, the fine-tuned BEATs encoder for audio, and the Vicuna model with 13 billion parameters as the LLM. A window-level Q-Former with specific settings is used, producing 88 textual tokens from a 30-second audio clip. A LoRA (low-rank adaptation) adaptor with a rank of 8 and a scaling factor of 4.0 is applied. Only the Q-Former and LoRA parameters are updated during training, totaling around 33 million parameters.
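
A quick back-of-the-envelope check of that token count, assuming one audio token per Q-Former window with a uniform stride:

```python
# Back-of-the-envelope check (assumption: one audio token per Q-Former window,
# uniform window stride across the clip).
audio_seconds = 30.0
num_audio_tokens = 88  # reported for a 30-second clip
stride = audio_seconds / num_audio_tokens
print(f"Implied window stride: {stride:.2f} s per audio token")  # ~0.34 s
```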

Figure: performance changes on ASR & PR (a), SQQA (b), Story (c), and SAC (d).

SALMONN goes through a three-stage training process. The pre-training step uses a combination of datasets: the 960-hour LibriSpeech training set and the 1000-hour GigaSpeech M-set for speech recognition, plus the 2800-hour WavCaps, AudioCaps, and Clotho datasets for audio captioning. Automatic speech recognition (ASR), automatic speech translation (AST), automatic audio captioning (AAC), and other tasks are included in the instruction tuning stage. To reduce task over-fitting, the activation tuning stage trains the model on stories written from audio samples; it runs for 12 steps, each using a single story sample, to activate SALMONN.

The study also examines the effect of discounting the LoRA scaling factor at test time on task over-fitting. Dropping the LoRA scaling factor to roughly 2.0 activates the model's cross-modal reasoning abilities and suggests the presence of an underlying conditional language model beneath the LoRA adaptation.
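
Reusing the hypothetical LoRALinear sketch from earlier, test-time discounting could be as simple as overwriting the scaling factor before running inference:

```python
# Test-time discounting of the LoRA scaling factor (e.g. from 4.0 down to ~2.0).
# LoRALinear refers to the hypothetical class from the earlier sketch.
import torch

def discount_lora_scaling(model: torch.nn.Module, new_scaling: float = 2.0) -> None:
    for module in model.modules():
        if isinstance(module, LoRALinear):
            module.scaling = new_scaling  # weaken the instruction-tuned update
```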

Figure: “Repeat Rate” is the percentage of samples that SALMONN generated.

The task over-fitting analysis shows that the perplexity (PPL) of the ASR and AAC tasks drops after the initial pre-training stage, indicating that the model has learned cross-modal alignment. After instruction tuning, the PPLs for Story and SAC remain high, highlighting the need to discount or remove the LoRA adaptation.

The study also analyzes several activation strategies, using ASR, Story, and QA data, and shows that training with stories or QA data with long replies increases SALMONN's cross-modal emergent abilities; the choice of activation data has an important effect on the model's performance.
These findings show SALMONN's capability across many tasks given the right tuning and training procedures, emphasizing the importance of its architecture and training stages.

Final Words about SALMONN

SALMONN, a speech audio language music open neural network, is proposed in this research and can be viewed as a step toward universal hearing abilities for LLMs. Fitted with dual auditory encoders, SALMONN achieved competitive results on trained tasks such as speech recognition, audio captioning, and speech translation, while extending to many untrained understanding tasks such as slot filling and speech translation into untrained languages.

The proposed activation tuning stage unlocks emergent abilities such as audio-based storytelling and speech audio co-reasoning. As a result, SALMONN demonstrates a promising path toward generic hearing AI.

References

https://github.com/bytedance/SALMONN

https://arxiv.org/pdf/2310.13289.pdf

