MLNews

LanSER: Language-Model Supported Speech Emotion Recognition

Consider a world in which machines can understand your emotions simply by hearing your voice. LanSER moves toward that goal by applying the power of language models to improve speech emotion recognition, pointing to a future in which spoken words disclose the feelings behind them. The research involves Taesik Gong (KAIST, Republic of Korea), Josh Belanich, and collaborators at Google Research.

Speech emotion recognition (SER) models often rely on expensive human-labeled data for training, making it difficult to scale to huge voice datasets and complex emotion taxonomies. LanSER is a method for using unlabeled data by inferring weak emotion labels with pre-trained large language models and weakly supervised learning.

They employ a textual entailment strategy to derive weak labels constrained to a taxonomy: for a speech transcript obtained via automatic speech recognition, they select the emotion label with the highest entailment score.
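The label-selection step can be sketched as follows. Note the hedges: in the actual system the entailment scores come from a large pre-trained NLI/language model; here `entailment_score` is a deliberately trivial keyword-overlap stand-in so the example is self-contained, and the hypothesis template and taxonomy are illustrative assumptions, not the paper's exact choices.

```python
# Sketch of entailment-based weak labeling. `entailment_score` is a toy
# stand-in for a real NLI model's entailment probability: it just measures
# word overlap between the transcript (premise) and the hypothesis.

def entailment_score(premise: str, hypothesis: str) -> float:
    """Toy entailment scorer: fraction of hypothesis words found in the premise."""
    overlap = set(premise.lower().split()) & set(hypothesis.lower().split())
    return len(overlap) / (len(hypothesis.split()) or 1)

def weak_emotion_label(transcript: str, taxonomy: list[str]) -> str:
    """Pick the taxonomy label whose hypothesis the transcript best entails."""
    hypotheses = {e: f"this person is feeling {e}" for e in taxonomy}
    scores = {e: entailment_score(transcript, h) for e, h in hypotheses.items()}
    return max(scores, key=scores.get)

taxonomy = ["anger", "fear", "joy", "sadness", "neutral"]
print(weak_emotion_label("I am feeling so much joy today", taxonomy))  # joy
```

Swapping the toy scorer for a real entailment model keeps the rest of the logic unchanged, which is what makes the approach easy to scale over large transcript collections.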


Related work to LanSER

SER with LLMs:

LLMs have recently been used to produce pseudo-labels for semi-supervised learning in speech sentiment analysis, where LLMs were fine-tuned on a labeled emotion dataset to investigate negative, positive, and neutral emotion classes.

The LanSER technique, by contrast, avoids fine-tuning LLMs on task-specific datasets by inferring weak labels via textual entailment, allowing exploration of larger emotion taxonomies. In multi-modal emotion recognition, MEmoBERT combined audio, visual, and text information with prompt-based learning for unsupervised emotion recognition; unlike that approach, this work does not require pre-training on large human-annotated emotion datasets.

Self-supervised learning:

Self-supervised learning is a popular method for pre-training that makes use of massive amounts of unlabeled speech data. Recent research has shown that large pre-trained models learned through self-supervised learning perform well on a variety of downstream speech tasks, including numerous paralinguistic tasks. They view self-supervised learning and their weak supervision from LLMs as complementary, since the two techniques can be combined for training SER models.

Comparison of three weak label generation approaches used in LanSER: text generation, filling mask, and textual entailment.

LanSER model

Humans infer the emotion expressed by a speaker based on both what is said (lexical content) and how it is said (prosody). Modern methods in speech emotion recognition (SER) exploit the interaction of these two components to represent emotional expression in speech. However, due to the variability of natural speech and the reliance on human assessments based on restricted emotion taxonomies, such systems still have limits in real-world scenarios. Extending model training on large, human-labeled natural speech datasets to complex emotion taxonomies is costly and confounded by the subjective nature of emotion perception.

To infer expressed emotion categories from textual content, they use large language models (LLMs). LLMs have exhibited strong capabilities on a variety of downstream tasks, including a few subjective tasks such as social and emotional reasoning, thanks to the knowledge gained from pre-training on huge text collections. LLMs have been investigated in domains such as computer vision to lessen the need for labeled data, for example in visual question answering. However, they have not been explored for emotion recognition tasks, particularly from natural speech.

LanSER employs LLMs to infer emotion categories from speech content (transcribed text), which serve as weak labels for SER. LanSER enables pre-training an SER model on large speech datasets without human labels in three steps:

(1) Extract text transcripts from the input speech using automatic speech recognition (ASR).

(2) Infer weak emotion labels with an engineered prompt and a fixed taxonomy.

(3) Pre-train the SER model on the weak labels. They show that fine-tuning LanSER on benchmark datasets improves SER performance and label efficiency. Furthermore, they show that, despite the emotion labels being derived only from speech content, LanSER captures prosodic information important to SER.
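The three steps above can be sketched end to end. This is a minimal outline, not the authors' implementation: `transcribe` and `infer_weak_label` are placeholders for a real ASR system and an LLM-based entailment labeler, and the taxonomy is an illustrative subset.

```python
# Minimal sketch of the LanSER pre-training pipeline with stub components.

TAXONOMY = ["anger", "fear", "joy", "sadness", "neutral"]

def transcribe(audio_clip: str) -> str:
    """Placeholder ASR: in practice a speech recognizer runs here."""
    return audio_clip  # pretend the clip is its own transcript

def infer_weak_label(transcript: str) -> str:
    """Placeholder LLM labeler: pick the first taxonomy word found."""
    words = transcript.lower().split()
    for emotion in TAXONOMY:
        if emotion in words:
            return emotion
    return "neutral"

def build_weak_dataset(audio_clips):
    """Steps 1-2: transcribe each clip and attach a weak emotion label."""
    return [(clip, infer_weak_label(transcribe(clip))) for clip in audio_clips]

# Step 3 would pre-train an SER model on these (audio, weak label) pairs.
dataset = build_weak_dataset(["such joy in her voice", "a voice full of fear"])
print(dataset)
```

The key design point is that no human labels enter the pipeline: every label is derived from the transcript, so the approach scales with the amount of raw audio available.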

Pipeline of LanSER

LanSER future scope

LanSER is an innovative technology that could transform many aspects of our lives in the near future. It is poised to make our interactions with computers simpler and more emotionally attuned, from personalized mental health care that detects emotional swings in speech to transforming client interactions by evaluating customer satisfaction in real time. Its possible uses range from healthcare to entertainment to education and beyond, welcoming a new era in which technology understands not only what we say but also how we feel, resulting in a more caring and responsive society.

LanSER research paper and detail

The research material related to LanSER is available on arXiv; those interested in more details can check the paper via the reference below. The paper is freely accessible to anyone at any time.

Potential applications of LanSER

Improving Customer Service: LanSER can be integrated into call centers to evaluate and analyze customer emotions during interactions. By understanding client sentiment in real time, companies can adapt their responses and provide support more effectively.

LanSER can be used to analyze the emotional tone of spoken content during production. It can help creators in the entertainment business, for example, estimate how their work is likely to be received emotionally by the audience, allowing them to fine-tune scripts, dialogue, and scenes for the best impact.

LanSER can help language learners by offering feedback on their emotional expression and pronunciation. It may help learners not only understand the meaning of words but also convey the desired emotions accurately, improving their communication abilities in a foreign language.

These applications demonstrate how LanSER's capabilities may be used to improve communication, customer experiences, content production, and many other fields in a variety of practical contexts.

Experimentation of LanSER

Their central hypothesis is that, given enough data, pre-training speech-only models on weak emotion labels derived from text enhances performance on SER tasks. As a result, they concentrate on speech-only emotion recognition models throughout the study. Furthermore, their goal is not to obtain state-of-the-art results on downstream tasks but to determine whether models pre-trained with LanSER achieve better performance at a given model capacity.

Data preparation:

They used two large-scale speech datasets for LanSER pre-training. The first is People's Speech, currently the largest English speech recognition dataset, containing 30k hours of general speech. The second is Condensed Movies, comprising 1000 video clips from 3000 movies, of which they use only the audio.

For downstream tasks they used two common SER benchmarks. The first is IEMOCAP, a multi-speaker database containing 5531 audio clips from 12 hours of speech. The second is CREMA-D, which is linguistically constrained, containing only 12 sentences, each spoken with six different emotions: anger, disgust, fear, happy, neutral, and sad.

Prompt engineering:

They investigated the impact of various prompts on the inferred weak emotion labels using IEMOCAP, chosen because it has transcripts along with human-rated labels whose majority vote serves as ground truth. They computed accuracy by comparing the weak labels against this ground truth. They also examined prompts used in previous recognition studies and modified a few task-specific variants, for example replacing words such as "photo" or "image" with "speech".

Fine-Tuning:

They fine-tuned the model on the downstream tasks to check label efficiency and performance. To measure label efficiency, they varied the percentage of training data seen from 10% to 100% of each dataset. In the results table, "LanSER People's Speech" means pre-trained on People's Speech and "LanSER Condensed Movies" means pre-trained on Condensed Movies; in all cases they use the BRAVE taxonomy as the label space.
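The label-efficiency experiment described above follows a standard recipe: fine-tune on growing fractions of the downstream training set and record accuracy at each fraction. A minimal sketch, where `fine_tune_and_eval` is a made-up placeholder (not the paper's training code) whose accuracy simply grows with the amount of data:

```python
# Sketch of a label-efficiency sweep. `fine_tune_and_eval` stands in for
# real fine-tuning + evaluation; here accuracy is a toy function of subset size.
import random

def fine_tune_and_eval(train_subset):
    """Placeholder: accuracy grows (with diminishing returns) with data size."""
    return min(1.0, 0.4 + 0.05 * len(train_subset) ** 0.5)

def label_efficiency_curve(train_set, fractions=(0.1, 0.25, 0.5, 1.0), seed=0):
    """Fine-tune on random subsets of increasing size; return accuracy per fraction."""
    rng = random.Random(seed)
    results = {}
    for frac in fractions:
        k = max(1, int(len(train_set) * frac))
        subset = rng.sample(train_set, k)
        results[frac] = fine_tune_and_eval(subset)
    return results

curve = label_efficiency_curve(list(range(100)))
print(curve)
```

Plotting such a curve for a LanSER-pre-trained model against a baseline makes the label-efficiency claim concrete: a more label-efficient model sits higher at small fractions of the training data.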

Unweighted accuracy of fine-tuning LanSER for downstream tasks with varying percentages of training data.

Zero-shot classification accuracy:

A unique advantage of LanSER over self-supervised learning is that it enables the model to support zero-shot classification. As a lower bound on performance, they use a model with randomly initialized weights and no training. LanSER achieves higher accuracy than this baseline, although it does not match fine-tuned models. These results suggest the potential of training a large SER model that performs well zero-shot; improving zero-shot performance with the proposed framework is left as future work.

Impact of taxonomy:

They compare the BRAVE taxonomy with the downstream tasks' taxonomies. Pre-training with the finer-grained BRAVE taxonomy shows better accuracy, by 4.2% on average. This suggests that a fine-grained taxonomy is beneficial for learning effective representations, leveraging the high degree of expressiveness of LLMs.
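One practical way to pre-train on a fine-grained taxonomy and still evaluate against a coarse downstream one is to collapse fine labels into coarse classes at fine-tuning time. A minimal sketch; the mapping below is an illustrative, made-up example, not the actual BRAVE-to-benchmark mapping:

```python
# Illustrative fine-to-coarse label mapping, in the spirit of collapsing a
# large pre-training taxonomy into a small downstream one. The specific
# entries are invented for demonstration, not taken from the paper.
FINE_TO_COARSE = {
    "irritation": "anger",
    "rage": "anger",
    "terror": "fear",
    "anxiety": "fear",
    "delight": "happy",
    "contentment": "happy",
    "grief": "sad",
    "disappointment": "sad",
}

def coarsen(labels, default="neutral"):
    """Collapse fine-grained weak labels into coarse downstream classes."""
    return [FINE_TO_COARSE.get(label, default) for label in labels]

print(coarsen(["rage", "delight", "boredom"]))  # ['anger', 'happy', 'neutral']
```

The finding that fine-grained pre-training helps suggests the model benefits from the extra distinctions even when they are later merged away.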

Impact of taxonomy selection for pre-training LanSER.

Conclusion of LanSER

LanSER is a language-model-supported speech emotion recognition approach that takes advantage of huge unlabeled speech datasets by producing weak labels via textual entailment using LLMs. According to their experimental results, LanSER can acquire effective emotional representations, including prosodic aspects.

They identify several directions for further research. Filtering mechanisms, or prompts modified to include more conversational context such as previous and next utterances or scene descriptions, may lessen the weak-label noise. Furthermore, employing LLMs to produce weak labels in an open-set taxonomy may improve expressiveness.

Reference

https://arxiv.org/pdf/2309.03978.pdf

