
MuAViC: Nine Languages for Audio-Visual Speech Recognition and Speech-to-Text Translation

MuAViC is where words and images meet, enabling strong speech recognition and speech-to-text translation like never before. It is a hidden gem that contains 1,200 hours of audio-visual speech spanning nine languages of human communication. Meta AI is behind the research on MuAViC, building models that can recognize speech and translate speech to text.

MuAViC stands for Multilingual Audio-Visual Corpus, a dataset built for robust speech recognition and speech-to-text translation. The collection contains over 1,200 hours of transcribed audio-visual speech from over 8,000 speakers in nine languages: English (En), Arabic (Ar), German (De), Greek (El), Spanish (Es), French (Fr), Italian (It), Portuguese (Pt), and Russian (Ru). This makes MuAViC the most complete open benchmark for multilingual audio-visual speech recognition (AVSR) and lipreading so far. The authors also provide text translations and establish baselines for six English-to-X translation directions as well as six X-to-English translation directions. MuAViC is, so far, the first publicly available benchmark for audio-visual speech-to-text translation.
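
For reference, the nine language codes map to language names as follows. The dictionary below is just an illustrative convenience for filtering data by language; the variable name is ours, not part of the MuAViC release:

```python
# ISO 639-1 codes for the nine MuAViC languages (the mapping comes from the
# paper; the dictionary itself is only an illustrative helper).
MUAVIC_LANGUAGES = {
    "en": "English",
    "ar": "Arabic",
    "de": "German",
    "el": "Greek",
    "es": "Spanish",
    "fr": "French",
    "it": "Italian",
    "pt": "Portuguese",
    "ru": "Russian",
}
```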

Illustration of MuAViC

Evolutionary steps towards MuAViC

In the past, audio-visual speech recognition (AVSR), which recognizes spoken words using both visual and auditory input, has been shown to improve the reliability of speech recognition. Deep learning has significantly improved the performance of AVSR systems, reducing the word error rate by up to a factor of three compared to audio-only counterparts. However, AVSR has mostly been benchmarked in a single-language setup, with the majority of work based on English only, because the field has lacked large-scale multilingual audio-visual datasets.
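
Since these comparisons are framed in terms of word error rate, here is a minimal sketch of how WER is conventionally computed, as a standard word-level edit distance. This is generic reference code, not code from the MuAViC release:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])  # substitution (or match)
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)  # deletion, insertion
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat", "the cat sat down"))  # one insertion -> ~0.33
```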

Audio-visual speech-to-text translation (AVST) recognizes both the audio and the visual content of speech and then translates that speech into text in another language. AVST builds on the same kind of representations as AVSR, so the visual modality brings the same noise-robust effect to AVST, but it also inherits AVSR's shortage of training data.

MuAViC addresses these problems. It is the first benchmark for audio-visual speech translation and the largest multilingual benchmark for audio-visual speech recognition. It contains approximately 1,200 hours of data in nine languages: English (En), Arabic (Ar), German (De), Greek (El), Spanish (Es), French (Fr), Italian (It), Portuguese (Pt), and Russian (Ru).

In addition to the broad language coverage, the authors provide text translations and baselines for six English-to-X translation directions as well as six X-to-English translation directions.

Role of MuAViC in shaping the future

The future of MuAViC offers enormous opportunities and has the capacity to reshape many fields. It gives future models a deeper foundation for both audio-visual speech recognition and speech-to-text translation. Models built on it could break down the language barriers between people who do not speak each other's languages, letting them understand one another easily.

Real-time use could become a standard feature: models trained on MuAViC could enable on-the-fly translation during chats, conferences, and even virtual environments. This will have far-reaching impacts on global cooperation and accessibility.

MuAViC will also be critical in increasing accessibility for people with hearing impairments. It can be incorporated into a wide range of assistive technologies, making communication and information easier to access.

In augmented reality and virtual reality environments, MuAViC could enhance the experience by providing real-time subtitles and translations. This integration will be important in overcoming cultural and linguistic divides in virtual worlds.

Accessibility and Research Studies

The detailed research paper on MuAViC is publicly available on arxiv.org, and the code and implementation details are available on GitHub. All of the content is open source and free for public use. The paper is written to be broadly understandable, while the code and more technical material are aimed at researchers; everything remains open to anyone who is interested.

Exploring MuAViC's versatile applications

There are many potential applications of MuAViC across different fields; some of them are given below:

Multilingual Communication Breakthrough: MuAViC’s multilingual qualities will revolutionize cross-cultural communication, making it simple for people from diverse language backgrounds to comprehend each other smoothly. It will be used in foreign diplomacy, commercial discussions, and casual talks.

Real-Time Translation in Chats and Conferences: MuAViC’s real-time translating feature will become crucial in chat applications, virtual meetings, and conferences. It will break down linguistic boundaries, promoting global collaboration and understanding.

Usability for Hearing-Impaired Persons: MuAViC will play a critical role in improving accessibility for people with hearing impairments. It can be integrated into assistive devices, making audio content accessible to the deaf and hard of hearing through text.

Education: MuAViC can be a great educational tool, providing subtitles and translations for online courses and educational videos. It can assist students around the world, particularly those learning in a language that is not their native tongue.

Improved Human-Computer Interaction: MuAViC’s increased speech recognition and translation capabilities will result in more natural and successful human-computer interactions, benefiting virtual assistants, customer care chatbots, and automated language services.

Healthcare: MuAViC can help in multilingual patient-doctor communication in healthcare, making medical advice and information more accessible to a wider audience. Medicine consultations can be done across language barriers with ease.

Language learning: MuAViC can be integrated into language learning apps and platforms to provide real-world language practice and improve pronunciation.

Corpus Creation in MuAViC

MuAViC sources its data from existing corpora of talk recordings, in which native or non-native speakers give public speeches on stage while cameras capture the scene, rotating between multiple angles. The authors extract the audio and video tracks from the recordings and match them with human transcriptions and text translations.

For English speakers, the authors use a text-matching procedure to align audio-visual data from the LRS3 dataset with a machine translation corpus. For translation labels, matched samples are paired with the corresponding target sentences in the machine translation corpus. To ensure the highest accuracy, exact text matching is used for development and test set examples. For training set samples with no match, pseudo-translation labels are obtained from a machine translation model.
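
In sketch form, the exact-matching step could look like the following. The input structures here (a transcript dictionary and a list of source-target sentence pairs) are assumptions for illustration, not the authors' actual pipeline:

```python
# Illustrative sketch of exact-text matching between transcripts and an MT corpus.
# `lrs3_transcripts` and `mt_corpus` are hypothetical inputs, not real file formats.

def align_by_exact_match(lrs3_transcripts, mt_corpus):
    """lrs3_transcripts: {clip_id: transcript}; mt_corpus: list of (src, tgt) pairs."""
    # Index MT source sentences by normalized text for O(1) lookup.
    index = {src.strip().lower(): tgt for src, tgt in mt_corpus}
    matched, unmatched = {}, []
    for clip_id, transcript in lrs3_transcripts.items():
        key = transcript.strip().lower()
        if key in index:
            matched[clip_id] = index[key]   # human translation label
        else:
            unmatched.append(clip_id)       # receives a pseudo-label later
    return matched, unmatched
```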

The same visual content with its name in different languages, with the help of MuAViC

For non-English speakers, the authors repurpose existing speech translation datasets that come with audio-only data, transcriptions, and text translations. They retrieve the video tracks from the original recordings and align the processed video data with the audio data to create audio-visual data. Although all of the audio data is transcribed, only a portion of it is translated, so pseudo-translation labels are obtained with the same machine translation model as before.
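
The pseudo-labeling step could look roughly like the following sketch, which uses a public translation model from Hugging Face transformers as a stand-in; the paper's actual MT model may differ:

```python
# Sketch: filling in missing translation labels with a machine translation model.
# Helsinki-NLP/opus-mt-es-en is a public Spanish-to-English model used here as a
# stand-in for the authors' MT system.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-es-en")

def pseudo_label(transcripts: dict[str, str], translations: dict[str, str]) -> dict[str, str]:
    """Keep human translations where they exist; machine-translate the rest."""
    labels = dict(translations)
    missing = [cid for cid in transcripts if cid not in translations]
    for cid in missing:
        labels[cid] = translator(transcripts[cid])[0]["translation_text"]
    return labels
```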

MuAViC model training

They built end-to-end audio-visual speech recognition and audio-visual speech translation models using Meta's AV-HuBERT architecture. Given a matched pair of audio and video, the model can process both formats and combine their representations into a single space that can be used for either speech recognition or translation. If one of the modalities is absent, AV-HuBERT can still handle the remaining input, although with reduced accuracy.
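
Conceptually, fusing the two modalities while tolerating a missing one can be sketched as follows in PyTorch. This is a simplified illustration of the idea, not Meta's actual AV-HuBERT code, and the dimensions are arbitrary:

```python
import torch
import torch.nn as nn

class AVFusion(nn.Module):
    """Toy audio-visual fusion: encode each modality, concatenate, project.

    A missing modality is replaced by zeros, so the model can still run on
    audio-only or video-only input (as AV-HuBERT does, at reduced accuracy).
    """
    def __init__(self, audio_dim=104, video_dim=512, hidden=768):
        super().__init__()
        self.audio_enc = nn.Linear(audio_dim, hidden)
        self.video_enc = nn.Linear(video_dim, hidden)
        self.proj = nn.Linear(2 * hidden, hidden)

    def forward(self, audio=None, video=None):
        # Inputs are (batch, time, feat); either modality may be None.
        assert audio is not None or video is not None
        a = self.audio_enc(audio) if audio is not None else None
        v = self.video_enc(video) if video is not None else None
        if a is None:
            a = torch.zeros_like(v)
        if v is None:
            v = torch.zeros_like(a)
        return self.proj(torch.cat([a, v], dim=-1))  # shared representation

fusion = AVFusion()
audio_only = fusion(audio=torch.randn(2, 50, 104))  # video missing, still works
both = fusion(audio=torch.randn(2, 50, 104), video=torch.randn(2, 50, 512))
```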

The most important attribute of their model is its robustness to noise. If the audio becomes corrupted by noise or other factors, the model relies more heavily on the visual modality to complete the task correctly. They compared their models against a state-of-the-art baseline on speech recognition and X-En speech translation tasks in both noisy and clean conditions.
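
Noise robustness is typically tested by mixing noise into the audio at a chosen signal-to-noise ratio. A minimal NumPy sketch of that kind of setup is shown below; it illustrates the general technique, not the paper's exact evaluation protocol:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech at a target signal-to-noise ratio (in dB)."""
    noise = np.resize(noise, speech.shape)            # loop/trim noise to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale noise so that 10*log10(p_speech / p_scaled_noise) == snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# e.g., compare a model's WER on clean input vs. input corrupted at 0 dB SNR
noisy = mix_at_snr(np.random.randn(16000), np.random.randn(8000), snr_db=0.0)
```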

This chart compares MuAViC model performance on speech recognition tasks spanning nine different languages. Meta's AV-HuBERT model doesn't degrade significantly in noisy environments, while the current state-of-the-art model does.

Similarly, the performance of Meta's AV-HuBERT model does not significantly degrade compared with that of the state-of-the-art model on the X-En speech translation task spanning six different languages.

Concluding remarks on MuAViC

The authors present MuAViC, a multilingual audio-visual corpus with 1,200 hours of speech in nine languages. It is the largest open benchmark for multilingual audio-visual speech recognition and the first open benchmark for audio-visual speech-to-text translation. The baseline models use Meta's AV-HuBERT architecture, which can handle both audio and video; if one of the modalities is absent, AV-HuBERT can still handle the remaining input. The image below shows the raw data from the nine languages and their durations used in training the MuAViC models.

Raw data collection for MuAViC

References

https://github.com/facebookresearch/muavic

https://arxiv.org/abs/2303.00628

https://ai.meta.com/blog/muavic-audio-visual-speech-translation-benchmark/

