MLNews

HK-LegiCoST: Speech Translation of 600+ Hours of Cantonese Audio into English

Unleash the potential of speech translation and rethink the future of language understanding: a world where you go deep into speech recognition and translation, where every syllable carries weight, where words become bridges and comprehension has no boundaries.

Cihan Xiao, Henry Li Xinyuan, Jinyi Yang, Dongji Gao, Matthew Wiesner, Kevin Duh, and Sanjeev Khudanpur are the researchers behind the HK-LegiCoST study.

HK-LegiCoST (Hong Kong Legislative Council Speech Translation) is a new three-way parallel corpus of Cantonese-English translations. Cantonese is a variety of Chinese that originated in Guangzhou. The collection is remarkable in that it contains over 600 hours of Cantonese audio together with traditional Chinese transcripts and English translations. The data has been segmented and aligned at the sentence level, which means that each sentence in the audio is matched with its corresponding Chinese transcript and English translation.
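
To make the three-way structure concrete, here is a minimal sketch of what one sentence-aligned record conceptually looks like. The field names and values are illustrative assumptions, not the corpus's actual release format:

```python
# Illustrative only: field names and values are assumptions,
# not HK-LegiCoST's actual schema.
example = {
    "audio": "meeting_2019_04_03/seg_00042.wav",  # Cantonese speech segment
    "start": 1234.5, "end": 1239.8,               # offsets into the recording (seconds)
    "zh": "主席，我現在動議二讀這條條例草案。",      # traditional Chinese transcript
    "en": "President, I now move the second reading of this bill.",
}
```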

The HK-LegiCoST transcripts are non-verbatim: they exclude disfluencies and other unnecessary speech without editing or changing the meaning of what was said.

Prior work on Speech Translation

Previously, the most successful Speech Translation (ST) systems were built by cascading Automatic Speech Recognition (ASR) systems with Machine Translation (MT) systems. These require a special kind of training data: audio recordings paired with transcripts translated into the target language.
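
The cascade architecture can be sketched in a few lines. `asr_model` and `mt_model` below are hypothetical stand-ins for any concrete ASR and MT systems, not a specific API:

```python
def cascade_speech_translation(audio, asr_model, mt_model):
    """Cascade ST: transcribe the speech, then translate the transcript.

    `asr_model` and `mt_model` are assumed components with `transcribe`
    and `translate` methods. Note that errors made by the ASR stage
    propagate into the MT stage, a known weakness of cascades.
    """
    transcript = asr_model.transcribe(audio)  # speech -> source-language text
    return mt_model.translate(transcript)     # source text -> target-language text
```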

Although this work originally focused on well-resourced language pairs like Spanish-English and English-French, venues like the International Conference on Spoken Language Translation helped build it up. Then came a shift toward including more languages, even those with fewer resources available.

For example, the MuST-C dataset paired English speeches with transcripts translated into eight different languages. Recent projects such as CVSS and FLEURS expanded the number of languages involved and the amount of data available for research on this topic. FLEURS was one of the first notable datasets to cover Cantonese speech translation, although its Cantonese portion was not big enough to train competitive speech translation systems.

Because collecting, transcribing, and translating speech is expensive and challenging, projects like Europarl-ST and Multilingual TEDx have relied on publicly available audio and video data.

Speech translation with HK-LegiCoST

The HK-LegiCoST corpus is built mainly from the regular meetings of the Hong Kong Legislative Council (LegCo) from 2016 to 2021, which cover a range of topics such as government performance, education policy, housing, and many other issues. The council publishes the original video of each meeting together with its transcript and English translation in PDF format.

Applications such as automated video captioning and language-learning apps that rely on advanced speech translation (ST) are in high demand. The primary goal of ST is to convert speech in one language into text in another. The majority of research in this field has concentrated on languages with abundant resources and neglected less-resourced languages, or those with substantial differences between their spoken and written forms.

Cantonese is a prime example of such a language: its standard written form closely resembles written Mandarin, so its transcripts are often non-verbatim, and these differences create obstacles for automatic speech recognition (ASR) and ST systems.

Architectural diagram of HK-LegiCoST

In this research, the authors introduce HK-LegiCoST, a new corpus (a structured collection of texts) of Cantonese audio recordings with corresponding Chinese transcripts and English translations. The 600+ hours of recordings collected from the Hong Kong Legislative Council focus mainly on government policy-related inquiries and responses, and on discussions and debates over motions and resolutions.

To overcome the challenge of turning this raw material into a large and useful corpus for language research, they provide automatic speech recognition, machine translation, and speech translation baselines for the corpus.

They train speech translation models on this new corpus and fine-tune them on the FLEURS dataset, outperforming previous baselines. They believe this corpus will become an important resource for studying speech recognition and translation.

HK-LegiCoST future potential

The future potential of the HK-LegiCoST corpus is vast. As the technology advances, this resource will continue to fuel groundbreaking research in speech recognition and translation.

This will help not only Cantonese but also other languages facing similar challenges. The vast collection of dialectal speech will enable the development of more accurate automatic speech recognition systems, reduce the gap between the spoken and written forms of a language, empower the creation of new speech translation models, and unlock new avenues for cross-linguistic understanding and communication.

Access and research materials for HK-LegiCoST

The public has easy access to the HK-LegiCoST research and announcements. The research paper is available on the ISCA Archive, an open-access paper repository where anyone can obtain in-depth information about this new resource.

The code for model training and fine-tuning is available on GitHub. Anyone interested in studying the model in depth can follow the links in the references below, contribute to its development, and access the open-source materials.

Potential applications of HK-LegiCoST

The HK-LegiCoST corpus has the potential to be used in many fields. In speech recognition and translation, it can be used to train and improve automatic speech recognition systems, enabling better transcription of spontaneous, non-verbatim speech. Language learning and teaching can also benefit from its real-world spoken data.

In transcription services, the corpus can support accurately converting speech into written text, which is also useful in the media and entertainment industry for subtitles and captions.

HK-LegiCoST can also be used in government and policy analysis, for example to study political discourse and policy-making processes. It can likewise support accessibility services, helping people with hearing impairments access spoken-language content through accurate transcription.

HK-LegiCoST can also power voice assistants and chatbots, support cross-cultural communication by breaking down language barriers, and serve as a basis for educational resources and materials for studying different languages.

How the HK-LegiCoST corpus is created

The corpus is created in several stages, including data collection and sentence alignment, which are explained below.

Data collection for corpus creation

The raw data for the corpus was collected from video recordings of the Hong Kong Legislative Council's regular meetings, together with the corresponding transcripts and English translations in PDF format.

The raw recordings also contain visual information, such as lip movements and sign-language interpretation used alongside speech, so the researchers plan to release a future version of the corpus with this visual information. Text processing is used to filter out irrelevant information in the PDF transcripts and divide each full document into short parts. First, raw text is extracted from the transcript and translation files and split into paragraphs using Chinese speaker markers; this automatically divides the full document into labeled segments and allows for more efficient bitext and audio-text alignment.
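
The article does not spell out the extraction pipeline, but the paragraph-splitting idea can be sketched roughly as follows. The speaker-marker regex is an assumption, based on LegCo transcripts typically prefixing each turn with a name followed by a colon (e.g. "主席：", "President:"):

```python
import re

# Assumed marker: a short name followed by a full-width or ASCII colon
# at the start of a line, e.g. "主席：" ("President:").
SPEAKER_MARKER = re.compile(r"^(.{1,20}?)[：:]\s*", re.MULTILINE)

def split_by_speaker(raw_text: str):
    """Split PDF-extracted transcript text into speaker-labeled segments."""
    matches = list(SPEAKER_MARKER.finditer(raw_text))
    segments = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(raw_text)
        segments.append({"speaker": m.group(1),
                         "text": raw_text[m.end():end].strip()})
    return segments
```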

The first step is audio-text alignment using ASR. An ASR model with a Conformer architecture was trained on the Common Voice corpus, a large collection of voice recordings, so that it could relate spoken words to written ones. This matters because people sometimes write differently from how they speak, and Common Voice was chosen because its written transcripts are more likely to follow standard Chinese.

To simplify training, HK-LegiCoST breaks Chinese text into individual characters and treats each English word or number as a single token. The pipeline also converts Chinese characters into Jyutping, a system for writing Cantonese in the Roman alphabet. All of this is built with the k2 toolkit, which handles the finite-state processing involved.
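
The paper does not name a specific romanization library, but the described tokenization and Jyutping conversion can be sketched like this; pycantonese is one open-source library that maps Chinese characters to Jyutping, used here purely for illustration, and the tokenization regex is an assumption:

```python
import re
import pycantonese  # not named in the paper; one library that produces Jyutping

def tokenize(text: str):
    """Split Chinese into single characters while keeping English words
    and numbers as single tokens (regex details are assumptions)."""
    return re.findall(r"[A-Za-z]+|[0-9]+|\S", text)

print(tokenize("GDP增長3%"))  # -> ['GDP', '增', '長', '3', '%']

# Map characters to Jyutping romanization; output shown is approximate.
print(pycantonese.characters_to_jyutping("廣東話"))
# e.g. [('廣東話', 'gwong2dung1waa2')]
```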

In the first-pass alignment process, audio segments are roughly matched with the corresponding parts of the transcripts using an anchor-based method. HK-LegiCoST uses Silero-VAD, a voice activity detection tool that identifies when speech is present in the audio, to extract speech regions from the segments for further processing.
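
Silero-VAD is openly available via torch.hub; a minimal usage sketch following the repository's documented interface (the file name is assumed) looks like this:

```python
import torch

# Load the pretrained Silero-VAD model and its helper utilities.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

# "meeting.wav" is a hypothetical file name for one council recording.
wav = read_audio("meeting.wav", sampling_rate=16000)

# Each entry marks the start/end (in samples) of a detected speech region.
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
print(speech_timestamps[:3])
```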

A sample G graph for flexible alignment of HK-LegiCoST

Sentence-level alignment between the audio and the transcript faced two key challenges. First, the transcripts contain extra text that was never spoken during the meeting, which complicates alignment. Second, longer audio segments, some lasting over 10 minutes, were difficult to process due to memory limitations. To handle these issues, a flexible alignment algorithm with a sliding-window approach was implemented.

This algorithm divides long segments into smaller, more manageable parts while filtering out unspoken text from the script. Flexible alignment decodes against a linear finite-state acceptor (FSA) and includes a feature that allows the decoder to skip sentences containing irrelevant text and focus on the text that was actually spoken.
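
The decoding-graph idea can be sketched with the k2 toolkit's textual FSA format. The token IDs, the epsilon/skip convention, and the penalty value below are illustrative assumptions, not the authors' exact construction:

```python
import k2

def flexible_alignment_graph(sentences, skip_penalty=-5.0):
    """Build a linear FSA over the reference tokens, adding one extra
    arc per sentence so the decoder may bypass sentences that were
    never actually spoken (e.g. unspoken text in the transcript).

    `sentences` is a list of token-ID lists. Arcs use k2's textual
    format "src dst label score"; label 0 plays the role of epsilon
    and label -1 marks the arc into the final state.
    """
    arcs, state = [], 0
    for sent in sentences:
        start, end = state, state + len(sent)
        arcs.append(f"{start} {end} 0 {skip_penalty}")  # skip the whole sentence
        for tok in sent:
            arcs.append(f"{state} {state + 1} {tok} 0.0")
            state += 1
    arcs.append(f"{state} {state + 1} -1 0.0")  # arc to the final state
    arcs.append(f"{state + 1}")                 # final state line
    return k2.Fsa.from_str("\n".join(arcs))

# Two hypothetical sentences given as token-ID sequences:
graph = flexible_alignment_graph([[12, 7, 42], [9, 30]])
```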

In the post-filtering step, the ASR decoder was used to assess alignment quality by measuring parameters such as the character error rate (CER), the number of consecutive errors, and the error ratio. To ensure the results are well aligned, the authors established thresholds based on CER, creating bins that categorize alignment quality. Then 300 random utterances were sampled from the corpus based on these bins and manually labeled to fine-tune the threshold. This process optimizes the retained subset and filters out segments where the speech deviates from the script due to repetitions or other disfluencies.
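
CER is the standard edit-distance metric over characters; a minimal self-contained sketch of the filtering idea (the 10% threshold here is illustrative, not the paper's tuned value) might look like this:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via the classic DP recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

# Keep an alignment only if the ASR decode stays close to the script.
score = cer("主席我現在動議", "主席我而家動議")
keep = score <= 0.10  # illustrative threshold
```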

The perplexity of the dev and test sets is computed based on a 3-gram language model trained on the training split.

Final remarks about HK-LegiCoST

The HK-LegiCoST corpus was designed for the study of Cantonese speech recognition and translation. The original data come from publicly available Hong Kong Legislative Council meeting recordings and transcripts. With 518 hours of Cantonese speech and 142k sentence pairs, this dataset is one of the largest for Cantonese ASR and Cantonese-English speech translation.

They present some notable characteristics of their data, such as text reordering and reliance on extra context. Their baseline experiments validate the corpus and show that it is useful for ASR and ST research. Furthermore, their approach successfully solves some of the hardest challenges in building a corpus from scratch, namely segmenting and aligning long recordings with non-verbatim transcripts.

References

https://www.isca-speech.org/archive/pdfs/interspeech_2023/xiao23d_interspeech.pdf

https://github.com/BorrisonXiao/icefall/blob/master/egs/hklegco/ASR/train.sh

