MLNews

Convert any Text to Speech using Acoustic Units

Imagine a world where languages blend seamlessly, where the words you write can be spoken with emotion and authenticity in any language. This vision is brought closer by a revolutionary new system that converts text directly into speech across many different languages, and does so with emotion. The work was carried out using HPC resources from GENCI-IDRIS.

This study proposes a method that directly converts text in one language into speech in another. Instead of relying on written text in the target language, the system generates target-language speech from text in several source languages. It uses discrete acoustic units as an intermediate representation: given the source text, the model predicts the same discrete units that would have been obtained had target-language speech been the input.

The authors were motivated by the success of discrete acoustic units in earlier direct speech-to-speech translation systems. To obtain these units, their system first applies a speech encoder and a clustering step. A sequence-to-sequence model is then trained to predict the unit sequence, and finally a vocoder converts the units back into speech, as the sketch below illustrates.
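The three-stage structure is easy to see in code. Below is a schematic Python sketch of the pipeline; the function bodies are dummy stand-ins rather than the authors' models, kept trivial only so the example runs end to end.

```python
# Schematic sketch of the three-stage pipeline. The bodies are dummy
# stand-ins for the real models (speech encoder + quantizer, text-to-unit
# translator, unit vocoder).

def speech_to_units(waveform: list) -> list:
    """Stage 1 (training targets only): a self-supervised speech encoder
    plus a clustering step maps each frame to a discrete unit ID."""
    return [int(abs(x) * 100) % 100 for x in waveform]

def text_to_units(text: str) -> list:
    """Stage 2: a sequence-to-sequence model predicts the same kind of
    unit sequence directly from source-language text."""
    return [ord(c) % 100 for c in text]

def units_to_speech(units: list) -> list:
    """Stage 3: a unit-based vocoder renders discrete units as audio."""
    return [u / 100.0 for u in units]

# Inference never needs target-language text: text -> units -> speech.
waveform = units_to_speech(text_to_units("hello world"))
print(f"synthesized {len(waveform)} samples")
```

The key design choice is that stages 1 and 2 share the same unit vocabulary, so the text-to-unit model can be trained against units extracted from speech.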

Text to speech model

Works related to the Text to speech model

Over the last few years, the massive growth in unlabeled text and audio data available across the world's languages has prompted the development of powerful new ways to process it. Furthermore, recent advances in self-supervised learning make it possible to exploit this data to build general-purpose models.

These models can be applied to a variety of tasks and languages, such as speech processing with XLS-R or text processing with mBART and mT5. Several recent efforts have also focused on multilingual and multimodal systems, such as mSLAM and SAMU-XLSR. These systems aim to reduce communication barriers between people who speak and write different languages, particularly low-resource ones.

Previous research has achieved state-of-the-art performance on a wide range of text and speech downstream tasks, including machine translation, particularly text-to-text and speech-to-text translation. These tasks translate speech or text produced in one language into another language. Traditional speech-to-speech translation systems use a cascade: Automatic Speech Recognition (ASR) first converts speech to text, which is then translated by a text-to-text machine translation model (or replaced by a direct speech-to-text system) and finally synthesized into speech.


Text to speech system using acoustic units

In addition to direct speech-to-speech systems, earlier studies have included a text-to-speech translation component built on discrete acoustic units. Those studies, however, apply text-to-unit translation to the output of an ASR system, so the method is not a direct text-to-speech system: it does not start from an original text input. Moreover, because the ASR output serves as input, recognition errors can propagate and degrade the performance of the whole pipeline.

Unlike prior research, which only used text-to-unit translation on top of ASR, this study builds a framework for generating speech in one language from text input in a different language, so the work can properly be characterized as a direct text-to-speech translation task. As described above, text is the source input and discrete acoustic units are the intermediate representation, identical to the units that speech input would produce.

This framework could benefit a variety of real-world applications. Text-to-speech translation could, for example, be used to augment data for low-resource languages or to create audio recordings of written content, such as podcasts or storytelling services.

Text: “She also defended the lord chancellor’s existing powers.”

Speech: [audio sample in the original post]

Furthermore, the authors investigated the impact of using two pre-trained models covering a wide range of languages as encoder-decoder initializations when fine-tuning direct text-to-speech systems on a new corpus, the Common Voice-based Speech-to-Speech (CVSS) translation dataset.

This CVSS dataset was recently released to address the shortage of end-to-end labeled data for direct speech-to-speech and text-to-speech translation. Earlier comparable research had mostly been limited to high-resource languages, covering around 10 languages in total.

With this new dataset, by contrast, the text-to-speech translation task has been evaluated on more than 20 input languages.

Future of Text to speech Model

The direct Text to speech translation technology based on acoustic units could change global communication in the not-too-distant future. As it matures, it has the potential to smoothly bridge cultures and languages. Imagine a world where language boundaries are a thing of the past, where people from all over the world can communicate fluently, each in their own tongue.

With further development, this system might become a vital tool for international diplomacy, trade, and education, making cross-cultural understanding and collaboration as simple as typing a message. It could give people the chance to express themselves more authentically, allowing their true voices to be heard around the world.

Furthermore, the inclusion of emotional aspects in the direct text-to-speech translation process opens fresh possibilities for entertainment and storytelling. We may see the emergence of emotionally expressive AI narrators capable of bringing novels, scripts, and interactive experiences to life in ways never imagined possible.

These narrators will not only deliver the words on the page but will also fill them with the author’s intended emotional depth and passion. This breakthrough has the potential to transform the entertainment business by generating immersive audiobook experiences, realistic virtual performers, and interactive storylines that respond to the emotions of the audience.

Research material of Text to speech model

Anyone interested in learning more about this text-to-speech work can consult the paper on arXiv at any time. The research report contains the material used in the experiments and demonstrations, and it is freely accessible from any location.

Potential applications of Text to speech model

Access to Multilingual Education: This technology could transform the way languages are taught and learned. Students all over the world could benefit from high-quality language education, with the system providing realistic pronunciation and emotion in real time.

Accessible Information for the Visually Impaired: The system could make information far more accessible to blind and visually impaired users. Text from books, journals, or websites could be converted into natural-sounding speech, letting them independently access a much wider range of written material.

Human-Like Artificial Secretaries: Businesses could use this tool to deliver personalized, empathetic customer interactions, letting clients and AI-powered service representatives communicate more effectively and pleasantly.

Cultural Preservation: This system could help preserve local and endangered languages. Linguists and community advocates could use it to turn written documents into spoken form, helping document and sustain linguistic heritage.

Proposed Methods for Text to speech model

Direct Speech-to-Speech Translation:

The system combines two separate blocks. First, a multilingual Hidden-Unit BERT (mHuBERT) extracts representations from the target speech, which are then discretized by a quantizer model. mHuBERT was chosen as the speech encoder because it outperformed other unsupervised models on a variety of speech tasks. With this method for extracting discrete units, an encoder-decoder speech-to-unit translation model can be trained using the units as the target sequence. Once the model is trained, the target speech is generated from the discrete units in a subsequent step.
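As an illustration of the unit-extraction step, here is a minimal sketch. It assumes an English HuBERT checkpoint from the Hugging Face hub as a stand-in for the paper's multilingual mHuBERT, and fits a fresh k-means model in place of the authors' pretrained quantizer; the audio path is a placeholder.

```python
import torch
import torchaudio
from sklearn.cluster import KMeans
from transformers import HubertModel

# English HuBERT as a stand-in for the paper's multilingual mHuBERT.
encoder = HubertModel.from_pretrained("facebook/hubert-base-ls960")
encoder.eval()

# Placeholder path; HuBERT expects 16 kHz mono audio.
waveform, sr = torchaudio.load("target_speech.wav")
with torch.no_grad():
    out = encoder(waveform, output_hidden_states=True)
    frames = out.hidden_states[6].squeeze(0)  # mid-layer frame features

# Discretize frame features into acoustic units. The authors rely on a
# pretrained quantizer; fitting k-means on one utterance is only
# illustrative, as is the cluster count.
kmeans = KMeans(n_clusters=100, n_init=10).fit(frames.numpy())
units = kmeans.predict(frames.numpy())  # one unit ID per ~20 ms frame
print(units[:20])
```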

Direct Text to speech translation:

The direct text-to-speech translation system uses an encoder-decoder design. Because converting text inputs into acoustic units is essentially a machine translation task, the authors used a pre-trained text model to initialize the encoder-decoder architecture. Specifically, they investigated the multilingual BART (mBART) model in its two variants, mBART25 and mBART50; the main difference between the two is the number of languages used during pre-training. After initialization, the complete architecture is fine-tuned on the text-to-acoustic-unit translation task, as sketched below. The units used as targets for this training were previously extracted with the acoustic unit discovery system described above.
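A minimal sketch of this fine-tuning setup, assuming the Hugging Face mBART50 checkpoint and treating each acoustic unit as an extra vocabulary token (our assumption about the interface; the language codes, unit count, and training pair are illustrative):

```python
from transformers import MBart50Tokenizer, MBartForConditionalGeneration

tokenizer = MBart50Tokenizer.from_pretrained(
    "facebook/mbart-large-50", src_lang="fr_XX", tgt_lang="en_XX")
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")

# Extend the vocabulary with one token per discrete acoustic unit
# (1000 clusters here is an assumption, not the paper's exact number).
tokenizer.add_tokens([f"<unit_{i}>" for i in range(1000)])
model.resize_token_embeddings(len(tokenizer))

# One illustrative pair: source-language text -> unit sequence extracted
# from the corresponding target-language utterance. tgt_lang above is a
# placeholder, since the unit sequence itself is language-neutral.
src = tokenizer("Elle a aussi défendu les pouvoirs existants du chancelier.",
                return_tensors="pt")
labels = tokenizer(text_target="<unit_17> <unit_842> <unit_3> <unit_590>",
                   return_tensors="pt").input_ids

loss = model(**src, labels=labels).loss  # plug into your optimizer loop
loss.backward()
print(float(loss))
```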

Finally, the HiFi-GAN unit-to-speech vocoder generates the target speech from the predicted units. This unit-based vocoder is a modified version of the HiFi-GAN neural vocoder, and the authors used the pre-trained English vocoder available on GitHub. A usage sketch follows.
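The sketch below assumes fairseq's CodeHiFiGANVocoder interface, following the usage pattern from fairseq's speech-to-speech examples; the checkpoint path, config path, and unit values are placeholders.

```python
import json
import torch
import soundfile as sf
from fairseq.models.text_to_speech.vocoder import CodeHiFiGANVocoder

# Placeholder paths: download the pretrained unit vocoder and its config.
with open("vocoder_config.json") as f:
    cfg = json.load(f)
vocoder = CodeHiFiGANVocoder("vocoder_checkpoint.pt", cfg)

# Units predicted by the text-to-unit model (values are illustrative).
units = torch.LongTensor([17, 842, 3, 590]).view(1, -1)
# dur_prediction=True assumes consecutive duplicate units were collapsed.
wav = vocoder({"code": units}, dur_prediction=True)
sf.write("output.wav", wav.detach().cpu().numpy(), 16000)
```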

Text to speech model pipeline

Results of the Text to speech model on the CVSS test set

The authors experimented with several models to initialize the encoder-decoder architecture of the direct text-to-speech translation system, including mBART25 and mBART50. As a reference, they built a cascade system comprising an mBART50 machine translation module and a Tacotron 2 speech synthesis module (a hedged sketch of such a cascade appears below). They evaluated the approaches on the CVSS dataset across high-, medium-, and low-resource languages.
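For comparison, here is a hedged sketch of such a cascade, with the public Hugging Face mBART50 translation checkpoint and torchaudio's off-the-shelf Tacotron 2 pipeline standing in for the authors' exact modules:

```python
import torch
import torchaudio
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

# Stage 1: text-to-text translation (French -> English, as an example).
tok = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt", src_lang="fr_XX")
mt = MBartForConditionalGeneration.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt")
out = mt.generate(**tok("Bonjour tout le monde.", return_tensors="pt"),
                  forced_bos_token_id=tok.lang_code_to_id["en_XX"])
english = tok.batch_decode(out, skip_special_tokens=True)[0]

# Stage 2: speech synthesis of the translated text.
bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH
processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2()
vocoder = bundle.get_vocoder()
with torch.inference_mode():
    tokens, lengths = processor(english)
    spec, spec_lengths, _ = tacotron2.infer(tokens, lengths)
    waveforms, _ = vocoder(spec, spec_lengths)
torchaudio.save("cascade_output.wav", waveforms[0:1].cpu(), vocoder.sample_rate)
```

Note that, unlike the direct system, this cascade needs target-language text at every intermediate step.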

Their direct text-to-speech system performed similarly to the cascade system, particularly when initialized with mBART50. Notably, unlike the cascade method, their technique does not require knowledge of the transcription of the target language. When comparing the two mBART models, mBART50 consistently outperformed mBART25 across multiple languages, particularly those not included in mBART25.

There was a 40% relative improvement in BLEU score for languages covered in both models' pre-training, a 501% improvement for languages included only in mBART50, and a 136% improvement for languages in neither model. This highlights the value of using a more diverse pre-trained multilingual model such as mBART50.
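For clarity, these relative improvements follow the usual formula; a tiny sketch with made-up scores, not the paper's raw numbers:

```python
# Relative improvement: (new - old) / old * 100.
def relative_improvement(baseline_bleu: float, new_bleu: float) -> float:
    return (new_bleu - baseline_bleu) / baseline_bleu * 100.0

print(f"{relative_improvement(10.0, 14.0):.0f}%")  # -> 40% relative improvement
```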

BLEU results on the CVSS test partition for each language available in the Text to speech model

Final remarks about Text to speech model

In this study, the authors developed a novel method for direct text-to-speech translation, based on an encoder-decoder architecture that takes text as input and outputs discrete acoustic units. As a result, multilingual text-to-speech translation can be performed without explicit knowledge of the target language's text transcription. The system could be used for many purposes, including creating audiobooks from texts in many languages, and the proposed framework could generate augmented data to enlarge datasets for low-resource languages.

The proposal was evaluated on the new CVSS dataset, confirming the strong performance of this approach to speech generation. The experiments also showed better performance when the model used to initialize the encoder-decoder architecture had been pre-trained on more of the languages present in the CVSS dataset. This finding suggests that cross-lingual transfer may significantly help low-resource languages in the text-to-speech translation task.

References

https://arxiv.org/pdf/2309.07478v1.pdf

