MaskedSpeech: Masking Strategy for Context-aware Speech Generation

A game-changer has arrived in a world where computer-generated speech too often falls flat. Say goodbye to lifeless, robotic conversation and hello to the future of spoken language synthesis. MaskedSpeech doesn't just string words together; it understands each sentence's emotion, context, and depth. The model comes from research involving JD Technology Group.

MaskedSpeech is unique: it considers both the meaning and the sound of a sentence as well as the sentences around it. It accomplishes this by "masking" the current sentence's acoustic features and merging them with the acoustic and semantic features of the neighboring sentences, which allows it to interpret context and generate more natural-sounding speech.

Furthermore, it draws on several additional sources of information, such as the overall meaning of the surrounding dialogue, to make the speech more expressive. In experiments, MaskedSpeech outperformed prior systems in both naturalness and expressiveness.


Limitations of prior work

With the advancement of deep learning, current text-to-speech (TTS) systems can generate high-quality speech. Most speech synthesis systems, however, only consider the information in the sentence currently being synthesized and ignore the surrounding text and speech.

With rising expectations for the naturalness and expressiveness of TTS, these systems, although already applied to tasks such as audiobook synthesis, struggle with paragraph-level speech: human speech at the paragraph level requires appropriate naturalness and expressiveness, and the prosody must remain consistent with the paragraph's context.

Recent efforts have incorporated semantic features of the current sentence or of contextual sentences to improve the prosody of synthesized speech, particularly for paragraph-level speech generation. Using a pre-trained BERT model, Xiao et al. extracted semantic information from the current sentence and demonstrated that adding semantic features can improve the pronunciation of synthesized speech.

Introduction to MaskedSpeech

Some publications exploited acoustic features to improve prosody modeling, but most of them extracted sentence-level prosody representations and lost the fine-grained information contained in the contextual acoustic data. Furthermore, selecting appropriate reference audio is difficult for reference-based models.

In those approaches, an acoustic encoder learns fine-grained structure features rather than utterance-level information, and the authors claim that such models can ease the one-to-many mapping problem. However, they only consider speech from the current sentence, and when reference audio is unavailable during inference, the prosody features must be estimated from the text.

In this paper, the authors introduce MaskedSpeech, which uses FastSpeech2 as the network backbone and incorporates both contextual semantic and acoustic information to improve prosody generation for paragraph speech. Inspired by earlier speech-editing work, they concatenate the acoustic features of the previous and current sentences, mask out those of the current sentence, and feed the result to the decoder as additional input. Here, "semantic" refers to the meaning carried by the text of a sentence, while "acoustic" refers to the properties of the spoken audio itself.

With this input, the decoder can learn fine-grained prosody features from contextual speech, and no prosody prediction from text is necessary. Phonemes from the previous and current sentences are concatenated and passed to the model's phoneme encoder, which helps the model learn fine-grained semantic information from the previous sentence.
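To make the masking strategy concrete, here is a minimal sketch (not the authors' code) of how the inputs might be prepared, assuming mel-spectrograms as acoustic features and integer phoneme IDs; all function names and shapes are illustrative assumptions:

```python
# Sketch of the masking strategy: concatenate the mel frames of the previous
# and current sentences, zero out (mask) the current sentence's frames, and
# concatenate the phoneme sequences. Shapes and names are assumptions.
import torch

def build_masked_inputs(prev_mel, cur_mel, prev_phonemes, cur_phonemes):
    """prev_mel / cur_mel: (frames, n_mels) mel-spectrograms of the two sentences."""
    # Concatenate acoustic features along the time axis.
    concat_mel = torch.cat([prev_mel, cur_mel], dim=0)

    # Mask out the frames belonging to the current sentence; the decoder is
    # later trained to reconstruct exactly these masked frames.
    masked_mel = concat_mel.clone()
    masked_mel[prev_mel.size(0):] = 0.0

    # Concatenate phoneme IDs so the phoneme encoder sees both sentences.
    phoneme_ids = torch.tensor(prev_phonemes + cur_phonemes)
    return masked_mel, phoneme_ids

# Example with random data: a 120-frame previous sentence and an 80-frame
# current sentence, each with 80 mel bins.
masked_mel, phoneme_ids = build_masked_inputs(
    torch.randn(120, 80), torch.randn(80, 80),
    prev_phonemes=[5, 12, 7], cur_phonemes=[9, 3, 14, 2])
```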

In the demo below, the two sentences are separated by "**"; the first sentence is a natural recording, and the second sentence is synthesized by MaskedSpeech.

Translated Text: But since you have such wishful thinking and plan to gain a foothold in Yancheng. **Then maybe I have to teach you some of the rules of Yancheng!

[Audio samples on the demo page: FastSpeech2 generated audio, a random audio recording, and MaskedSpeech generated audio.]

A cross-utterance (CU) encoder is used to extract coarse-grained sentence-level semantic representations from sentences around the targeted sentence, allowing the proposed model to capture the semantic correlation between contextual sentences.
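As an illustration of what such a cross-utterance encoder could look like, the hedged sketch below embeds neighboring sentences with a pretrained BERT model through the Hugging Face transformers library; the model name, the [CLS] pooling choice, and the example sentences are assumptions rather than the paper's exact configuration.

```python
# Hypothetical cross-utterance (CU) encoder: embed each neighboring sentence
# with a pretrained text encoder and keep one sentence-level vector per
# sentence as coarse-grained semantic context.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed model
encoder = AutoModel.from_pretrained("bert-base-uncased")

def cross_utterance_embeddings(context_sentences):
    """Return one embedding per neighboring sentence, shape (n_sentences, hidden)."""
    inputs = tokenizer(context_sentences, padding=True, truncation=True,
                       return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (n, seq_len, hidden)
    # Take the [CLS] token as the sentence-level representation.
    return hidden[:, 0, :]

cu_features = cross_utterance_embeddings(
    ["But since you have such wishful thinking.",
     "And plan to gain a foothold in Yancheng."])
```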

The proposed MaskedSpeech is trained by reconstructing the masked acoustic features from the contextual text, contextual speech, and sentence-level semantic representations, allowing it to improve prosody modeling and naturally ease the one-to-many mapping issue.

Future of MaskedSpeech

MaskedSpeech's revolutionary approach to speech synthesis and its future potential are exciting. As the technology advances, we can expect this paradigm to grow, producing ever more emotionally resonant and contextually informed dialogue. Consider a world in which AI-powered virtual assistants, voice-controlled devices, and interactive software comprehend not only our words but also our emotions, responding with empathy and depth.

With further refinement and incorporation into many fields, MaskedSpeech has the potential to revolutionize human-computer interaction by making it more natural, engaging, and emotionally connected than ever before. The possibilities are many, ranging from improved customer service encounters to more immersive storytelling experiences, ushering in a new era of human-AI collaboration.

Research study of MaskedSpeech

All of the details of this study can be found in the ISCA Archive (isca-speech.org), where researchers publish their work for readers around the world.

Anyone can read it because it is openly accessible. Even better, it is open source, which means anyone may use and build on the code behind it. So, whether you're a developer or an artist, you can use MaskedSpeech right now to generate more natural-sounding speech.

Potential applications of MaskedSpeech

The features of MaskedSpeech open the door to a variety of new applications. Consider virtual assistants that not only respond to your commands but also understand and convey emotion in their responses, making everyday activities feel more like conversations with a helpful friend. Emotionally intelligent chatbots offering compassionate and effective communication could transform customer support and mental health counseling. Audiobook narration could become more vivid and engaging, improving the storytelling experience.

Furthermore, interactive video games and language learning platforms could give a more realistic and immersive experience for consumers. MaskedSpeech can help people with speech problems, increase content production, and improve medical training simulations. Its potential to improve in-car voice assistants, improve call center operations, and transform accessibility tools demonstrates the enormous impact this technology can have on a wide range of industries, ushering in an era of emotionally connected AI encounters.

Methods used in MaskedSpeech

The MaskedSpeech model:

The backbone network comprises a phoneme encoder, a variance adapter, a decoder, and a PostNet module. The concatenated sentences are used as the input to the phoneme encoder so that contextual features can improve the prosody of paragraph speech, and a CU text encoder collects CU semantic features from neighboring sentences. Furthermore, a masked mel-encoder extracts local prosody features from the contextual speech, from which the decoder can learn global prosody dependencies. By conditioning on contextual semantic and acoustic information, the proposed MaskedSpeech mitigates the one-to-many mapping problem and generates paragraph speech with better prosody that is more coherent and consistent with the contextual speech.
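The following is a highly simplified, hypothetical sketch of how these components might be wired together in PyTorch. The module internals are stand-ins, the length regulator is omitted (phoneme and frame sequence lengths are assumed equal), and none of the layer sizes are taken from the paper.

```python
# Stand-in composition of the described components: phoneme encoder,
# CU feature projection, variance adapter, masked mel-encoder, decoder,
# and PostNet. This is an illustrative sketch, not the published model.
import torch
import torch.nn as nn

class MaskedSpeechSketch(nn.Module):
    def __init__(self, n_phonemes=100, d_model=256, n_mels=80, cu_dim=768):
        super().__init__()
        self.phoneme_encoder = nn.Sequential(
            nn.Embedding(n_phonemes, d_model),
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
                num_layers=2))
        self.cu_projection = nn.Linear(cu_dim, d_model)       # CU semantic features
        self.variance_adapter = nn.Linear(d_model, d_model)   # stand-in
        self.masked_mel_encoder = nn.Linear(n_mels, d_model)  # stand-in
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.mel_out = nn.Linear(d_model, n_mels)
        self.postnet = nn.Conv1d(n_mels, n_mels, kernel_size=5, padding=2)

    def forward(self, phoneme_ids, masked_mel, cu_features):
        # Fine-grained text features from the concatenated phoneme sequence.
        text = self.phoneme_encoder(phoneme_ids)
        # Add coarse sentence-level semantic context to every position.
        text = text + self.cu_projection(cu_features).unsqueeze(1)
        hidden = self.variance_adapter(text)
        # Local prosody cues from the masked contextual mel-spectrogram.
        hidden = hidden + self.masked_mel_encoder(masked_mel)
        mel = self.mel_out(self.decoder(hidden))
        refined = mel + self.postnet(mel.transpose(1, 2)).transpose(1, 2)
        return mel, refined

# Toy forward pass: 50 phonemes, 50 "frames", one CU feature vector.
model = MaskedSpeechSketch()
mel, refined = model(torch.randint(0, 100, (1, 50)),
                     torch.zeros(1, 50, 80), torch.randn(1, 768))
```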

Learning from contextual text: The current sentence's prosody may be influenced by the relative sentence position, the discourse relations (DRs) between neighboring sentences, the emotion of the contextual sentences, and so on. The proposed model therefore uses both fine-grained contextual phonemes and coarse-grained sentence-level semantic features to improve prosody generation from contextual text.

Learning from contextual speech: Human speech within a paragraph is naturally coherent, and the prosody and emotion of an utterance may be influenced by the surrounding speech. Contextual speech typically offers rich, expressive, and detailed prosody information that can be used to improve the prosodic performance of the current sentence, particularly for paragraph speech generation.

Architectural diagram of MaskedSpeech

Training and inference strategies:

Training: During training, the model is trained to reconstruct the masked acoustic features and to predict the prosody attributes of the current sentence. In other words, the training objectives are computed only for the current sentence. Mean squared error (MSE) is used as the training objective for pitch, energy, and duration prediction, and mean absolute error (MAE) is used for mel-spectrogram reconstruction.
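A minimal sketch of these objectives, assuming per-frame mel targets, per-phoneme prosody targets, and boolean masks that select only the current sentence; all variable names are illustrative:

```python
# Training objectives as described: MAE (L1) for mel reconstruction and MSE
# for pitch, energy, and duration, restricted to the current sentence.
import torch
import torch.nn.functional as F

def masked_speech_loss(pred, target, frame_mask, phone_mask):
    """pred / target: dicts of tensors; frame_mask / phone_mask: boolean tensors
    marking the current sentence's frames and phonemes respectively."""
    mel_loss = F.l1_loss(pred["mel"][frame_mask], target["mel"][frame_mask])
    pitch_loss = F.mse_loss(pred["pitch"][phone_mask], target["pitch"][phone_mask])
    energy_loss = F.mse_loss(pred["energy"][phone_mask], target["energy"][phone_mask])
    duration_loss = F.mse_loss(pred["duration"][phone_mask], target["duration"][phone_mask])
    return mel_loss + pitch_loss + energy_loss + duration_loss
```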

Inference: In contrast to training, where ground-truth contextual text and speech that directly precede the current sentence are available, there is no ground-truth preceding speech during inference, although the contextual text is still provided for paragraph speech synthesis. To address this issue, they choose a text-audio pair at random from the training corpus and use it as the contextual speech and phonemes for the proposed model.

Because this random selection differs from the training strategy, an AB preference test was performed to compare system performance with ground-truth contextual speech against randomly selected contextual speech. The result revealed no significant difference between the two types of contextual speech.
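Conceptually, this inference-time workaround could look like the sketch below, where `training_corpus` and the synthesis call are hypothetical placeholders:

```python
# Draw a random (phonemes, mel) pair from the training corpus to serve as
# the contextual speech when no ground-truth preceding audio exists.
import random

def pick_random_context(training_corpus):
    """training_corpus: list of dicts with 'phonemes' and 'mel' entries (assumed)."""
    sample = random.choice(training_corpus)
    return sample["phonemes"], sample["mel"]

# Hypothetical usage:
# context_phonemes, context_mel = pick_random_context(training_corpus)
# audio = model.synthesize(current_phonemes, context_phonemes, context_mel)
```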

Comparison diagram of MaskedSpeech with other models

Case study of MaskedSpeech

The figure below shows the mel-spectrograms of speech synthesized by three different models alongside the natural recording. Each sub-figure depicts the concatenation of the mel-spectrograms of the previous and current sentences, with a distinct silence acting as the boundary. The first sentence is a natural recording, whereas the second sentence is the synthesized audio.

The speaking rate of the natural recording is substantially faster, and its fundamental frequency is lower. MaskedSpeech perceives this from the contextual semantic and acoustic information and generates speech with a lower pitch and a faster speaking rate. As a result, the prosodic transition between the consecutive sentences sounds seamless and natural.

A comparison between the mel-spectrograms of three models and natural recordings.

Final remarks about MaskedSpeech

This paper proposes a context-aware speech synthesis system with improved prosody generation that utilizes contextual semantic and acoustic information from neighboring sentences.

The proposed model learns to reconstruct the masked mel-spectrograms, using the concatenated and masked mel-spectrograms as augmented input. The phoneme encoder considers phonemes from the context sentences as well as phonemes from the current sentence, and a CU encoder is utilized to derive the cross-utterance semantic representation. The experimental findings demonstrated the effectiveness of the proposed MaskedSpeech model.

Reference

https://www.isca-speech.org/archive/pdfs/interspeech_2023/zhang23n_interspeech.pdf

