{"id":2637,"date":"2023-09-04T10:39:33","date_gmt":"2023-09-04T10:39:33","guid":{"rendered":"https:\/\/mlnews.dev\/?p=2637"},"modified":"2023-09-19T15:14:39","modified_gmt":"2023-09-19T15:14:39","slug":"audioldm2-generating-audios-with-self-supervision","status":"publish","type":"post","link":"https:\/\/mlnews.dev\/audioldm2-generating-audios-with-self-supervision\/","title":{"rendered":"AudioLDM2: Generating universal audios with self-supervised pretraining"},"content":{"rendered":"\n

This study introduces AudioLDM2, an innovative and adaptable framework that can generate any form of audio under flexible conditions, without domain-specific requirements. The AudioLDM2 research involves teams from CVSSP at the University of Surrey, Guildford, UK, and ByteDance.

The central concept is a new "language of audio" (LOA): a shared intermediate representation into which conditioning information such as text, speech, or images is translated before any audio is generated. This method makes it possible to convert human-understandable information into LOA and then generate audio representations conditioned on that LOA.
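To make the two-stage idea concrete, here is a minimal PyTorch sketch of stage 1: a small Transformer translates a conditioning embedding (from a text, speech, or image encoder) into a fixed-length sequence of LOA vectors, which a latent diffusion model would then consume in stage 2. All module names, sizes, and parameters here are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of the two-stage LOA idea (hypothetical module names,
# not the authors' real architecture).
import torch
import torch.nn as nn

class ConditionToLOA(nn.Module):
    """Stage 1: translate a conditioning embedding (text/speech/image)
    into a sequence of LOA vectors, here with a small Transformer."""
    def __init__(self, cond_dim=512, loa_dim=768, loa_len=8):
        super().__init__()
        self.proj = nn.Linear(cond_dim, loa_dim)
        # Learned query vectors, one per LOA slot.
        self.queries = nn.Parameter(torch.randn(loa_len, loa_dim))
        layer = nn.TransformerDecoderLayer(d_model=loa_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)

    def forward(self, cond):                      # cond: (B, T, cond_dim)
        memory = self.proj(cond)                  # (B, T, loa_dim)
        q = self.queries.expand(cond.size(0), -1, -1)
        return self.decoder(q, memory)            # (B, loa_len, loa_dim) = LOA

# Stage 2 (not shown): a latent diffusion model denoises audio latents
# conditioned on the LOA sequence, and a vocoder renders the waveform.
cond = torch.randn(2, 16, 512)                    # e.g. pooled text features
loa = ConditionToLOA()(cond)
print(loa.shape)                                  # torch.Size([2, 8, 768])
```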

Sound generation is the task of creating audio from particular conditions, such as text, phonemes, or visuals. Deep learning is frequently used to handle this problem, for example to generate recordings of speech, music, sound effects, and specific kinds of sounds such as footsteps or a violin.


AudioLDM2 Model

In past audio-related work, a different model was needed for each type of conversion: a system built to turn text into audio could not also turn an image into audio, so users had to switch to a separate model for image-to-audio conversion.

The researchers now propose AudioLDM2 to remove that friction. With AudioLDM2, text-to-audio, speech-to-audio, image-to-audio, and text-to-music all run under a single model, and it delivers more advanced features and more realistic results than previous models.

In the future, AudioLDM2 could be widely used in entertainment, animation, and audio production. Its realistic results, generated regardless of the type of description it is given, position the model for significant advances ahead.

What is AudioMAE in the AudioLDM2 Model?

The Audio Masked Autoencoder (AudioMAE) is a self-supervised pretraining framework for audio. AudioMAE is a strong choice of audio representation for generative tasks because it has been pre-trained on a wide variety of audio content and uses a generative, reconstructive pre-training scheme. For more information about AudioMAE and AudioLDM2, the public can visit the project's GitHub account, where the code and a detailed description of how the model works are available.

[Figure: AudioMAE in AudioLDM2]

AudioLDM2 results on different audio generation tasks

Text to Audio Generation

The text prompts were generated by ChatGPT, and the audio files were generated by AudioLDM2. Here are two examples of audio generated by AudioLDM2: one is the sound of a dog wagging its tail, and the other is the sound of wind chimes in a forest.

[Audio: "A dog tail-wagging happily."]
[Audio: "A forest of wind chimes singing soothing melodies in the breeze."]
[Figure: Text to audio conversion in AudioLDM2]
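For readers who want to try this themselves, AudioLDM2 is integrated into Hugging Face diffusers as AudioLDM2Pipeline. The sketch below assumes the cvssp/audioldm2 checkpoint and the call signature from the diffusers integration, so treat the exact names and defaults as assumptions rather than guarantees from this post.

```python
# Hedged sketch: text-to-audio with the AudioLDM2 pipeline in
# Hugging Face diffusers (checkpoint name and arguments assumed
# from the diffusers integration, not from this post).
import torch
import scipy.io.wavfile
from diffusers import AudioLDM2Pipeline

pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "A forest of wind chimes singing soothing melodies in the breeze."
audio = pipe(prompt, num_inference_steps=200, audio_length_in_s=10.0).audios[0]

# AudioLDM2 checkpoints in diffusers output 16 kHz waveforms.
scipy.io.wavfile.write("wind_chimes.wav", rate=16000, data=audio)
```

The same call should work for the music prompts discussed below, such as a trap beat or traditional fiddle playing; a music-tuned checkpoint (reportedly cvssp/audioldm2-music) can be swapped in for musical prompts.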

Text to Music Generation

These clips were also generated by AudioLDM2, and two examples are given below: a trap beat, for which AudioLDM2 produced the music, and a traditional fiddle performance. Both belong to the music category.