{"id":2637,"date":"2023-09-04T10:39:33","date_gmt":"2023-09-04T10:39:33","guid":{"rendered":"https:\/\/mlnews.dev\/?p=2637"},"modified":"2023-09-19T15:14:39","modified_gmt":"2023-09-19T15:14:39","slug":"audioldm2-generating-audios-with-self-supervision","status":"publish","type":"post","link":"https:\/\/mlnews.dev\/audioldm2-generating-audios-with-self-supervision\/","title":{"rendered":"AudioLDM2: Generating universal audios with self-supervised pretraining"},"content":{"rendered":"\n

This study introduces AudioLDM2, an innovative and adaptable framework that can generate any form of audio under flexible conditions, without domain-specific requirements. The AudioLDM2 research involves teams from CVSSP at the University of Surrey, Guildford, UK, and ByteDance.

The central concept is a new "language of audio" (LOA): a shared intermediate representation into which conditioning information such as text, speech, or images is translated before any audio is generated. This method makes it possible to convert human-understandable information into LOA and then generate audio representations conditioned on that LOA.
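To make the two-stage idea concrete, here is a minimal PyTorch sketch of stage 1: a small Transformer translates a conditioning embedding (from a text, speech, or image encoder) into a fixed-length sequence of LOA vectors, which a latent diffusion model would then consume in stage 2. All module names, sizes, and parameters here are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of the two-stage LOA idea (hypothetical module names,
# not the authors' real architecture).
import torch
import torch.nn as nn

class ConditionToLOA(nn.Module):
    """Stage 1: translate a conditioning embedding (text/speech/image)
    into a sequence of LOA vectors, here with a small Transformer."""
    def __init__(self, cond_dim=512, loa_dim=768, loa_len=8):
        super().__init__()
        self.proj = nn.Linear(cond_dim, loa_dim)
        # Learned query vectors, one per LOA slot.
        self.queries = nn.Parameter(torch.randn(loa_len, loa_dim))
        layer = nn.TransformerDecoderLayer(d_model=loa_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)

    def forward(self, cond):                      # cond: (B, T, cond_dim)
        memory = self.proj(cond)                  # (B, T, loa_dim)
        q = self.queries.expand(cond.size(0), -1, -1)
        return self.decoder(q, memory)            # (B, loa_len, loa_dim) = LOA

# Stage 2 (not shown): a latent diffusion model denoises audio latents
# conditioned on the LOA sequence, and a vocoder renders the waveform.
cond = torch.randn(2, 16, 512)                    # e.g. pooled text features
loa = ConditionToLOA()(cond)
print(loa.shape)                                  # torch.Size([2, 8, 768])
```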

Sound generation is the task of creating audio from particular conditions, such as text, phonemes, or visuals. Deep learning is frequently used to handle this problem, for example to generate recordings of speech, music, sound effects, and specific kinds of sounds such as footsteps or a violin.


AudioLDM2 Model

In past audio-related work, a different model was needed for each type of conversion: a system built to turn text into audio could not also turn an image into audio, so users had to switch to a separate model for image-to-audio conversion.

The researchers now propose AudioLDM2 to remove that friction. With AudioLDM2, text-to-audio, speech-to-audio, image-to-audio, and text-to-music all run under a single model, and it delivers more advanced features and more realistic results than previous models.

In the future, AudioLDM2 could be widely used in entertainment, animation, and audio production. Its realistic results, generated regardless of the type of description it is given, position the model for significant advances ahead.

What is AudioMAE in the AudioLDM2 Model?

The Audio Masked Autoencoder (AudioMAE) is a self-supervised pretraining framework for audio. AudioMAE is a strong choice of audio representation for generative tasks because it has been pre-trained on a wide variety of audio content and uses a generative, reconstructive pre-training scheme. For more information about AudioMAE and AudioLDM2, the public can visit the project's GitHub account, where the code and a detailed description of how the model works are available.

[Figure: AudioMAE in AudioLDM2]

AudioLDM2 results on different audio generation tasks

Text to Audio Generation

The text prompts were generated by ChatGPT, and the audio files were generated by AudioLDM2. Here are two examples of audio generated by AudioLDM2: one is the sound of a dog wagging its tail, and the other is the sound of wind chimes in a forest.

[Audio: "A dog tail-wagging happily."]
[Audio: "A forest of wind chimes singing soothing melodies in the breeze."]
[Figure: Text to audio conversion in AudioLDM2]
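For readers who want to try this themselves, AudioLDM2 is integrated into Hugging Face diffusers as AudioLDM2Pipeline. The sketch below assumes the cvssp/audioldm2 checkpoint and the call signature from the diffusers integration, so treat the exact names and defaults as assumptions rather than guarantees from this post.

```python
# Hedged sketch: text-to-audio with the AudioLDM2 pipeline in
# Hugging Face diffusers (checkpoint name and arguments assumed
# from the diffusers integration, not from this post).
import torch
import scipy.io.wavfile
from diffusers import AudioLDM2Pipeline

pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "A forest of wind chimes singing soothing melodies in the breeze."
audio = pipe(prompt, num_inference_steps=200, audio_length_in_s=10.0).audios[0]

# AudioLDM2 checkpoints in diffusers output 16 kHz waveforms.
scipy.io.wavfile.write("wind_chimes.wav", rate=16000, data=audio)
```

The same call should work for the music prompts discussed below, such as a trap beat or traditional fiddle playing; a music-tuned checkpoint (reportedly cvssp/audioldm2-music) can be swapped in for musical prompts.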

Text to Music Generation

These clips were also generated by AudioLDM2, and two examples are given below: a trap beat, for which AudioLDM2 produced the music, and a traditional fiddle performance. Both belong to the music category.