
AUDIOTOKEN: Powerful Text-Based Diffusion Models for Audio-to-Image Synthesis

Imagine a future where AI converts audio into stunning images. Dive into this model's journey, in which latent diffusion models are used to generate art from audio. Researchers from The Hebrew University of Jerusalem, Technion - Israel Institute of Technology, Tel-Aviv University, and NetApp are behind this work.

The evolution of diffusion models has led to enormous progress in image generation in recent years. These models excel at producing high-quality images, but they are typically guided by textual descriptions.

In this research, they present a unique method that uses latent diffusion models, originally designed to convert text into images, to generate images from audio recordings. Their method encodes the audio into a dedicated token using a pre-trained audio encoder. Consider this token a link between the audio and text representations. The beauty of this strategy is its simplicity: it requires only a small number of trainable parameters, which makes it particularly appealing for lightweight optimization.

Prior studies behind AUDIOTOKEN

Neural generative models have changed the way people consume digital content, from producing high-quality images and coherent long-form text to synthesizing speech and audio. Diffusion-based generative models have become the preferred approach in recent years, with promising results in a variety of applications.

During training, a diffusion model learns to map a known noise distribution to the target data distribution. At each step of the diffusion process, the model learns to predict the noise that was added, so that at generation time it can iteratively denoise a sample into a signal from the target distribution. Diffusion models work with various types of data representations, such as raw inputs and latent representations.
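To make the noise-prediction idea concrete, here is a minimal, self-contained sketch of one diffusion training step. The tiny denoiser and the simplified noise schedule are stand-ins for illustration only, not the U-Net and schedule used by real latent diffusion models.

```python
# Illustrative diffusion training step: add noise at a random timestep, predict that noise.
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Stand-in for the denoising network; predicts the noise added to x_t."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))

    def forward(self, x_t, t):
        # Condition on the (normalized) timestep by concatenating it to the input.
        return self.net(torch.cat([x_t, t], dim=-1))

def diffusion_training_step(model, x0, num_steps=1000):
    """Sample a timestep, noise the data, and train the model to predict the noise."""
    b, _ = x0.shape
    t = torch.randint(0, num_steps, (b, 1)).float() / num_steps
    noise = torch.randn_like(x0)
    # Toy linear schedule standing in for the real cumulative noise schedule.
    alpha_bar = 1.0 - t
    x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * noise
    pred = model(x_t, t)
    return nn.functional.mse_loss(pred, noise)

model = TinyDenoiser()
loss = diffusion_training_step(model, torch.randn(8, 64))
loss.backward()
```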

When considering controlled generative models, it is now common practice to condition the generation on a textual description of the desired output; this is particularly evident in image generation. Recently, numerous approaches for conditioning the generative process on other modalities have been proposed, including image-to-audio, image-to-speech, image-to-text, video-to-text, and audio-to-audio. The audio-to-image direction, however, has received far less attention from the community.

Introduction of AUDIOTOKEN

The task of audio-to-image generation is the focus of this work. Given an arbitrary audio sample, they seek to generate a high-quality image depicting the acoustic scene. They propose combining a pre-trained text-to-image generation model with a pre-trained audio representation model and learning a lightweight adaptation layer that maps between the latter's outputs and the former's inputs. In particular, they were inspired by recent work on textual inversion.

They propose training a dedicated audio token that maps the audio representation into an embedding vector, much like a textual inversion vector. This vector is then forwarded into the network as a new word embedding, so the generation is conditioned on the continuous audio representation.
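As a rough illustration of this idea, the sketch below projects an audio-derived embedding into the word-embedding space and splices it into the prompt embeddings as a pseudo-word. All dimensions, the placeholder prompt, and the projection layer are assumptions, not the paper's exact implementation.

```python
# Sketch: inject an audio-derived "audio token" into a text-conditioned diffusion model.
import torch
import torch.nn as nn

text_embed_dim = 768          # assumed width of the text encoder's token embeddings
audio_feature_dim = 2304      # assumed width of the pooled audio features

# Small trainable projector mapping audio features into the word-embedding space.
audio_to_token = nn.Linear(audio_feature_dim, text_embed_dim)

audio_features = torch.randn(1, audio_feature_dim)        # pooled audio representation (placeholder)
prompt_embeddings = torch.randn(1, 7, text_embed_dim)     # e.g. embeddings of "a photo of a <audio>"

audio_token = audio_to_token(audio_features).unsqueeze(1) # shape (1, 1, text_embed_dim)

# Replace the placeholder position (here: the last token) with the audio token, then feed
# the sequence to the frozen text-conditioning pathway as if it were an ordinary prompt.
conditioned = torch.cat([prompt_embeddings[:, :-1], audio_token], dim=1)
```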

Why use audio signals rather than text as the conditioning signal for image generation? Although text-based generative models can produce beautiful images, textual descriptions are frequently added manually because they are not naturally associated with the image. Video, by contrast, captures audio and frames of the same scene at the same time, so the two modalities are naturally coupled.

Figure: an example image generated by AUDIOTOKEN from an audio token.

Furthermore, audio signals can represent complicated situations and objects, such as distinct types of the same instrument (for example, classical guitar, electric guitar, etc.) or different settings for the same object (for example, a classical guitar recorded in the studio vs. at a live show). Annotating such fine-grained characteristics for many objects is time-consuming and thus difficult to scale.

In short, they propose AUDIOTOKEN, a novel method for audio-to-image generation that combines a pre-trained text-to-image diffusion model with a pre-trained audio encoder, and they propose a set of evaluation metrics dedicated specifically to the task of audio-to-image generation. Several experiments demonstrate that their technique can generate a high-quality and diverse set of images from audio scenes.

Future of AUDIOTOKEN

In entertainment and media, it has the potential to enable highly immersive virtual experiences where audio-driven images respond dynamically to music or narrative elements. In healthcare, this technology could help doctors with diagnostics and improve patient understanding by making it easier to visualize audio-based medical data.

The advertising and marketing industries could use it to create captivating, personalized multimedia campaigns that resonate with consumers. Furthermore, interactive instructional content that responds to auditory input could benefit the education sector by offering a more engaging and effective learning experience.

Related studies and research of AUDIOTOKEN

The research paper for AUDIOTOKEN is available on arXiv, and the implementation code, together with some examples, is available on GitHub. These resources are open source and freely accessible. Anyone who wants to learn about AUDIOTOKEN can visit these links to study further or build research on the topic.

Potential application of AUDIOTOKEN

The applications of audio-to-image generation technologies are numerous and transformational. In the creative arts, it could enable artists and designers to produce dynamic, audio-responsive digital art, while the gaming industry could employ it to generate immersive, audio-driven visual effects. Real-time audio-to-visual or audio-to-tactile conversion could also support accessibility initiatives by assisting people with hearing impairments.

Music visualization software could provide an appealing way for people to experience music visually, and content creators could use it to automatically enhance their videos. The technique could aid in reconstructing events from audio recordings in forensics, and businesses could use it for market research and for improving user engagement.

Adaptations of AUDIOTOKEN

AudioToken:

Audio signals carry information that can help in imagining the scene that generated them. This makes it attractive to generate a scene with a generative model conditioned on audio recordings. Models that generate high-quality images, on the other hand, frequently rely on large-scale text-image pairs and generate images from text. AUDIOTOKEN therefore offers a method that effectively projects audio signals into a textual space, allowing existing text-conditioned models to generate images based on audio-derived tokens.

Qualitative results for Wav2Clip (first row), ImageBind (second row), AUDIOTOKEN (third row), and the original reference images (last row).

Audio encoding:

The Embedder represents audio using a pre-trained audio classification network. Because the last layer of a discriminative network is typically tuned for classification, it tends to discard auditory information that is irrelevant to the discriminative task but useful for generation. They therefore concatenate earlier layers with the last hidden layer (specifically, the fourth, eighth, and twelfth layers out of twelve in total). The result is a temporal embedding of the audio.
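The sketch below illustrates the layer-concatenation idea with a stand-in 12-layer encoder. The real system uses a pre-trained audio classification network with frozen weights; the chosen layers (4, 8, and 12) follow the description above, while everything else is assumed.

```python
# Sketch: concatenate intermediate hidden states from a frozen audio encoder.
import torch
import torch.nn as nn

class StandInAudioEncoder(nn.Module):
    """Placeholder 12-layer encoder that returns all hidden states."""
    def __init__(self, dim=768, num_layers=12):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True) for _ in range(num_layers)]
        )

    def forward(self, x):
        hidden_states = []
        for layer in self.layers:
            x = layer(x)
            hidden_states.append(x)
        return hidden_states  # list of (batch, time, dim)

encoder = StandInAudioEncoder().eval()
with torch.no_grad():                      # the pre-trained audio network stays frozen
    frames = torch.randn(1, 500, 768)      # placeholder frame-level audio features
    states = encoder(frames)

# Concatenate the 4th, 8th, and 12th layers along the feature dimension,
# keeping the temporal axis intact.
audio_embedding = torch.cat([states[3], states[7], states[11]], dim=-1)  # (1, 500, 2304)
```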

Optimization:

During the optimization phase, only the weights of the linear and attentive-pooling layers within the Embedder network are updated; the generative network and the pre-trained audio network stay frozen. They use the loss function of the original latent diffusion model (L_LDM) to keep the training procedure consistent.

Furthermore, they add an extra loss term that supplements the original one. It involves encoding the video's class label and takes the label's length into account (for example, the 'acoustic guitar' label spans two tokens). The label is encoded using the textual encoder of the generative model, and its token dimension is reduced with average pooling.
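A minimal sketch of how such a combined objective could look, assuming an MSE form for the auxiliary term and an illustrative weighting factor; shapes and names are placeholders rather than the paper's exact formulation, and the role of the label's length is not modeled here.

```python
# Sketch: LDM noise-prediction loss plus an auxiliary label-embedding term.
import torch
import torch.nn.functional as F

def audiotoken_loss(pred_noise, true_noise, audio_token, label_token_embeddings, aux_weight=0.1):
    # Standard latent-diffusion objective: predict the noise added in latent space.
    ldm_loss = F.mse_loss(pred_noise, true_noise)

    # The class label is encoded elsewhere with the frozen text encoder; average-pool its
    # tokens (e.g. "acoustic guitar" spans two tokens) into a single target vector.
    label_embedding = label_token_embeddings.mean(dim=1)          # (batch, dim)

    # Auxiliary term: keep the learned audio token close to the pooled label embedding.
    aux_loss = F.mse_loss(audio_token, label_embedding)

    return ldm_loss + aux_weight * aux_loss

loss = audiotoken_loss(
    pred_noise=torch.randn(4, 4, 64, 64),
    true_noise=torch.randn(4, 4, 64, 64),
    audio_token=torch.randn(4, 768),
    label_token_embeddings=torch.randn(4, 2, 768),
)
```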

Results of the experimentation of AUDIOTOKEN

In their experiments, they examine the strategy from both objective and subjective perspectives. They begin by describing the experimental setup in detail, and then report results for their technique and the baselines using the proposed evaluation methodology.

Baselines:

Wav2Clip applies a CLIP-based loss to pairs of audio and images. Using VQ-GAN, this representation is then employed to generate an image that is strongly associated with the audio. ImageBind combines information from six modalities (text, image/video, audio, depth, thermal, and inertial measurement units (IMU)) into a single representation space. To produce images from audio samples, they use ImageBind's unified latent space together with stable-diffusion-2-1-unclip.

Data:

They employ the VGGSound dataset, an audio-visual collection derived from YouTube videos. The collection contains 200,000 videos, each lasting ten seconds, annotated with 309 classes.

Qualitative results of speaker generation for AUDIOTOKEN (first row), and reference images (second row).

Hyperparameters:

During training, they randomly crop five-second audio clips and choose the video frame with the highest CLIP score relative to the VGGSound label. They also remove frames for which the image and audio classifiers disagree on the category. They train the model for 60,000 steps on an Nvidia A6000 GPU with a learning rate of 8e-5 and a batch size of 8.
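For reference, the reported training setup can be summarized in a small configuration dictionary; only the values explicitly stated above are included, and the keys themselves are illustrative.

```python
# Reported AUDIOTOKEN training setup, collected into one place (keys are illustrative).
train_config = {
    "audio_clip_seconds": 5,        # random 5-second crops from each 10-second video
    "frame_selection": "highest CLIP score vs. the VGGSound label",
    "train_steps": 60_000,
    "learning_rate": 8e-5,
    "batch_size": 8,
    "gpu": "NVIDIA A6000",
}
```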

Objective evaluation:

Across all evaluation metrics, AUDIOTOKEN outperforms Wav2Clip and ImageBind. Interestingly, AUDIOTOKEN also leads on the AIS metric, which uses the Wav2Clip and ImageBind models themselves to compute the similarity score. This result reflects accurate identification of audio details (e.g., differentiating between different types of guitar) and awareness of multiple entities (e.g., several flying planes versus a single plane). As expected, using textual labels achieves higher accuracy and drives the model toward a representation that is more discriminative but less correlated with the target video.
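As a rough sketch of how an audio-image similarity (AIS) style score can be computed, the snippet below takes the cosine similarity between audio and image embeddings in a shared space. The embeddings here are random placeholders; the paper obtains them from pre-trained joint models such as Wav2Clip and ImageBind.

```python
# Sketch: cosine similarity between audio and image embeddings in a shared space.
import torch
import torch.nn.functional as F

def audio_image_similarity(audio_embedding: torch.Tensor, image_embedding: torch.Tensor) -> torch.Tensor:
    # Normalize both embeddings and return their cosine similarity in [-1, 1].
    a = F.normalize(audio_embedding, dim=-1)
    v = F.normalize(image_embedding, dim=-1)
    return (a * v).sum(dim=-1)

# Usage with placeholder embeddings of an assumed shared dimension:
score = audio_image_similarity(torch.randn(1, 512), torch.randn(1, 512))
```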

Subjective evaluation:

They compare AUDIOTOKEN to Wav2Clip and to Stable Diffusion (SD) conditioned on textual labels. They take 15 images at random from the test set and ask human annotators to rate their relevance to the textual labels on a scale of 1 to 5. For each evaluated image, they collect at least 17 annotations and compute the mean score and its standard deviation.

AUDIOTOKEN outperforms Wav2Clip (4.07 ± 0.83 vs. 1.85 ± 0.46). When compared to SD conditioned on text labels, AUDIOTOKEN achieves comparable performance but somewhat lower subjective scores (4.07 ± 0.83 vs. 4.58 ± 0.60). These findings are very encouraging, as they indicate that users found the images generated by AUDIOTOKEN to capture the important objects in the audio scene almost as well as images conditioned on textual labels, which serve as a topline.

Speaker image generation:

They also investigate the method's ability to generate images of specific speakers. To this end, they collected samples from two 30-minute videos per person featuring Barack Obama, Donald Trump, Emma Watson, and David Beckham, and extracted the audio representation with an x-vector speaker encoder (XVector). Their findings show that the method portrays Barack Obama and Donald Trump accurately, which they attribute to their distinctive voices. For Emma Watson and David Beckham, however, the model mainly captures the speaker's gender.
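A minimal sketch of the speaker variant under stated assumptions: a hypothetical x-vector extractor stands in for the pre-trained speaker encoder, and its output is projected into the word-embedding space just like the general audio embedding above.

```python
# Sketch: swap in a speaker embedding (x-vector style) as the audio representation.
import torch
import torch.nn as nn

xvector_dim = 512            # typical x-vector size (assumed)
text_embed_dim = 768         # assumed text-embedding width

speaker_to_token = nn.Linear(xvector_dim, text_embed_dim)

def extract_xvector(waveform: torch.Tensor) -> torch.Tensor:
    """Hypothetical stand-in for a pre-trained x-vector speaker-embedding extractor."""
    return torch.randn(waveform.shape[0], xvector_dim)

waveform = torch.randn(1, 16_000 * 5)                        # 5 s of 16 kHz audio (placeholder)
speaker_token = speaker_to_token(extract_xvector(waveform))  # fed to the diffusion model as before
```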

Conclusion

In this research, they offer a strategy for adapting text-conditioned generative models to audio-based conditioning. The approach generates high-quality images that depict the scene captured in an audio recording. Furthermore, they provide a comprehensive evaluation methodology that takes the semantics of the generated images into consideration. This work is a first step towards audio-conditioned image generation; the information hidden in audio is often more detailed than what is stated in a text caption. As a result, they believe this direction is important and deserves greater attention from the community.

References

https://arxiv.org/pdf/2305.13050.pdf

https://github.com/guyyariv/AudioToken

