MLNews

GRAVO: Generate Relevant Audio from Visual Features with Noisy Online Videos

GRAVO, a new video-to-audio generation model, is here! Imagine a world where every frame of a video adds to the emotion and realism of the sound. Previous attempts fell short because they ignored crucial visual cues; GRAVO, on the other hand, is a game changer. The researchers behind this model are from the Gwangju Institute of Science and Technology (Korea), Nankai University (China), and Microsoft (China).

GRAVO (Generate Relevant Audio from Visual Features with Noisy Online Videos) is a new approach for generating audio from video. Previous techniques had the drawback of conditioning only on the current image and the preceding audio tokens when generating sound, neglecting the other images in the video that could be informative.

GRAVO addresses this issue with multi-head attention (MHA), which allows it to attend to different portions of the video and integrate that information while generating sound. This makes the resulting audio more accurate and realistic.

GRAVO also employs two additional techniques to ensure that the MHA behaves as intended: one loss matches the output of the MHA to the target audio, while another preserves the original visual information.

Input video: A teddy bear painting a portrait.

GRAVO output: [audio demo]

Limitations of prior work

Deep generative models, which apply large-scale data and complex neural networks to tasks such as producing realistic pictures from text descriptions, have become popular in both academia and industry.

Previous video-to-audio generation methods use a hierarchical auto-regressive language model to generate a sequence of audio tokens that can be decoded into a waveform given a video. The audio is generated using only the previous audio token and the current image, ignoring any surrounding images that may contain useful information.

Im2Wav, the most recent technique, makes use of a vast number of image-audio pairs from online videos. Im2Wav is a Transformer-based audio language model that takes as input image representations from a pre-trained CLIP model.

Im2Wav generates audio tokens based on prior acoustic tokens and temporally aligned visual features, which are then translated into waveforms using the matching audio decoder from the pre-trained VQ-VAE model. Because images and audio in online videos are often mismatched, relying only on the temporally aligned image representation as input can lead to inaccurate results during generation.

The prior Im2Wav model on which GRAVO builds

Understanding GRAVO

They present the GRAVO (Generate Relevant Audio from Visual Features with Noisy Online Videos) model, which places multi-head attention (MHA) on top of the visual features extracted with the CLIP model. The MHA mechanism proposed in GRAVO allows each frame in the image sequence to access information across the whole sequence, letting the model learn inter-frame relationships and deliver more precise conditioning for audio synthesis.
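
Below is a minimal sketch of what frame-level attention over CLIP features might look like, using PyTorch’s nn.MultiheadAttention. The dimensions, number of heads, and residual structure are illustrative assumptions, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class FrameAttention(nn.Module):
    """Illustrative sketch: every frame attends to all other frames in the
    clip, so the conditioning feature carries sequence-level context.
    Shapes and hyper-parameters are placeholders, not GRAVO's settings."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.mha = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, clip_feats):
        # clip_feats: (batch, num_frames, dim) CLIP embeddings, one per frame
        attended, _ = self.mha(clip_feats, clip_feats, clip_feats)
        # A residual connection keeps the original per-frame information
        return self.norm(clip_feats + attended)

# Toy example: one clip represented by 8 frames with 512-dim CLIP features
frames = torch.randn(1, 8, 512)
context_feats = FrameAttention()(frames)   # (1, 8, 512)
```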

Input: Hyper-realistic spaceship landing on Mars.

GRAVO output: [audio demo]

A second regularizer loss minimizes the variability of the learned visual features, preserving the original visual information while the alignment between the image and audio representations is maintained.

GRAVO is evaluated on the VGG-Sound and ImageHear datasets, which contain videos and single images, respectively. GRAVO surpasses Im2Wav in both audio classification accuracy and audio-visual similarity: it improves classification accuracy on VGG-Sound by 3.8% and achieves a higher audio-visual similarity score of 0.37.

Input text: Saxophone

Input image: [image]

GRAVO output: [audio demo]

Future scope of GRAVO

GRAVO’s unique approach to video-to-audio generation promises to change our multimedia experiences in the future. Imagine watching your favorite films and experiencing the subtle emotions of each moment as the audio complements the visual storyline wonderfully.

GRAVO’s ability to harness the rich context within videos will find applications in fields such as virtual reality, education, and content creation, delivering immersive and emotionally resonant experiences that blur the lines between the visual and auditory worlds. GRAVO is ready to usher in a new era of audio-visual synergy, forever altering how we perceive and interact with digital material.

Related research and study of GRAVO

The research paper for this model is available on the ISCA Speech archive. The paper is publicly accessible, and anyone interested in a deeper study of GRAVO can visit the website and read the detailed paper.

Potential applications of GRAVO

GRAVO’s ground-breaking technology has enormous promise across multiple domains. GRAVO has the potential to improve the quality of films, television shows, and video games by offering more accurate and emotionally evocative soundtracks. GRAVO can be used by directors and game developers to ensure that every visual aspect is matched with audio that complements the atmosphere and narrative, increasing viewer immersion and engagement.

GRAVO’s applications go beyond entertainment and into education. This technology can be used by online learning platforms to provide more engaging and immersive instructional content. Imagine students being able to investigate historical events through multimedia presentations that not only show them the pictures but also transmit the sounds and emotions of the time, actually bringing history to life.

GRAVO could be used in healthcare to improve medical training simulations, allowing practitioners to practice operations or medical procedures while receiving realistic auditory feedback, adding another element of authenticity to their training.

Methods used in GRAVO

GRAVO’s purpose is to generate high-quality sound for a given image or image sequence. The model is divided into three modules: a pre-trained image encoder extracts visual representations, an attention-based conditional audio generator translates these image features into a sequence of discrete tokens, and a pre-trained audio decoder finally reconstructs the waveform.
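
As a rough illustration of this three-module flow, here is a hedged sketch in Python. The image_encoder, audio_generator, and audio_decoder callables are hypothetical stand-ins for the pre-trained CLIP encoder, the conditional token generator, and the pre-trained VQ-VAE decoder described below; the shapes are toy values chosen only to show the data flow.

```python
import torch

def generate_audio(frames, image_encoder, audio_generator, audio_decoder):
    """Sketch of the three-module pipeline described above. Each argument is
    a hypothetical stand-in for the corresponding pre-trained component."""
    visual_feats = image_encoder(frames)          # (num_frames, dim) features
    audio_tokens = audio_generator(visual_feats)  # discrete audio token ids
    return audio_decoder(audio_tokens)            # reconstructed waveform

# Toy stand-ins, used only to show the data flow (shapes are illustrative):
encoder = lambda x: torch.randn(x.shape[0], 512)                  # CLIP-like
generator = lambda f: torch.randint(0, 1024, (f.shape[0] * 10,))  # token ids
decoder = lambda t: torch.randn(t.shape[0] * 160)                 # waveform

waveform = generate_audio(torch.randn(8, 3, 224, 224),
                          encoder, generator, decoder)
```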

Visual and audio representation:

The image feature extractor is a pre-trained CLIP model, and the audio feature extractor is a pre-trained VQ-VAE model. The CLIP model is trained to maximize the similarity score of text and image pairs in order to discover the correlation between them. Previous research shows that CLIP effectively captures the underlying semantics of images and performs well in audio generation tasks.
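
For a concrete feel of the visual feature extraction step, here is a hedged example using the Hugging Face transformers CLIP implementation. The specific checkpoint (ViT-B/32) and the frame file names are assumptions made purely for illustration; the paper does not necessarily use this exact setup.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; the CLIP variant GRAVO actually uses may differ.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical frame files sampled from a video clip
frames = [Image.open(f"frame_{i}.jpg") for i in range(8)]
inputs = processor(images=frames, return_tensors="pt")

with torch.no_grad():
    frame_feats = model.get_image_features(**inputs)  # (8, 512) embeddings
```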

As an audio feature extractor, the GRAVO model employs the pre-trained one-dimensional hierarchical VQ-VAE model from Im2Wav. It is made up of three parts: an audio encoder, a quantizer with multi-level codebooks, and a decoder.

The encoder converts the audio waveform into a set of latent vectors, which are subsequently split into two-level representations with shorter sequence lengths. The quantizer transforms these representations into two-level discrete tokens, each level with its own codebook. The audio waveform is then recovered by the decoder, which is conditioned on the discrete tokens. During GRAVO model training, all audio in the dataset is tokenized and used as the prediction targets for the conditional audio generator. Finally, the waveform reconstructor is the pre-trained VQ-VAE decoder.
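
The core operation of a VQ-VAE quantizer is a nearest-codebook lookup. The sketch below illustrates that single step with toy shapes; it is not the pre-trained Im2Wav VQ-VAE itself, and the codebook size and latent dimension are arbitrary.

```python
import torch

def quantize(latents, codebook):
    """Nearest-codebook lookup: map each continuous latent vector to the id
    of its closest code and to that code's vector.
    latents:  (seq_len, dim) continuous encoder outputs
    codebook: (num_codes, dim) learned code vectors"""
    dists = torch.cdist(latents, codebook)   # (seq_len, num_codes) distances
    tokens = dists.argmin(dim=-1)            # (seq_len,) discrete token ids
    return tokens, codebook[tokens]          # ids and quantized vectors

# Toy example: 50 latent vectors against a 1024-entry, 64-dim codebook
tokens, quantized = quantize(torch.randn(50, 64), torch.randn(1024, 64))
```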

Architectural diagram of GRAVO

Conditional audio generation:

The conditional audio generator’s purpose is to predict discrete audio tokens given a sequence of image features. It consists of two auto-regressive language models, Up and Low, for coarse-to-fine generation. The two language models operate at different time resolutions. The Low model is in charge of defining the semantic information of the generation, whereas the Up model is in charge of filling in the fine details. In previous work, the Low model is conditioned at each generation step on the temporally aligned visual representation.
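
To make the auto-regressive generation step concrete, here is a hedged sketch of a sampling loop for one of the two language models. The lm callable is a hypothetical stand-in that returns next-token logits given the tokens generated so far and the visual conditioning.

```python
import torch

@torch.no_grad()
def sample_tokens(lm, visual_feats, num_steps, start_token=0):
    """Illustrative auto-regressive sampling: at each step the model sees the
    tokens generated so far plus the visual conditioning and emits logits
    over the next audio token. `lm` is a hypothetical callable."""
    tokens = [start_token]
    for _ in range(num_steps):
        logits = lm(torch.tensor(tokens), visual_feats)       # (vocab_size,)
        probs = logits.softmax(dim=-1)
        tokens.append(torch.multinomial(probs, 1).item())     # sample next id
    return tokens[1:]

# Toy stand-in that ignores its inputs and returns uniform logits
toy_lm = lambda toks, feats: torch.zeros(1024)
coarse_tokens = sample_tokens(toy_lm, visual_feats=None, num_steps=20)
```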

Relevance-guided multi-head attention:

The proposed multi-head attention module enlarges the receptive field at each generation time step. However, without sufficient guidance, the model may struggle to determine which image representation is most relevant to the current audio segment. To improve the model’s performance, they introduce two auxiliary losses that leverage the Wav2CLIP embedding to guide the MHA toward learning audio-relevant image features, allowing for better image-audio alignment.
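
A rough sketch of what such guidance losses could look like is shown below: an alignment term pulling the attended visual features toward the target audio’s Wav2CLIP embedding, plus a regularizer keeping them close to the original CLIP features. The exact formulation, pooling, and weighting used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def auxiliary_losses(attended_feats, clip_feats, wav2clip_emb, reg_weight=1.0):
    """Hedged sketch of the two guidance terms (not the paper's exact losses).
    attended_feats: (batch, num_frames, dim) MHA outputs
    clip_feats:     (batch, num_frames, dim) original CLIP features
    wav2clip_emb:   (batch, dim) embedding of the target audio"""
    pooled = attended_feats.mean(dim=1)                        # (batch, dim)
    # 1. Audio-visual alignment: maximize cosine similarity with Wav2CLIP
    align_loss = 1.0 - F.cosine_similarity(pooled, wav2clip_emb, dim=-1).mean()
    # 2. Regularizer: keep the attended features near the original CLIP ones
    reg_loss = F.mse_loss(attended_feats, clip_feats)
    return align_loss + reg_weight * reg_loss

# Toy shapes: 2 clips, 8 frames each, 512-dim embeddings
loss = auxiliary_losses(torch.randn(2, 8, 512),
                        torch.randn(2, 8, 512),
                        torch.randn(2, 512))
```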

Classifier-free guidance:

They apply the classifier-free guidance technique to adjust the trade-off between sample quality and variety, as in previous work. During training, the visual conditioning is replaced with a learned null embedding with a probability of 0.5 for each sample in the batch. During Low-model inference, audio tokens are generated by summing the probabilities predicted with and without the visual conditioning.
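
Here is a hedged sketch of the inference-time combination. The guidance scale and the exact way the two distributions are mixed are illustrative assumptions; the paper describes summing the probabilities obtained with and without visual conditioning.

```python
import torch

def guided_next_token(cond_logits, uncond_logits, guidance_scale=3.0):
    """Classifier-free guidance sketch: blend the model's predictions made
    with and without the visual conditioning, then sample the next token.
    The combination rule and scale are illustrative, not the paper's exact
    settings."""
    probs_cond = cond_logits.softmax(dim=-1)
    probs_uncond = uncond_logits.softmax(dim=-1)
    # Push the distribution toward the visually conditioned prediction
    probs = probs_uncond + guidance_scale * (probs_cond - probs_uncond)
    probs = probs.clamp(min=0)
    probs = probs / probs.sum()
    return torch.multinomial(probs, 1).item()

# Toy example with a 1024-token vocabulary
next_token = guided_next_token(torch.randn(1024), torch.randn(1024))
```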

Experimental results of GRAVO

The results on the two datasets are shown in the table. They compare against SpecVQGAN and Im2Wav as baselines. Their model outperforms both baselines by a wide margin on all metrics, particularly on the ACC metric. On ImageHear, the performance gap is substantial. They believe this is because each sample contains only one clean object, and GRAVO has learned to generate audio from soundable images successfully. Class-wise accuracy on ImageHear is reported for 30 classes; for the majority of classes, their technique outperforms Im2Wav.

Comparison table of GRAVO

They also conducted an ablation study to demonstrate the effectiveness of each module. The results are shown in the table. First, GRAVO improves all metrics over Im2Wav even without MHA guidance. All GRAVO values are greater than 8.79, which suggests that the MHA shifts the CLIP embeddings toward features related to the target audio rather than simply reproducing the original image sequence.

Final remarks about GRAVO

GRAVO, which generates audio that is relevant to the visuals, is presented in this study. To accomplish this, the model employs an MHA module on top of pre-trained CLIP features to learn the intrinsic relationships between images. They also propose using Wav2CLIP embeddings to guide the MHA module, allowing it to learn audio-relevant features in the images. According to the experimental results, GRAVO significantly improves generation quality across several metrics.

Reference

https://www.isca-speech.org/archive/pdfs/interspeech_2023/ahn23b_interspeech.pdf

