
Realistic audio generation for silent videos

Get ready for a journey where silence speaks and technology brings silent scenes to life. Prepare to be amazed by the power of adding audio to silence like you’ve never seen before. The researchers behind this work are Matthew Martel and Jackson Wagner.

Creating realistic audio effects for films and other media is a difficult task. Today, the primary method the industry employs is Foley art, in which trained artists produce sounds using everyday materials such as boxing gloves or shattered glass, which are then synced with the video to create convincing audio tracks.

In this project, the researchers attempt something similar, but with the help of deep learning. Rather than using real objects, they teach a computer model to observe the video and generate realistic sounds to go with it. They believe this is achievable because of recent advances in generating realistic audio from other kinds of input, such as WaveNet generating speech from text. The model is divided into two parts: a video encoder and a sound generator. The researchers employ a SampleRNN architecture for their sound generator.

Generating realistic audio for silent videos

Related work with realistic audio generation

Audio generation:

Audio-generation techniques have advanced significantly in recent years. In 2016, van den Oord et al. published “WaveNet: A Generative Model for Raw Audio,” which described a deep neural network architecture for generating raw audio waveforms and achieved state-of-the-art results on the text-to-speech task. Verma and Chafe show in their 2021 work, “A Generative Model for Raw Audio Using Transformer Architectures,” that a transformer design can be effective for generating audio at the waveform level. Both of these findings inform the systems for waveform generation from silent video in this study.
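To make the WaveNet idea concrete, here is a minimal sketch of a stack of dilated causal 1D convolutions in PyTorch. It illustrates the general technique only; the channel counts, layer count, and sample rate are illustrative assumptions, not details from either paper.

```python
import torch
import torch.nn as nn

class DilatedCausalBlock(nn.Module):
    """One WaveNet-style layer: a causal 1D convolution with a given dilation."""
    def __init__(self, channels: int, dilation: int, kernel_size: int = 2):
        super().__init__()
        # Left-pad so the convolution never sees future samples (causality).
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = nn.functional.pad(x, (self.pad, 0))
        return torch.relu(self.conv(x))

class TinyWaveNet(nn.Module):
    """Stack of dilated layers; the receptive field doubles with each layer."""
    def __init__(self, channels: int = 32, n_layers: int = 8):
        super().__init__()
        self.inp = nn.Conv1d(1, channels, kernel_size=1)
        self.layers = nn.ModuleList(
            [DilatedCausalBlock(channels, dilation=2 ** i) for i in range(n_layers)]
        )
        self.out = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, audio: torch.Tensor) -> torch.Tensor:
        # audio: (batch, 1, time) raw waveform in [-1, 1]
        h = self.inp(audio)
        for layer in self.layers:
            h = h + layer(h)            # residual connection
        return torch.tanh(self.out(h))  # per-time-step next-sample predictions

# Example: one second of 16 kHz audio context (assumed rate).
prediction = TinyWaveNet()(torch.randn(1, 1, 16000))
print(prediction.shape)  # torch.Size([1, 1, 16000])
```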

Video interpretation:

This work also draws on advances in video interpretation. In the 2021 study “CLIP4Caption: CLIP for Video Caption,” Tang et al. present a method for video captioning that uses a CLIP-enhanced video-text matching network. This approach has been shown to help the model learn video features that are strongly correlated with text, which benefits text generation. Similarly, for the realistic audio generation task, the models constructed here must learn video features that are strongly correlated with audio.

Realistic audio generation of piano and guitar

Prior audio-from-video work:

This work is not the first to create audio from video. In their 2018 paper, “Visual to Sound: Generating Natural Sound for Videos in the Wild,” Zhou et al. present a technique for producing realistic audio for videos from their visual content. The method treats audio production as a conditional generation problem.

Realistic Audio for silent video model

Creating realistic waveforms from visual context is useful for a variety of creative media tasks, such as creating soundtracks for animated films or adding realistic audio effects to live-action films. However, the task is difficult for several reasons.

First, the realistic audio-generation problem itself, in which a system produces audio similar to that contained in a dataset (as in speech generation systems), is difficult in its own right. Second, the ability of video context to inform audio creation is fundamentally limited. Third, producing a video context embedding that accurately encodes the visual information relevant to the audio generation task is a further challenge.

To tackle the challenge of realistic audio synthesis from silent video, they investigate three possible model designs.

The basic idea in all three cases is to augment audio-generation models with visual context in the form of embeddings, which are fused with the audio-generation stream at different stages of the computation. Given the appropriate visual context, they train their models to produce two-channel audio similar to that found in YouTube videos and in home videos they captured themselves.

There are three architectures used in this model:

Deep fusion architecture: The deep-fusion-based architecture they build is unusual in that it produces full audio sequences associated with the corresponding video frames rather than individual audio samples.

Dilated WaveNet CNN architecture: The input to this architecture is the raw audio context combined with the video context embedding, which is integrated using the video-to-audio transformation method.

Transformer: The third architecture they tested has the same high-level structure as the WaveNet-based architecture, but it replaces the dilated sampling of audio used to keep training tractable for large context sizes.

WaveNet-style raw audio generation for silent video

Future of realistic audio generation for silent video

Consider a world in which historic silent films are fully reconstructed with lifelike sound, improving the viewing experience and bringing these cinematic gems to new generations. Furthermore, this technology may find uses in virtual reality, video games, and immersive storytelling, where it improves the realism and emotional impact of interactive media. It may also pave the way for new forms of creative content, allowing filmmakers, game developers, and artists to create entirely new stories and experiences that blur the lines between reality and fiction.

Research details of realistic audio generation

This research on realistic audio generation for silent videos is open to everybody. It is available on arXiv, a platform where academics share their work for free, and samples of the generated audio are available on GitHub. Anyone interested in learning from and applying this work can do so because it is freely available to the public.

Potential applications for realistic audio generation

The research on producing realistic audio for silent video has a wide range of possible applications. Aside from entertainment and film restoration, this technology has the potential to improve accessibility by delivering enriched audio descriptions for the hearing-impaired, promoting equality in media consumption.

Furthermore, it has historical preservation potential, giving fresh life to silent films and archival videos for cultural heritage preservation. It can be used in education and training to provide realistic and interactive learning experiences ranging from medical simulations to historical videos, boosting understanding and involvement. Overall, the latest development in audiovisual technology has the potential to go beyond entertainment, influencing many parts of our lives, from accessibility and culture to education and beyond.

Methods used in realistic audio generation

In this deep learning approach to audio generation for silent video, the model is presented at test time with both audio context and video context. The following steps illustrate this process.

Data source: For this work, they used two primary data sources: YouTube videos and homemade videos collected by the authors. Videos from YouTube are downloaded using the Python pytube library, and the homemade videos are recorded with a smartphone.
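A hedged sketch of what the collection step could look like with pytube, assuming a plain download of the best progressive (video + audio) MP4 stream; the URL and output directory are placeholders, not values from the paper.

```python
from pytube import YouTube

def download_video(url: str, out_dir: str = "raw_videos") -> str:
    """Download the highest-resolution progressive (video + audio) MP4 stream."""
    yt = YouTube(url)
    stream = (
        yt.streams
        .filter(progressive=True, file_extension="mp4")
        .order_by("resolution")
        .desc()
        .first()
    )
    return stream.download(output_path=out_dir)

# Placeholder URL; homemade smartphone clips would simply be copied into raw_videos/.
path = download_video("https://www.youtube.com/watch?v=VIDEO_ID")
print("saved to", path)
```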

Data processing: Processing and downsampling of the raw data was necessary due to memory and computation constraints. Audio data is read using the Python library moviepy, video data is read using scikit-video, and individual frames are resized using cv2. To align the lengths of the audio and video frame arrays, they clip the audio array so that its length is a whole multiple of the audio samples per video frame, and they clip the video frame array so that it covers the same duration as the audio array.
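Here is a minimal sketch of how that loading, downsampling, and length alignment might look. The sample rate, frame resolution, and the exact clipping rule are assumptions inferred from the description above, not the authors' code.

```python
import cv2
import numpy as np
import skvideo.io
from moviepy.editor import VideoFileClip

AUDIO_RATE = 16000        # assumed downsampled audio rate (Hz)
FRAME_SIZE = (112, 112)   # assumed frame resolution after resizing

def load_clip(path: str):
    clip = VideoFileClip(path)
    # Read and downsample the audio track with moviepy; shape (samples, 2).
    audio = clip.audio.to_soundarray(fps=AUDIO_RATE)

    # Read the video frames with scikit-video and resize each frame with cv2.
    frames = skvideo.io.vread(path)
    frames = np.stack([cv2.resize(f, FRAME_SIZE) for f in frames])

    # Align lengths: clip the audio to a whole number of video frames' worth
    # of samples, then clip the frame array to cover the same duration.
    samples_per_frame = int(round(AUDIO_RATE / clip.fps))
    audio = audio[: len(audio) - len(audio) % samples_per_frame]
    frames = frames[: len(audio) // samples_per_frame]
    return audio, frames

audio, frames = load_clip("raw_videos/example.mp4")
print(audio.shape, frames.shape)
```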

The audio generation process for silent videos

Video context embedding: To condition the model, a video context embedding describing the preceding video frames is generated using a ResNet 3D CNN architecture. Residual blocks are then used to process this video context before it is projected to the dimension of the audio data.
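As a rough sketch of this step, one could use torchvision's R3D-18 as the ResNet 3D backbone, process its pooled features with a residual block, and project them to the audio dimension. The backbone choice, clip length, and projection size are assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

class VideoContextEncoder(nn.Module):
    """Embed the preceding video frames with a ResNet 3D CNN, then project."""
    def __init__(self, audio_dim: int = 256):
        super().__init__()
        backbone = r3d_18(weights=None)   # ResNet 3D (R3D-18) backbone
        backbone.fc = nn.Identity()       # keep the 512-d pooled features
        self.backbone = backbone
        # One residual block over the pooled features, then a projection.
        self.res_block = nn.Sequential(
            nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512)
        )
        self.proj = nn.Linear(512, audio_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 3, time, height, width), e.g. the last 16 frames
        features = self.backbone(frames)              # (batch, 512)
        features = features + self.res_block(features)  # residual processing
        return self.proj(features)                    # (batch, audio_dim)

emb = VideoContextEncoder()(torch.randn(1, 3, 16, 112, 112))
print(emb.shape)  # torch.Size([1, 256])
```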

Audio generation: The model draws on both the audio and video contexts while producing realistic audio to accompany the video. At each step, the model uses all of the audio samples generated up to that point, with any missing samples filled in with zeros; similarly, if video frames are missing, the video context is filled with zeros. The audio generated at each step is accumulated until the entire audio sequence is saved as an MP3 file.
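A hedged sketch of that generation loop, with zero-filling of missing audio samples and video frames. The `model` callable, the context length, the frame-indexing rule, and writing a WAV file (the article mentions MP3) are placeholder assumptions to keep the sketch self-contained.

```python
import numpy as np
import soundfile as sf  # assumption: any audio writer would do; the paper saves MP3

AUDIO_CTX = 16000    # assumed audio-context length in samples
N_SAMPLES = 16000    # generate one second at an assumed 16 kHz rate

def generate(model, video_frames: np.ndarray, out_path: str = "generated.wav"):
    generated = []   # stereo samples produced so far
    for step in range(N_SAMPLES):
        # Audio context: everything generated so far, zero-filled where missing.
        ctx = np.zeros((AUDIO_CTX, 2), dtype=np.float32)
        past = np.array(generated[-AUDIO_CTX:], dtype=np.float32).reshape(-1, 2)
        if len(past):
            ctx[-len(past):] = past

        # Video context: the frame for this point in time, or zeros if missing.
        if len(video_frames):
            idx = min(step * len(video_frames) // N_SAMPLES, len(video_frames) - 1)
            frame = video_frames[idx]
        else:
            frame = np.zeros((112, 112, 3), dtype=np.float32)

        generated.append(model(ctx, frame))  # model returns the next (left, right) sample

    sf.write(out_path, np.array(generated, dtype=np.float32), 16000)

# Dummy stand-in model that emits silence, just to show the call pattern.
generate(lambda ctx, frame: (0.0, 0.0), np.zeros((50, 112, 112, 3), dtype=np.float32))
```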

Deep fusion architecture: The deep-fusion-based architecture they build is unusual in that it produces full audio sequences associated with the corresponding video frames rather than individual audio samples. The architecture collects audio and video context from the previous video frames and processes them in parallel using ResNet CNN sub-blocks. After each sub-block, the outputs of the audio and video processing streams are transformed and added to each other.
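A rough sketch of the deep-fusion idea under stated assumptions: audio and video contexts pass through parallel residual convolutional sub-blocks, and the video stream is added into the audio stream after each sub-block (a simplification of the mutual addition described above). Channel counts, block depth, and the assumption that per-frame video embeddings are already upsampled to the audio time resolution are all illustrative.

```python
import torch
import torch.nn as nn

class ResSubBlock(nn.Module):
    """A small residual 1D-conv sub-block used by both streams."""
    def __init__(self, channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv1d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return torch.relu(x + self.net(x))

class DeepFusion(nn.Module):
    """Parallel audio/video streams fused after each sub-block."""
    def __init__(self, channels: int = 32, n_blocks: int = 3):
        super().__init__()
        self.audio_in = nn.Conv1d(2, channels, 1)    # stereo audio context
        self.video_in = nn.Conv1d(512, channels, 1)  # video embeddings per time step
        self.audio_blocks = nn.ModuleList([ResSubBlock(channels) for _ in range(n_blocks)])
        self.video_blocks = nn.ModuleList([ResSubBlock(channels) for _ in range(n_blocks)])
        self.head = nn.Conv1d(channels, 2, 1)        # full two-channel audio sequence

    def forward(self, audio_ctx, video_ctx):
        # audio_ctx: (batch, 2, time); video_ctx: (batch, 512, time)
        a, v = self.audio_in(audio_ctx), self.video_in(video_ctx)
        for ab, vb in zip(self.audio_blocks, self.video_blocks):
            a, v = ab(a), vb(v)
            a = a + v   # fuse: add the video stream's output into the audio stream
        return torch.tanh(self.head(a))

out = DeepFusion()(torch.randn(1, 2, 1600), torch.randn(1, 512, 1600))
print(out.shape)  # torch.Size([1, 2, 1600])
```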

Dilated WaveNet CNN architecture: This model generates an audio sequence whose last element serves as the next audio sample. The input to this architecture is the raw audio context combined with the video context embedding, which is integrated using the video-to-audio transformation method.
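A minimal sketch of that setup, assuming the video embedding is projected and broadcast over time, added to the raw audio context, and passed through dilated causal convolutions whose last output element is taken as the next sample; sizes and layer counts are assumptions.

```python
import torch
import torch.nn as nn

class ConditionedWaveNetStep(nn.Module):
    """Dilated causal CNN over raw audio context plus a broadcast video embedding."""
    def __init__(self, channels: int = 32, video_dim: int = 256, n_layers: int = 6):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, channels)
        self.audio_in = nn.Conv1d(2, channels, 1)   # stereo audio context
        self.dilated = nn.ModuleList(
            [nn.Conv1d(channels, channels, 2, dilation=2 ** i) for i in range(n_layers)]
        )
        self.out = nn.Conv1d(channels, 2, 1)

    def forward(self, audio_ctx, video_emb):
        # audio_ctx: (batch, 2, time); video_emb: (batch, video_dim)
        h = self.audio_in(audio_ctx) + self.video_proj(video_emb).unsqueeze(-1)
        for conv in self.dilated:
            pad = (conv.kernel_size[0] - 1) * conv.dilation[0]   # causal left-padding
            h = h + torch.relu(conv(nn.functional.pad(h, (pad, 0))))
        seq = torch.tanh(self.out(h))   # a full predicted sequence...
        return seq[..., -1]             # ...whose last element is the next sample

next_sample = ConditionedWaveNetStep()(torch.randn(1, 2, 4000), torch.randn(1, 256))
print(next_sample.shape)  # torch.Size([1, 2])
```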

Transformer: The third architecture they tested has the same high-level structure as the WaveNet-based architecture, but it replaces the dilated sampling of audio used to keep training tractable for large context sizes. The second transformer variant used less raw audio data as the audio context and fewer transformer parameters. For each forward pass, the model employs a linear decoder designed to output the next individual audio sample. Throughout the model, ReLU activations are used, with a final Tanh activation at the end.
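A hedged sketch of that variant under stated assumptions: a small Transformer encoder over a shortened audio context with the video embedding added to every token, ReLU activations inside, a linear decoder producing the next stereo sample, and a final Tanh; all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class TransformerAudioStep(nn.Module):
    """Predict the next stereo sample from a short audio context + video embedding."""
    def __init__(self, d_model: int = 64, video_dim: int = 256, n_layers: int = 2):
        super().__init__()
        self.audio_in = nn.Linear(2, d_model)           # embed each stereo sample
        self.video_proj = nn.Linear(video_dim, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, dim_feedforward=128,
            activation="relu", batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.decoder = nn.Linear(d_model, 2)            # linear decoder -> next sample

    def forward(self, audio_ctx, video_emb):
        # audio_ctx: (batch, time, 2) shortened raw audio context
        # video_emb: (batch, video_dim)
        tokens = self.audio_in(audio_ctx) + self.video_proj(video_emb).unsqueeze(1)
        h = self.encoder(tokens)
        return torch.tanh(self.decoder(h[:, -1]))       # final Tanh, one stereo sample

sample = TransformerAudioStep()(torch.randn(1, 400, 2), torch.randn(1, 256))
print(sample.shape)  # torch.Size([1, 2])
```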

Loss function: All three models described above were trained and evaluated using MSE, MAE, and cross-entropy loss. Only cross-entropy loss produced respectable results, so it was chosen as the primary loss function for further model refinement.
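To make the comparison concrete, here is a small sketch of the three losses on a batch of predicted waveforms. The 256-level mu-law quantization used for the cross-entropy variant is an assumption (a common choice for raw audio), not a detail confirmed by the article.

```python
import torch
import torch.nn.functional as F

def mu_law_quantize(wave: torch.Tensor, levels: int = 256) -> torch.Tensor:
    """Map waveform values in [-1, 1] to integer classes (assumed mu-law encoding)."""
    mu = levels - 1
    compressed = torch.sign(wave) * torch.log1p(mu * wave.abs()) \
        / torch.log1p(torch.tensor(float(mu)))
    return ((compressed + 1) / 2 * mu).round().long().clamp(0, mu)

target = torch.rand(4, 16000) * 2 - 1          # fake target waveforms in [-1, 1]
pred_wave = torch.tanh(torch.randn(4, 16000))  # regression-style prediction
pred_logits = torch.randn(4, 256, 16000)       # classification-style prediction

mse = F.mse_loss(pred_wave, target)                         # MSE on raw samples
mae = F.l1_loss(pred_wave, target)                          # MAE on raw samples
ce = F.cross_entropy(pred_logits, mu_law_quantize(target))  # CE on quantized samples

print(f"MSE={mse:.3f}  MAE={mae:.3f}  CE={ce:.3f}")
```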

Conclusion about this model

This study provides important insights into realistic audio generation for videos. It highlights the potential of advanced audio generation methods, specifically transformer designs, even in the context of video. A sample-by-sample technique was successful in producing smoother music with precise peaks where needed. Introducing video context early in the audio creation process improves audio quality, aligning it more closely with the video material.

However, the models still had difficulty capturing complex or high-frequency sounds, which might be attributed to insufficient encoding in the video context or to the down-sampling of the data. For more promising outcomes, future work would include scaling up the model and diversifying the training dataset, as well as investigating improved video context embedding approaches.

References

https://arxiv.org/pdf/2308.12408v1.pdf

https://github.com/jaxwagner/sound_from_video

