FusionFrames: An Efficient Model for Text-to-Video Generation

Are you ready for the latest leap in generative AI? FusionFrames, the newest innovation in this space, aims to push video generation technology forward.

The FusionFrames model is presented by Vladimir Arkhipkin, Zein Shaheen, Viacheslav Vasilev, Elizaveta Dakhova, Andrey Kuznetsov, and Denis Dimitrov from Sber AI, the Moscow Institute of Physics and Technology, and the Artificial Intelligence Research Institute.

This paper presents an innovative two-stage latent diffusion framework for text-to-video generation that builds upon a text-to-image diffusion model. The model promises an exciting leap forward in creating dynamic videos from text inputs. The initial stage focuses on synthesizing keyframes to outline the video’s storyline, while the subsequent stage generates interpolation frames that ensure smooth movement of scenes and objects.

FusionFrames takes a textual input and generates a video as output, building on a text-to-image diffusion model.

Workflow of FusionFrames

Recent research has extended Text-to-Image (T2I) diffusion-based architectures to enable remarkable advances in Text-to-Video (T2V) generation. Video generation models employ VAEs, GANs, normalizing flows, and autoregressive transformers. Some models operate in pixel space, while others work in latent space.

Prior models have architectures that demand long inference times and high computational costs. They also require large, high-quality open-source text-video datasets, and the available datasets are insufficient when a model is trained from scratch.

Exploration of FusionFrames

In recent years, approaches for generating images from text (T2I) have delivered remarkable results. Video generation is a natural and logical next step in this direction.

Example of FusionFrames

To achieve a high level of realism and aesthetic appeal in video generation, it is essential to ensure not just the visual quality of individual frames but also coherence across frames in semantic content and appearance, smooth transitions of objects between adjacent frames, and an accurate depiction of movement physics. These aspects are collectively known as temporal information, which is essential for the video modality.

Diffusion models incorporate temporal information by integrating temporal convolutional layers or temporal attention layers into their architecture. This allows the weights of the spatial layers to be initialized from a pre-trained Text-to-Image (T2I) model, so training can focus solely on the temporal layers. This technique reduces the need for large text-video pair datasets, and working in latent space further reduces the computational cost.
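
To make this concrete, here is a minimal PyTorch sketch of the general idea, not the FusionFrames code itself: a spatial convolution whose weights would come from a pretrained T2I model is kept frozen, while a newly added temporal attention layer, which attends across the frame axis, is the only part that trains. All module names, shapes, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch, assuming a pretrained spatial (T2I) layer is frozen and only
# the newly added temporal attention layer is trained. Not the FusionFrames code.
import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    """Self-attention over the time (frame) axis at each spatial location."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, t, c, h, w = x.shape
        # Fold spatial positions into the batch so attention runs along time only.
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        normed = self.norm(tokens)
        tokens = tokens + self.attn(normed, normed, normed)[0]  # residual connection
        return tokens.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)


class VideoBlock(nn.Module):
    """Frozen spatial conv (from the T2I model) followed by trainable temporal attention."""

    def __init__(self, spatial_conv: nn.Conv2d):
        super().__init__()
        self.spatial = spatial_conv            # weights taken from the pretrained T2I model
        self.spatial.requires_grad_(False)     # training focuses solely on the temporal layer
        self.temporal = TemporalAttention(spatial_conv.out_channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = x.shape
        x = self.spatial(x.reshape(b * t, c, h, w)).reshape(b, t, -1, h, w)
        return self.temporal(x)


# Toy usage: a 2-frame, 32x32, 64-channel latent passed through one block.
block = VideoBlock(nn.Conv2d(64, 64, kernel_size=3, padding=1))
out = block(torch.randn(1, 2, 64, 32, 32))
print(out.shape)  # torch.Size([1, 2, 64, 32, 32])
```

Because the spatial layer is frozen, only the temporal parameters receive gradients, which is what lets such models get away with far less paired text-video data.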

The code is available on GitHub, and the research paper is openly available on arXiv.

Methodology

The model is divided into two stages: a Keyframe Generation Stage, which lays out the primary storyline of the video, and an Interpolation Stage, which generates extra frames to produce smooth movement. This division keeps the video consistent with the text description, in both content and dynamics, across the whole video.
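
As a rough illustration of how the two stages fit together, the schematic Python sketch below generates sparse keyframes from the prompt and then fills in frames between each adjacent pair. The `generate_keyframes` and `interpolate` callables are placeholders for the two models, not the actual FusionFrames API.

```python
# Schematic sketch of the two-stage pipeline, assuming placeholder callables for
# the keyframe-generation and interpolation models. Not the real FusionFrames API.
from typing import Any, Callable, List

Frame = Any  # stands in for whatever image/latent type the models produce


def text_to_video(
    prompt: str,
    generate_keyframes: Callable[[str, int], List[Frame]],
    interpolate: Callable[[Frame, Frame, int], List[Frame]],
    num_keyframes: int = 8,
    frames_between: int = 3,
) -> List[Frame]:
    """Stage 1 sets the storyline with keyframes; stage 2 adds frames for smooth motion."""
    keyframes = generate_keyframes(prompt, num_keyframes)

    video: List[Frame] = []
    for left, right in zip(keyframes, keyframes[1:]):
        video.append(left)
        # Insert extra frames between each adjacent pair of keyframes.
        video.extend(interpolate(left, right, frames_between))
    video.append(keyframes[-1])
    return video
```

With the illustrative defaults above (8 keyframes and 3 interpolated frames per gap), the pipeline would return 8 + 7 × 3 = 29 frames.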

Separate temporal blocks are proposed for processing temporal information, along with a highly efficient interpolation architecture that runs more than three times faster than prevalent masked frame interpolation models while producing interpolated frames of superior fidelity.

Dataset for FusionFrames

The keyframe generation model’s internal training dataset comprises 120,000 text-video pairs, and this same dataset was utilized for training the interpolation model.

Performance

A comparison was carried out among different trained models using CLIPSIM on MSR-VTT, and FVD and IS on UCF-101. The findings clearly show that incorporating temporal blocks, as opposed to temporal layers, notably improves quality.
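
For readers unfamiliar with CLIPSIM: it scores how well generated frames match the prompt by averaging the CLIP text-image cosine similarity over the frames of a clip. The sketch below illustrates the idea using the Hugging Face `transformers` CLIP model as an example backbone; it is not the paper's evaluation code, and FVD and IS are computed separately with their own feature extractors.

```python
# Sketch of the CLIPSIM metric: average CLIP text-image cosine similarity over
# the frames of a generated clip. Uses Hugging Face `transformers` CLIP as an
# example backbone; this is not the paper's evaluation code.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def clipsim(prompt: str, frames: list[Image.Image]) -> float:
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    # Cosine similarity between the prompt embedding and each frame embedding.
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).mean().item()


# Example: score a clip of generated frames against its prompt.
frames = [Image.new("RGB", (256, 256)) for _ in range(4)]  # placeholder frames
print(clipsim("a red car driving through snow", frames))
```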

Evaluation of FusionFrames

Wrap Up

FusionFrames explored various aspects of Text-to-Video (T2V) architecture design to achieve the highest possible output quality. This involved building a two-stage model for video synthesis and exploring diverse ways of integrating temporal information, such as temporal blocks and temporal layers.

The interpolation architecture in FusionFrames is also notably efficient at inference, running more than three times faster than the well-known masked frame interpolation approach.
