MLNews

Stable Video Diffusion: A Text-to-Video and Image-to-Video Generation Model

Presenting Stable Video Diffusion (SVD) for cutting-edge text-to-video and image-to-video synthesis, transforming 2D image models into dynamic video generators.

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, and Robin Rombach from Stability AI presented this state-of-the-art model.

Stable Video Diffusion (SVD) is a latent video diffusion model designed for advanced text-to-video and image-to-video generation. Latent diffusion models were originally developed to generate 2D images; SVD extends them by adding temporal layers so that the model produces a coherent sequence of frames that forms a video.
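The sketch below is illustrative only, not the released SVD code: it shows one common way a temporal attention layer can be inserted after a pretrained 2D spatial block so that features are mixed across frames. All class and argument names here are our own.

import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention over the time axis, applied independently at each spatial location."""
    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        # x: (batch * frames, channels, height, width) -> attend across the frame axis
        bf, c, h, w = x.shape
        b = bf // num_frames
        x = x.view(b, num_frames, c, h * w).permute(0, 3, 1, 2)   # (b, hw, t, c)
        x = x.reshape(b * h * w, num_frames, c)
        out, _ = self.attn(self.norm(x), self.norm(x), self.norm(x))
        x = x + out                                               # residual mixing over time
        x = x.reshape(b, h * w, num_frames, c).permute(0, 2, 3, 1)
        return x.reshape(bf, c, h, w)

class SpatioTemporalBlock(nn.Module):
    """A pretrained 2D (spatial) block followed by an inserted temporal layer."""
    def __init__(self, spatial_block: nn.Module, channels: int):
        super().__init__()
        self.spatial = spatial_block            # block from the 2D image diffusion model
        self.temporal = TemporalAttention(channels)

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        x = self.spatial(x)                     # per-frame spatial processing (conditioning omitted)
        return self.temporal(x, num_frames)     # mix features across frames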

Stable Video Diffusion

The model takes text or image input for high-quality text-to-video and image-to-video generation, and it can also produce consistent multi-view (3D) outputs. The code is available on GitHub, and the research paper is openly available on arXiv. Practical applications of Stable Video Diffusion span the entertainment industry, computer vision, healthcare, and robotics.

Most previous video generation models rely on diffusion models, with their video modeling techniques borrowed from the 2D image domain. The diffusion model is applied repeatedly to reach high-resolution text-to-image and video synthesis, but separating low-resolution video pretraining from high-quality finetuning still needs refinement to produce coherent frames that form a video. Diverse approaches exist for analyzing video data for training, but a unified method for organizing, selecting, and preparing that data was missing.

What is Stable Video Diffusion?

Stable Video Diffusion identifies this issue and proposes data curation as the solution: a well-defined workflow for organizing and maintaining video data that keeps it usable and reliable for other video generative models as well. With this curation technique, state-of-the-art text-to-video and image-to-video models were trained. The base model was then fine-tuned on a smaller, high-quality dataset for high-resolution downstream tasks such as text-to-video and image-to-video generation. The model is also capable of multi-view synthesis, generating multiple consistent views of an object.

The model's inherited assumptions about motion and 3D understanding were examined through experiments in specific domains.

How does it work?

The video diffusion model was trained on a large dataset of videos. For this purpose, a data processing and curation method was introduced, and three distinct training stages for generative video modeling were identified: image pretraining, in which a 2D diffusion model learns to translate text into images; video pretraining, in which the model is trained on a large number of video clips; and video finetuning, in which the model is refined on a smaller subset of high-quality videos, particularly at higher resolutions.
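A minimal sketch of these three stages applied in order; the helper functions below are placeholders of our own, not the authors' training code.

def pretrain_on_images(model, image_text_pairs):
    """Stage 1: image pretraining -- learn spatial (2D) text-to-image weights."""
    ...  # train a 2D latent diffusion model on text-image pairs
    return model

def pretrain_on_videos(model, curated_clips, resolution="low"):
    """Stage 2: video pretraining -- train the temporal layers on a large curated set of clips."""
    ...  # large-scale training at lower resolution
    return model

def finetune_on_videos(model, hq_clips, resolution="high"):
    """Stage 3: video finetuning -- refine on a small set of high-quality, high-resolution clips."""
    ...  # small-scale, high-resolution finetuning
    return model

def train_stable_video_diffusion(model, image_data, curated_video_data, hq_video_data):
    model = pretrain_on_images(model, image_data)
    model = pretrain_on_videos(model, curated_video_data, resolution="low")
    model = finetune_on_videos(model, hq_video_data, resolution="high")
    return model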

Dataset and Training

Curating Data for HQ Video Synthesis:

An initial dataset of long videos was collected as base data for the video pretraining stage. This dataset, the Large Video Dataset (LVD), consists of 580M annotated video clips. The effect of image pretraining was analyzed on a 10M subset of LVD, both with and without pretrained spatial weights. The data curation approach described above improves the training of the video diffusion model.
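To make the idea of curation concrete, the sketch below shows the kind of per-clip metadata filtering such a pipeline can apply; the field names and thresholds are illustrative assumptions, not the paper's exact criteria.

from dataclasses import dataclass

@dataclass
class ClipMetadata:
    clip_id: str
    optical_flow_score: float   # low values suggest static or near-static clips
    aesthetic_score: float      # e.g. a CLIP-based aesthetic predictor
    text_area_fraction: float   # OCR-detected text coverage in the frames
    caption: str                # synthetic caption used for training

def keep_clip(meta: ClipMetadata,
              min_flow: float = 2.0,
              min_aesthetic: float = 4.5,
              max_text_area: float = 0.2) -> bool:
    """Return True if the clip passes all curation filters (thresholds are illustrative)."""
    if meta.optical_flow_score < min_flow:        # drop still or low-motion clips
        return False
    if meta.aesthetic_score < min_aesthetic:      # drop visually poor clips
        return False
    if meta.text_area_fraction > max_text_area:   # drop clips dominated by overlaid text
        return False
    return bool(meta.caption.strip())             # require a usable caption

# Usage, given an iterable of ClipMetadata records:
# curated = [m for m in all_clip_metadata if keep_clip(m)]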

Training the Video Model:

The text-to-video model was fine-tuned on a high-quality dataset of 1M samples. These samples feature varied object motion, steady camera motion, and high visual quality. The base model was fine-tuned for 50k iterations at a resolution of 576 × 1024.

No masking technique was used in the image-to-video generation model. Two models were fine-tuned: one predicting 14 frames and the other predicting 25 frames.
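For readers who want to try the released image-to-video checkpoints, here is a minimal usage sketch. It assumes the Hugging Face diffusers StableVideoDiffusionPipeline and the publicly released 25-frame "img2vid-xt" weights; the authors' reference code is in their GitHub repository.

import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# "img2vid" predicts 14 frames; the "img2vid-xt" variant predicts 25 frames.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

image = load_image("input.jpg").resize((1024, 576))   # conditioning image at 1024x576
frames = pipe(image, decode_chunk_size=8,
              generator=torch.manual_seed(42)).frames[0]
export_to_video(frames, "generated.mp4", fps=7)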

For multi-view generation, the Stable Video Diffusion (SVD) model was fine-tuned on two datasets, including a subset of Objaverse containing 150K curated, CC-licensed synthetic 3D objects from the original dataset. The model was trained for 12k steps (~16 hours) on 8 80GB A100 GPUs with a total batch size of 16 and a learning rate of 1e-5.
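As a compact summary, the reported multi-view finetuning setup could be written as a configuration like the one below; the dictionary layout is our own, and only the numbers come from the paragraph above.

multiview_finetune_config = {
    "base_checkpoint": "stable-video-diffusion",   # SVD weights used as the starting point
    "dataset": "Objaverse subset (150K curated, CC-licensed synthetic 3D objects)",
    "train_steps": 12_000,                         # roughly 16 hours of training
    "gpus": 8,                                     # NVIDIA A100, 80 GB each
    "total_batch_size": 16,
    "learning_rate": 1e-5,
}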

Conclusion

Stable Video Diffusion (SVD), a highly efficient model for high-resolution text-to-video and image-to-video synthesis, was introduced. It also performs multi-view synthesis with high precision and accuracy. It is a latent video diffusion model built around three distinct stages of video model training. SVD offers a robust video representation, serving as a foundation for fine-tuning video models to achieve cutting-edge image-to-video synthesis and other pertinent applications, including LoRAs for camera control.
