MLNews

MoonShot by Salesforce Research: Video Generation through the Blending of Text and Image

MoonShot: Ever wished that videos could be created from just text, an image, or a combination of both? This model works wonders, crafting personalized videos, editing videos, and animating images in a controllable manner. Researchers from Salesforce Research and Show Lab, National University of Singapore, presented this innovative and creative model.

The model takes an image and text as input and generates a video as output. The combination of text and image yields videos that are out of the box and extremely creative. With this model, different ideas can be presented in unique ways to generate something innovative and fresh.

Workflow of MoonShot

Most current models rely solely on textual descriptions to make videos, and they face great challenges because they lack control over video structure and appearance. This latest model is more versatile and flexible, as it uses both an image and a textual description at the same time to make the video according to the user's requirements.

The above is an example of video generation using text and an image simultaneously as input. An image of a dog and an image of a robot were given to the model along with the prompts “A dog is running on the grass” and “A robot is landing to fly”, respectively; the output can be seen in the clip above.

Basic Architecture of MoonShot

The most integral part of MoonShot is the “Multimodal Video Block” (MVB). It is considered the brain of the model and consists of two layers:

Spatial-Temporal Layers: These layers determine how objects should appear in the video and how they should move and change position over time. They handle the movement and appearance of the different elements and objects in the video.

Decoupled Cross-Attention Layer: This layer attends to the image and the text through separate attention branches, considering both conditions simultaneously while creating the video.

Together, these layers of the MVB help the model understand how elements and objects should look in the video, while paying attention to the text and image to generate a controllable video that matches the user's imagination.
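The decoupled cross-attention idea can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the feature dimensions, the toy embeddings, and the simple additive combination of the two branches are all assumptions made for the sketch. The key point it shows is that the text and the image each get their own key/value projections (hence “decoupled”), and the video tokens attend to each modality separately.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, Wk, Wv):
    # Project the conditioning context to keys and values,
    # then let the video-token queries attend to it.
    K = context @ Wk
    V = context @ Wv
    scores = queries @ K.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d = 8                                     # feature dimension (assumed)
frames = rng.standard_normal((4, d))      # query features for 4 video tokens
text   = rng.standard_normal((6, d))      # 6 text-token embeddings (toy)
image  = rng.standard_normal((3, d))      # 3 image-patch embeddings (toy)

# Separate key/value projections per modality — the "decoupled" part:
Wk_t, Wv_t = rng.standard_normal((d, d)), rng.standard_normal((d, d))
Wk_i, Wv_i = rng.standard_normal((d, d)), rng.standard_normal((d, d))

# Each condition is attended to independently, then the results are combined.
out = (cross_attention(frames, text,  Wk_t, Wv_t)
       + cross_attention(frames, image, Wk_i, Wv_i))
print(out.shape)  # (4, 8): one conditioned feature vector per video token
```

Because each modality has its own projections, the image condition can steer appearance while the text condition steers content, without the two competing inside a single attention layer.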

Specialized “image ControlNet modules”, which are already familiar with the layout and geometry of images, were utilized. With this technique, the model does not require extensive training from scratch to make videos more consistent, and the quality of the videos is not compromised.
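Conceptually, a ControlNet-style module injects features derived from the conditioning image into a frozen, pretrained backbone as a residual, so the backbone itself needs no retraining. The toy NumPy sketch below assumes all shapes and layer definitions; the one faithful detail is the zero-initialized output projection, which guarantees the pretrained behavior is untouched before the control branch is trained.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8  # feature dimension (assumed)

def backbone_block(x, W):
    # Stand-in for a frozen pretrained layer.
    return np.tanh(x @ W)

W_frozen = rng.standard_normal((d, d))   # pretrained weights, kept frozen
W_ctrl   = rng.standard_normal((d, d))   # trainable control branch
W_zero   = np.zeros((d, d))              # zero-init projection: no effect at start

x       = rng.standard_normal((4, d))    # backbone features for 4 tokens
control = rng.standard_normal((4, d))    # features from the geometry/layout condition

# ControlNet-style residual: backbone output plus a projected control signal.
ctrl_feat = np.tanh((x + control) @ W_ctrl)
out = backbone_block(x, W_frozen) + ctrl_feat @ W_zero

# With the zero-initialized projection, the backbone output is unchanged,
# so training can start from the pretrained model's exact behavior.
print(np.allclose(out, backbone_block(x, W_frozen)))  # True
```

As `W_zero` is trained away from zero, the control signal gradually shapes the output, which is why such modules can add layout and geometry control without training the whole model from scratch.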

The above clip is an example of the image ControlNet modules in use. From the generated output it can clearly be seen that this module excels at generating high-quality videos with accurate geometry and layout.

Comparing MoonShot with other video generation models such as AnimateDiff showed that this latest model works considerably better. AnimateDiff requires extra effort to fine-tune each visual element individually, which ultimately slows down the process.

From the above clip, the difference between MoonShot and AnimateDiff can clearly be seen: the video generated by AnimateDiff shows flickering and inconsistency. In the example “A woman at the beach”, the woman's face changes with every frame, while in “A cat eating spaghetti, cartoon” the appearance keeps shifting, showing inconsistency. The videos generated by MoonShot are of very high quality: smooth, consistent in appearance, and with controllable motion.

Wrap Up!

MoonShot is capable of producing high-quality, controllable videos using both text and images, but in my view PIA works better in the category of image animation, as PIA can render the same image with diverse emotions such as smiling or crying, which makes the video generation idea more diverse and flexible.
