MLNews

SynCLR: Google Secretly Revealed Learning from Imagination

SynCLR: An approach that teaches a model to understand realistic visuals through imagination, in the form of pictures and their related captions. Researchers from Google and MIT CSAIL presented this model. It was not officially launched by the Google team; it is work presented by a researcher while interning there.
SynCLR is a model that recognizes and understands images, but the most interesting part is that it does not learn from realistic visuals; instead, it is trained entirely on imaginary pictures and captions. The researchers refer to this paradigm as "learning from models".

SynCLR workflow

With the help of Large Language Models (LLMs), numerous synthetic captions were created for each concept "c". These captions are brief descriptions of the concept. Next, with the help of text-to-image models, images were created from these synthetic captions, so every caption has a corresponding imaginary image. Using this technique, a large dataset of 600M synthetic images was created. With this set of synthetic images and their synthetic captions, the model is trained to understand pictures without using any real images.
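The pipeline above can be sketched in a few lines of Python. The `generate_caption` and `generate_image` functions below are hypothetical stand-ins for the LLM captioner and the text-to-image diffusion model, not SynCLR's actual code:

```python
# Sketch of the synthetic data pipeline: concept -> caption -> image.
# generate_caption and generate_image are hypothetical stand-ins.

CONCEPTS = ["power plant", "blue whale", "cat"]

def generate_caption(concept):
    # Stand-in for Llama-2: returns a short synthetic caption for the concept.
    return f"a photo of a {concept} on a sunny day"

def generate_image(caption):
    # Stand-in for the text-to-image diffusion model: returns a fake "image".
    return {"pixels": f"<synthetic image for: {caption}>"}

def build_synthetic_dataset(concepts, captions_per_concept=2):
    dataset = []
    for c in concepts:
        for _ in range(captions_per_concept):
            caption = generate_caption(c)
            image = generate_image(caption)
            dataset.append((image, caption))  # paired synthetic image + caption
    return dataset

dataset = build_synthetic_dataset(CONCEPTS)
print(len(dataset))  # 3 concepts x 2 captions each = 6 pairs
```

In the real system this loop runs at a much larger scale, producing the 600M-image dataset mentioned above.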

SynCLR example

The above image shows the whole process of SynCLR. A concept such as "power plant" is given to the LLM, which generates a caption for that concept. Then, with the help of a text-to-image diffusion model, multiple images are generated for the given concept.

Models trained on synthetic data have several pros and cons that still need further exploration. Synthetic data offers special control, such as hidden or specific settings and special conditions, providing a way to control and customize the data. Such data also takes less space, which makes it easier to save and share.

SynCLR example

To make the images useful, the model must understand them. So the model was trained with a combination of two methods: multi-positive contrastive learning, which teaches the model the differences and similarities among various images, and masked image modeling, which involves learning to understand the whole picture even when some details are hidden.
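To make the contrastive part concrete, here is a minimal NumPy sketch of a multi-positive contrastive loss: images generated from the same caption share a label and are treated as mutual positives, while everything else in the batch is a negative. This is an illustrative simplification, not SynCLR's exact loss:

```python
import numpy as np

def multi_positive_contrastive_loss(embeddings, labels, temperature=0.1):
    """Images sharing a label (e.g. generated from the same caption)
    are mutual positives; all other images in the batch are negatives."""
    # Cosine similarity between L2-normalized embeddings, scaled by temperature.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)  # exclude self-similarity
    # Log-softmax over each row (exp(-inf) = 0, so self pairs drop out).
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    n = len(labels)
    losses = []
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if positives:
            # Average the log-probability over all positives of sample i.
            losses.append(-np.mean(log_prob[i, positives]))
    return float(np.mean(losses))
```

With this loss, a batch whose same-caption images are close in embedding space scores lower (better) than one where positives are scattered, which is exactly the pressure that teaches the model similarity and difference.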

Through this method of training on made-up images and their respective made-up captions, SynCLR becomes extremely proficient at understanding realistic images, on par with other advanced methods such as CLIP and DINOv2.

Technicalities of SynCLR

The templates used to generate captions efficiently in SynCLR are explained below:

  • c -> caption
  • c, bg -> caption
  • c, rel -> caption


c -> caption: When a concept or an idea (let's represent it with "c") is given to Llama-2, it generates multiple sentences describing that concept. Below are some examples of captions for a concept "c".

c, bg -> caption: Images can be created by combining a visual concept (let's represent it with "c") with a background (let's represent it with "bg"). This creates sensible, realistic images using text-to-image models. One problem with this approach is that the sampled combinations sometimes don't make sense. For instance, randomly pairing a visual concept like "blue whale" with a background like "football field" creates an image that doesn't make much sense in reality. Below are some examples of a concept "c" with a background "bg".

c, rel -> caption: This adds more detail to the images by describing objects' positions or relationships (let's represent it with "rel"). For instance, if "cat" is the concept and "in front of" is the relationship, the LLM is asked to produce sentences like "a cute yellow cat is enjoying the fish in front of the sofa." To add variation, 10 different words were chosen to describe the relationship, such as "in front of," "behind," "beside," and more. Below are some examples of a concept "c" with a relationship "rel".
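The three templates can be sketched as simple prompt builders. The exact prompt wording sent to Llama-2 in SynCLR is more elaborate; the strings below are illustrative assumptions:

```python
import random

# Hedged sketch of the three caption templates (c, c+bg, c+rel).
# The 10 relationship words mirror the variation described in the text.
RELATIONS = ["in front of", "behind", "beside", "next to", "above",
             "below", "on top of", "under", "near", "inside"]

def template_c(c):
    # c -> caption
    return f"Describe an image of a {c} in one sentence."

def template_c_bg(c, bg):
    # c, bg -> caption
    return f"Describe an image of a {c} with {bg} as the background."

def template_c_rel(c, rel):
    # c, rel -> caption
    return f"Describe an image where a {c} is {rel} another object."

print(template_c_rel("cat", random.choice(RELATIONS)))
```

Sampling a concept, an optional background, and a random relationship word per caption is what gives the synthetic dataset its variety.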

This approach works very well for visual understanding tasks such as

  • Object Detection: Classifying and locating objects within an image. This includes drawing bounding boxes around objects and labeling them.
  • Image Classification: Identifying objects or categories within images, for example, distinguishing between small and large fish in an image.
  • Image Captioning: Creating descriptive captions for images. This involves understanding the content of an image and describing it in human-understandable language.
  • Semantic Segmentation: Identifying and segmenting different parts of an image at the pixel level, for example, separating foreground objects from the background.
  • Image Generation: Generating new and creative images according to given instructions, or modifying existing images.
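For classification tasks like those above, a frozen representation is typically evaluated by training a simple probe on top of its embeddings. The sketch below uses a nearest-centroid probe on toy embeddings, which is an illustrative stand-in for the linear probing used in practice:

```python
import numpy as np

# Toy evaluation of a frozen representation: a nearest-centroid probe.
# The "embeddings" here are synthetic clusters, not real SynCLR features.
rng = np.random.default_rng(0)

def nearest_centroid_predict(train_X, train_y, test_X):
    classes = sorted(set(train_y))
    # One centroid per class: the mean embedding of that class's samples.
    centroids = np.stack([train_X[np.array(train_y) == c].mean(axis=0)
                          for c in classes])
    # Assign each test embedding to its closest class centroid.
    dists = np.linalg.norm(test_X[:, None] - centroids[None], axis=2)
    return [classes[i] for i in dists.argmin(axis=1)]

# Two well-separated clusters standing in for two image classes.
train_X = np.concatenate([rng.normal(0.0, 0.1, (10, 4)),
                          rng.normal(3.0, 0.1, (10, 4))])
train_y = [0] * 10 + [1] * 10
test_X = np.array([[0.0] * 4, [3.0] * 4])
print(nearest_centroid_predict(train_X, train_y, test_X))  # [0, 1]
```

The better the learned representation separates classes in embedding space, the better even such a simple probe performs.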


Wrap Up!

SynCLR is a generative approach that takes a concept as input and generates captions for that concept. The generated captions are then used to create multiple images with a text-to-image model. These models are flexible, as they can create a wide range of images based on the concept given by the user.
