
DINOv: Visual In-context Prompting

Have you ever wanted an application that can instantly pick out objects in a dense image? DINOv offers a solution to this challenge, making the task much easier.

Researchers Feng Li, Qing Jiang, Hao Zhang, Tianhe Ren, Shilong Liu, Xueyan Zou, Huaizhe Xu, Hongyang Li, Chunyuan Li, Jianwei Yang, Lei Zhang, and Jianfeng Gao from HKUST, Microsoft Research (Redmond), IDEA, SCUT, Tsinghua, and UW-Madison presented this interactive model.

In recent times, Large Language Models (LLMs) have shown the ability to perform many tasks and generate creative content without task-specific training, but vision models still lack this flexibility. Existing models pick up the most common objects in a picture and skip some of the important details that are needed.

These models are also not capable enough to handle important image tasks such as identifying and localizing objects, and they require extra effort to recognize objects they haven't been trained on.

Workflow of DINOv

In DINOv, a new "prompt encoder" is designed to make images more understandable to the computer. This addition makes the system more flexible, helping it follow different types of instructions and execute them in different ways. The output comes in the form of segmentation masks, which make the image easier to interpret. (DINOv2 is a prior model that can learn from any collection of photos since it uses self-supervision.)
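To make the idea of a prompt encoder concrete, here is a minimal, runnable sketch of how user-drawn boxes and points might be turned into embeddings that condition a mask decoder. All class names, projections, and shapes are illustrative assumptions, not DINOv's reference implementation.

```python
import torch
import torch.nn as nn

class VisualPromptEncoder(nn.Module):
    """Hypothetical sketch of a prompt encoder: embed user-drawn boxes and
    points so a mask decoder can be conditioned on them. Names and shapes
    are assumptions, not DINOv's actual implementation."""
    def __init__(self, dim=256):
        super().__init__()
        self.box_proj = nn.Linear(4, dim)    # (x1, y1, x2, y2) -> token
        self.point_proj = nn.Linear(2, dim)  # (x, y) -> token

    def forward(self, boxes=None, points=None):
        tokens = []
        if boxes is not None:                # boxes: (N, 4), normalized
            tokens.append(self.box_proj(boxes))
        if points is not None:               # points: (M, 2), normalized
            tokens.append(self.point_proj(points))
        return torch.cat(tokens, dim=0)      # prompt tokens for the decoder

enc = VisualPromptEncoder()
out = enc(boxes=torch.rand(2, 4), points=torch.rand(3, 2))
print(out.shape)  # torch.Size([5, 256])
```

In this sketch, a mask decoder would cross-attend from these prompt tokens to the image features and emit one segmentation mask per prompt.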

The DINOv research paper is available on arXiv and the code is available on GitHub. For visualization and demos, the researchers recommend the T-Rex demo, another visual prompting tool with properties similar to DINOv's.

How DINOv works

DINOv is an adaptable visual prompting model for detection and segmentation built on MaskDINO. The model supports generic segmentation and referring segmentation from a single visual prompt or from several, and multiple in-context visual prompts can be given as input to improve segmentation performance.

DINOv looks at the whole picture but also gives importance to small objects. To understand this, consider an example: have you ever played a puzzle game with a dense image, where the puzzle maker tells you to focus on a specific piece? This model does the same thing for you. You point out specific objects, and these act as clues or hints that help the model understand the picture better. The best part is that you can provide as many clues as you want; one way of combining them is sketched below.
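Since the article reports that more in-context prompts improve segmentation, here is a small runnable sketch of one simple way several prompt embeddings could be pooled into a single query. Averaging is an assumption made for illustration; the paper's actual aggregation may differ.

```python
import torch

def aggregate_prompts(prompt_embeddings: torch.Tensor) -> torch.Tensor:
    # prompt_embeddings: (num_prompts, dim), one row per user-drawn clue.
    # Mean-pooling is an illustrative choice, not DINOv's exact strategy.
    return prompt_embeddings.mean(dim=0, keepdim=True)  # -> (1, dim)

clues = torch.randn(5, 256)        # five clues for the same object category
query = aggregate_prompts(clues)   # a single query for the mask decoder
print(query.shape)                 # torch.Size([1, 256])
```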

DINOv is a unified framework that performs both generic segmentation and referring image segmentation. Image features are extracted by a vision encoder, and visual in-context prompting is extended to support generic visual tasks, as shown in the image below.

DINOv architecture
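To summarize the flow in code, here is a minimal, runnable sketch of the pipeline (vision encoder, prompt encoder, mask decoder). Every module below is a simplified stand-in, not the MaskDINO-based components described in the paper.

```python
import torch
import torch.nn as nn

class DINOvSketch(nn.Module):
    """Simplified stand-in for the unified flow: encode the image, encode
    the visual prompts, then decode one mask per prompt. Not the paper's
    actual architecture."""
    def __init__(self, dim=256):
        super().__init__()
        self.vision_encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.prompt_encoder = nn.Linear(4, dim)  # boxes -> query tokens
        self.decoder = nn.TransformerDecoderLayer(dim, nhead=8,
                                                  batch_first=True)
        self.mask_head = nn.Linear(dim, dim)

    def forward(self, image, boxes):
        feats = self.vision_encoder(image)        # (B, dim, H/16, W/16)
        B, D, H, W = feats.shape
        feats = feats.flatten(2).transpose(1, 2)  # (B, HW, dim)
        queries = self.prompt_encoder(boxes)      # (B, N, dim)
        queries = self.decoder(queries, feats)    # cross-attend to image
        # Dot-product each query with pixel features -> one mask per prompt.
        masks = torch.einsum("bnd,bpd->bnp", self.mask_head(queries), feats)
        return masks.view(B, -1, H, W)

model = DINOvSketch()
img = torch.randn(1, 3, 224, 224)
boxes = torch.rand(1, 2, 4)        # two visual prompts
print(model(img, boxes).shape)     # torch.Size([1, 2, 14, 14])
```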

Thorough experiments and extensive visualizations demonstrate that the model is capable of managing various tasks related to identifying objects in images and videos. Initial trials show encouraging outcomes, especially when identifying and detecting new objects from visual cues.

Dataset and Experiments

For experimentation, the data was categorized into two types: segmentation data with semantic labels (using COCO2017) and segmentation data with only pixel-level annotations (a dataset of around 110K images). The model was evaluated on different tasks and datasets, and the results show significantly better performance compared to previous models.
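As an illustration of how such a two-source training mix might be wired up, here is a minimal sketch using standard PyTorch utilities. The dataset classes, file names, and fields are placeholders, not the paper's actual data pipeline.

```python
from torch.utils.data import Dataset, ConcatDataset, DataLoader

class SemanticSegData(Dataset):
    """Placeholder for segmentation data with semantic labels (e.g. COCO2017)."""
    def __len__(self):
        return 10
    def __getitem__(self, i):
        return {"image": f"coco_{i}.jpg", "mask": f"coco_mask_{i}.png",
                "label": "person"}          # semantic label available

class PixelOnlyData(Dataset):
    """Placeholder for segmentation data with pixel annotations only."""
    def __len__(self):
        return 10
    def __getitem__(self, i):
        return {"image": f"raw_{i}.jpg", "mask": f"raw_mask_{i}.png",
                "label": None}              # no semantic label

mixed = ConcatDataset([SemanticSegData(), PixelOnlyData()])
loader = DataLoader(mixed, batch_size=4, shuffle=True,
                    collate_fn=lambda batch: batch)  # keep samples as dicts

for batch in loader:
    print([sample["label"] for sample in batch])
    break
```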

Conclusion

DINOv is a comprehensive system designed for in-context visual prompting that works efficiently for both referring segmentation and generic segmentation tasks. It showcases remarkable abilities both in referring to specific elements and in understanding general features within images, using in-context visual cues to guide detection and identification.

DINOv performs comparably to closed-set segmentation methods when tested on datasets similar or closely related to its training data. Additionally, it shows encouraging outcomes on various open-set segmentation benchmarks, where the model encounters new or unseen objects.
