
Osprey: Visual Understanding through Different Prompting Styles

Osprey revolutionizes the combination of language and vision, marking the dawn of a new era in visual language comprehension where every pixel tells a tale. Researchers from Zhejiang University, Ant Group, Microsoft, and The Hong Kong Polytechnic University presented this model.

Osprey enhances multimodal large language models (MLLMs) as a visual language model. Through a mask-text instruction approach, it provides comprehensive, precise, and fine-grained visual understanding of an image, generating both short and detailed semantic descriptions of the input mask region.

Osprey

The image above shows some examples of this model. As can be seen, the user selects any region as a point or a box and asks something about that specific region, which the Osprey segmentation model then identifies.

Do you find it captivating? Let's examine it in more detail. The model produces output in three interesting modes, point-prompt, box-prompt, and segment-everything, to showcase the semantic association between objects and their surroundings. You can try their interactive demo, and I assure you that this model shows outstanding results.

Point-Prompt in Osprey

Point Prompt

Come and experience the point-prompt functionality with me. I chose an image from the provided gallery, but you can also upload an image of your own choice. I pointed at the region at the top of the house with the "short description" option; it takes about 1 second to generate the segmentation result, and you can see the output. The model generates a mask on the pointed object. If you want to switch to another object, just click on it to see its details. I found this interactive model quite interesting, would you like to try it?

Point Prompt

Box Prompt in Osprey

Box Prompt

As with the point-prompt, I ran the model on the images provided by the researchers, but this time with the detailed description option. It takes a bit longer, about 2 seconds, to generate a detailed description. Let's see the results in the images below.

Box Prompt

Segment Everything in Osprey

Segment Everything

Now we are heading towards the most interesting mode of Osprey. Here you can segment every detail present in the image without skipping even the most minute object. A single click segments everything, and not only that, you can point at any masked object in the output to see what the model has to say about that specific object. You might be thinking that this will take much time, right? No, it takes about 1.5 seconds to generate the segmentation result and about 0.8 seconds for the short description.

Segment Everything
Segment Everything

The model shows its best performance as the input image size increases to 800×800, but to balance computational cost and performance, the suitable input size is 512×512.
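If you preprocess images yourself, a minimal sketch of resizing an input to the recommended 512×512 could look like the snippet below. This uses plain PIL resizing as an assumption for illustration; the project's own preprocessing pipeline may differ.

```python
from PIL import Image

# Resize an input image to the recommended 512x512 before feeding it to the model.
# (Plain bicubic resize shown for illustration; Osprey's actual preprocessing may differ.)
img = Image.open("example.jpg").convert("RGB")
img = img.resize((512, 512), Image.BICUBIC)
img.save("example_512.jpg")
```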

Osprey is an approach that extends the abilities of multimodal large language models (MLLMs). It is a visual model with a masking feature that lets the end user capture minute details at different granularities. A large-scale instruction dataset coupling masks and text, named Osprey-724K, was constructed with different levels of detail: object-level, part-level, and additional instructions. The model significantly outperforms previous models across an extensive range of region-perception tasks.

The model comprises an image-level vision encoder, a pixel-level mask-aware visual extractor, and a large language model (LLM). The input image is tokenized and converted into embeddings; the LLM receives the mask embeddings together with the language embeddings and produces the outcome as a semantic understanding of the selected region.
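To make that flow concrete, here is a minimal, hypothetical sketch of how the pieces could fit together. The module choices, shapes, and mask pooling below are illustrative assumptions of a mask-aware pipeline, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins (NOT the real Osprey modules): an image-level vision
# encoder, a mask-aware extractor that pools features inside the region mask,
# and a small transformer standing in for the LLM that consumes the tokens.
vision_encoder = nn.Conv2d(3, 256, kernel_size=16, stride=16)  # image -> feature map
mask_extractor = nn.Linear(256, 512)                           # region feature -> LLM token
llm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=2,
)

image = torch.randn(1, 3, 512, 512)        # input image (512x512, as recommended above)
region_mask = torch.zeros(1, 1, 32, 32)    # binary mask at feature-map resolution
region_mask[..., 8:20, 8:20] = 1.0         # hypothetical user-selected region

features = vision_encoder(image)                                        # (1, 256, 32, 32)
pooled = (features * region_mask).sum(dim=(2, 3)) / region_mask.sum()   # mask-pooled region feature
region_token = mask_extractor(pooled).unsqueeze(1)                      # (1, 1, 512)

text_tokens = torch.randn(1, 16, 512)      # placeholder language embeddings of the prompt
output = llm(torch.cat([region_token, text_tokens], dim=1))             # joint reasoning over mask + text
print(output.shape)                        # (1, 17, 512), later decoded into a region description
```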

The model has a wide range of applications across industries, such as detailed scene interpretation, image recognition, and object detection. The research is available on arXiv, the code is available on GitHub, and a demo of Osprey is also available so end-users can get more clarity about this approach.

The Osprey-724K Dataset

Osprey-724K comprises object-level and part-level masks coupled with text instruction data, generated from open-source datasets, totaling 724K multimodal dialogues for pixel-level image understanding.
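To give a feel for what a mask-text instruction pair might look like, here is a hypothetical sample in Python dictionary form. The field names and values are assumptions made for illustration, not the actual Osprey-724K schema.

```python
# Hypothetical Osprey-724K-style record (field names are illustrative, not the real schema).
sample = {
    "image": "coco/train2017/000000123456.jpg",          # image from an open-source dataset
    "regions": [
        {
            "mask_rle": "<run-length-encoded binary mask>",  # pixel-level mask of the region
            "level": "object",                               # object-level or part-level
        }
    ],
    "conversation": [
        {"from": "human", "value": "Give a short description of <region1>."},
        {"from": "assistant", "value": "A brown dog lying on a wooden porch."},
    ],
}
```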

Training and Evaluation

The training of the model consists of three stages: image-text alignment pre-training using a CLIP vision encoder; mask-text alignment pre-training, which collects short text and pixel-level mask pairs from available object-level datasets such as COCO; and end-to-end fine-tuning, where the Osprey-724K dataset is utilized. Training was conducted on four NVIDIA A100 GPUs with 80GB memory each.
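A hedged sketch of that three-stage schedule as a configuration list is shown below. The stage names and data sources follow the description above, while the "trainable" fields and other details are my assumptions for illustration, not the paper's exact settings.

```python
# Three-stage training schedule (trainable modules and field values are placeholders,
# not the paper's actual configuration).
training_stages = [
    {
        "name": "image_text_alignment_pretraining",
        "data": "image-text pairs (CLIP vision encoder)",
        "trainable": ["projector"],                       # assumed: align vision features with LLM space
    },
    {
        "name": "mask_text_alignment_pretraining",
        "data": "pixel-level mask / short-text pairs (e.g. COCO)",
        "trainable": ["mask_extractor", "projector"],     # assumed
    },
    {
        "name": "end_to_end_finetuning",
        "data": "Osprey-724K",
        "trainable": ["mask_extractor", "projector", "llm"],  # assumed
    },
]
hardware = {"gpus": 4, "type": "NVIDIA A100", "memory_gb": 80}
```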

Conclusion

Osprey can understand image regions at both the part level and the object level, providing pixel-level alignment between language and vision. Osprey showed more advanced capabilities compared to Aligning and Prompting Everything (APE).
