MLNews

Open-Vocabulary SAM: Recognizing and Segmenting Simultaneously

Open-Vocabulary SAM combines object recognition and object segmentation in a single unified framework, bringing two of the most useful capabilities in computer vision together. The model was presented by researchers from S-Lab, Nanyang Technological University, and the Shanghai Artificial Intelligence Laboratory.

The model takes an image as input and, after processing, outputs the specified objects segmented (highlighted) in the image. It also recognizes each segmented object and labels it. The model can segment and recognize around 22,000 classes from around the world.

Open-Vocabulary SAM workflow

Open-Vocabulary SAM performs two actions at once, whereas previous SAM-based models handled only one task at a time. The model segments any object and performs real-world recognition on it at low computational cost. It accepts visual prompts in two modes: Point Mode and Box Mode.

Point Mode: In point mode, a single point is placed on the object that needs to be segmented.

The clip above is an example of point mode: the user points at a specific object and the model segments the pointed area. The model shows the recognized label (the vocabulary) of the pointed object in the text box below.
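For readers who want to see what a point prompt looks like in code, here is a minimal sketch using the original segment-anything predictor interface; Open-Vocabulary SAM's own API on GitHub may differ, and the checkpoint path, image file, and click coordinates below are placeholders:

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM backbone (checkpoint path is a placeholder).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

image = np.array(Image.open("demo.jpg").convert("RGB"))  # placeholder image
predictor.set_image(image)

# Point mode: a single foreground click on the object of interest.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),  # (x, y) pixel of the click
    point_labels=np.array([1]),           # 1 = foreground point
    multimask_output=False,
)
# Open-Vocabulary SAM additionally attaches a recognized label to the mask.
```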

Box Mode: In box mode, a rectangle is drawn around the object that needs to be segmented.

Box Mode in Open-Vocabulary SAM

The image above is an example of box mode in Open-Vocabulary SAM. The input image is given to the model along with two points (a starting point and an ending point) that define the box. The model draws the box and, after processing, segments the specified area as output. It also outputs the recognized label, which in the example above is “eel”.
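A box prompt works the same way, except the two corner points define a rectangle instead of a click. Below is a minimal sketch using the same segment-anything predictor interface; the file name and coordinates are placeholders, and Open-Vocabulary SAM would additionally report the class name:

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# Same setup as the point-mode sketch (checkpoint and image are placeholders).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)
predictor.set_image(np.array(Image.open("eel.jpg").convert("RGB")))

# Box mode: the starting and ending points give the rectangle as
# [x0, y0, x1, y1] in pixel coordinates (placeholder values).
masks, scores, _ = predictor.predict(
    box=np.array([50, 80, 420, 360]),
    multimask_output=False,
)
# Open-Vocabulary SAM would also attach a label to this mask (here, "eel").
```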

After running the demo, the generated output is not always satisfactory: the model sometimes fails to segment the specified area properly, producing a scattered, dotted mask that is not up to the mark.

In segmentation models, point prompts usually provide weaker hints than boxes, so end users tend to prefer box mode. In Open-Vocabulary SAM, however, point prompts are also accurate thanks to the techniques used in the model. A demo is available on HuggingFace for users to try themselves, and the code is open source and available on GitHub.

Technicalities of Open-Vocabulary SAM

Open-Vocabulary SAM segments objects considerably better than the original SAM, and it does so efficiently, without extra cost in parameters, time, or memory.

Open-Vocabulary SAM is a combination of SAM and CLIP. SAM is an expert at outlining and segmenting objects, whereas CLIP is an expert at recognizing them. The two are combined so that CLIP shares its recognition knowledge and SAM uses that knowledge while segmenting objects.
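To make this division of labor concrete, here is a minimal sketch of the naive way to combine the two models: cut out the region found by SAM and classify it with CLIP against a small candidate vocabulary. The labels, file name, and prompt template are illustrative; Open-Vocabulary SAM fuses the two models into one network rather than running them back to back like this:

```python
import clip
import torch
from PIL import Image

model, preprocess = clip.load("ViT-B/32")

# Candidate vocabulary (illustrative; the real model covers ~22,000 classes).
labels = ["eel", "coral", "sea turtle", "rock"]
text = clip.tokenize([f"a photo of a {name}" for name in labels])

# Region cropped out of the image using the SAM mask (placeholder file).
crop = preprocess(Image.open("sam_crop.png")).unsqueeze(0)

with torch.no_grad():
    image_feat = model.encode_image(crop)
    text_feat = model.encode_text(text)
    image_feat /= image_feat.norm(dim=-1, keepdim=True)
    text_feat /= text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

print(labels[int(probs.argmax())])  # label CLIP assigns to the segmented region
```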

The researchers introduced two modules, SAM2CLIP and CLIP2SAM. In SAM2CLIP, SAM shares its skills with CLIP through a translator module that makes sure CLIP understands them. Then, in CLIP2SAM, CLIP shares its expertise back with SAM. With the help of these modules, Open-Vocabulary SAM becomes more effective and efficient.
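The paper's exact module designs are more involved, but the idea can be sketched as two small adapters: SAM2CLIP aligns CLIP's image features with SAM's (a distillation-style "translation"), and CLIP2SAM projects the decoder's mask tokens into CLIP's space so each mask can be matched against class names. The dimensions and loss below are illustrative assumptions, not the paper's actual settings:

```python
import torch
import torch.nn as nn

class SAM2CLIP(nn.Module):
    """Translator adapter: maps CLIP image features into SAM's feature space
    so the CLIP backbone can stand in for SAM's heavy image encoder."""
    def __init__(self, clip_dim=768, sam_dim=256):  # illustrative sizes
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, sam_dim), nn.GELU(), nn.Linear(sam_dim, sam_dim)
        )

    def forward(self, clip_feats):
        return self.proj(clip_feats)

class CLIP2SAM(nn.Module):
    """Adapter in the other direction: projects mask tokens from the SAM-style
    decoder into CLIP's embedding space so each mask can be scored against
    the text embeddings of the ~22,000 class names."""
    def __init__(self, sam_dim=256, clip_dim=768):
        super().__init__()
        self.proj = nn.Linear(sam_dim, clip_dim)

    def forward(self, mask_tokens):
        return self.proj(mask_tokens)

# SAM2CLIP can be trained with a simple alignment (distillation) loss that
# pulls the translated CLIP features toward frozen SAM encoder features.
clip_feats = torch.randn(2, 64, 768)   # dummy CLIP encoder features
sam_feats = torch.randn(2, 64, 256)    # dummy frozen SAM encoder features
loss = nn.functional.mse_loss(SAM2CLIP()(clip_feats), sam_feats)
```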

Sometimes the model is unable to predict the right object. In the image above, the left side shows a case where it is hard to tell the difference between a table and a coffee table, even for humans, because they are very similar. On the right side, objects partially occlude each other, making it difficult to distinguish a bowl from a vase.

Wrap Up!

In Open-Vocabulary SAM, the researchers integrated SAM and CLIP for simultaneous segmentation and recognition in an image. It offers two prompt modes, Point Mode and Box Mode, and can segment and recognize around 22,000 classes from around the world.
