Grounded SAM: A Unified Model for Diverse Visual Tasks

Grounded SAM delivers more than plain detection and segmentation. Presented by the International Digital Economy Academy (IDEA) and its community, the model takes text and an image as input and generates segmented output complete with tags, boxes, and labels, enabling it to handle complex visual tasks.

Workflow of Grounded SAM

Grounded SAM uses Grounding DINO to detect objects from a text prompt and passes the resulting boxes to SAM (Segment Anything Model), which segments the corresponding regions. This modular design also makes it easy to link further models into the pipeline.
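The text-to-boxes-to-masks chain can be sketched as follows. The two model calls below are hypothetical stand-ins for Grounding DINO and SAM, not the real APIs; an actual pipeline would load the released checkpoints instead.

```python
# Illustrative sketch of the Grounded SAM pipeline with stub models.

def detect_boxes(image, text_prompt):
    """Stand-in for Grounding DINO: text prompt -> labelled boxes."""
    # Pretend the prompt matched one region of the image.
    return [{"label": text_prompt, "box": (40, 60, 200, 220), "score": 0.91}]

def segment_box(image, box):
    """Stand-in for SAM: box prompt -> binary mask (here, just the box)."""
    x0, y0, x1, y1 = box
    return [[x0 <= x <= x1 and y0 <= y <= y1 for x in range(image["w"])]
            for y in range(image["h"])]

def grounded_sam(image, text_prompt):
    """Chain detection and segmentation: text -> boxes -> masks."""
    results = []
    for det in detect_boxes(image, text_prompt):
        mask = segment_box(image, det["box"])
        results.append({**det, "mask": mask})
    return results

image = {"w": 256, "h": 256}  # stand-in image metadata
out = grounded_sam(image, "dog")
print(out[0]["label"], out[0]["box"])
```

The key point is the interface between the two stages: Grounding DINO's box output is exactly the prompt format SAM accepts, which is what makes the combination work without retraining either model.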

Grounding DINO and SAM

The basic components of this model are discussed below:

  1. Segment Anything Model (SAM): SAM is an open-world segmentation model that accurately segments any element or object in an image from point, box, or text prompts. It was trained on a large dataset of 11 million images and 1.1 billion masks and performs well even on objects it has never seen before. However, SAM on its own needs spatial prompts such as points or boxes, rather than plain text, to produce masks.
  2. Grounding DINO: An open-set object detector that finds objects matching any text description. It was trained on an extensive dataset of over 10 million images covering object detection, visual grounding, and image-text pairs. However, it requires text input and outputs only bounding boxes, not masks.

Efficient Image Editing Through Grounded SAM Using Stable Diffusion (SD)

By adding an image generation model, precise and controlled changes to images become possible, such as altering appearance, replacing objects, or removing specific regions. Users of Grounded SAM can obtain accurate masks by clicking or drawing boxes, or they can use grounding with text prompts to find the regions to edit automatically.
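The editing step reduces to "change only the masked pixels". A minimal sketch of that idea is below; in a real editor the constant fill colour would be replaced by a call to an inpainting model such as Stable Diffusion.

```python
# Mask-based editing sketch: replace only the pixels a Grounded SAM
# mask selects, leaving the rest of the image untouched.

def edit_masked_region(image, mask, fill):
    """Return a copy of `image` with masked pixels replaced by `fill`."""
    return [[fill if mask[y][x] else image[y][x]
             for x in range(len(image[0]))]
            for y in range(len(image))]

# 3x3 grey image; the mask selects only the centre pixel.
image = [[(128, 128, 128)] * 3 for _ in range(3)]
mask = [[False, False, False],
        [False, True,  False],
        [False, False, False]]
edited = edit_masked_region(image, mask, (255, 0, 0))
print(edited[1][1])   # -> (255, 0, 0): masked pixel replaced
print(edited[0][0])   # -> (128, 128, 128): unmasked pixel unchanged
```

Because edits are restricted to the mask, object replacement and removal stay localized, which is what makes the Grounded SAM + SD combination controllable.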

Image Editing Through Grounded SAM

OSX and Grounded SAM

OSX is a cutting-edge model designed to create a detailed 3D mesh of the whole body. It recovers 3D poses, facial expressions, and hand movements from images: it detects humans, crops and adjusts the bounding boxes, and then reconstructs the specified person in detail.

Combining Grounded SAM with OSX creates a versatile and flexible pipeline that detects and reconstructs the complete body of a specific person from a text prompt: Grounded SAM pinpoints the person, and OSX creates a 3D mesh of that person. This makes it possible to analyze human body movements based on given instructions.


Recognize Anything Model (RAM): RAM is a powerful image-tagging model. It can recognize a wide range of categories but cannot produce boxes or masks for the categories it recognizes.

Within the Grounded SAM framework, the capabilities of Grounding DINO turn the tags produced by RAM into bounding boxes, and SAM then generates a mask for each detected object. Together this labels the whole image, yielding an automated labelling system that reduces the cost of annotation.
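The automatic labelling loop can be sketched as below. All three model calls are hypothetical stubs standing in for RAM, Grounding DINO, and SAM; their return values are invented for illustration.

```python
# Sketch of the RAM + Grounded SAM auto-labelling loop: tags drive
# detection, and each detection drives segmentation.

def tag_image(image):
    """Stand-in for RAM: image -> list of category tags."""
    return ["person", "bicycle"]

def detect(image, tag):
    """Stand-in for Grounding DINO: tag -> boxes for that category."""
    return [(10, 10, 50, 50)] if tag == "person" else [(60, 60, 90, 90)]

def segment(image, box):
    """Stand-in for SAM: box -> mask (represented here by its box)."""
    return {"box": box}

def auto_label(image):
    """Tag, detect, and segment every category: a full-image annotator."""
    annotations = []
    for tag in tag_image(image):
        for box in detect(image, tag):
            annotations.append({"label": tag, "mask": segment(image, box)})
    return annotations

labels = auto_label(image=None)  # stub models ignore the image
print([a["label"] for a in labels])
```

No human draws a single box in this loop, which is why chaining the three models amounts to an automated labelling system.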

Grounded SAM combined with RAM for labelling


Grounding DINO and SAM were combined to form Grounded SAM, which was evaluated on the SegInW (Segmentation in the Wild) benchmark. SegInW measures the ability to understand and segment objects in extremely diverse, real-world images.

In this evaluation, Grounded SAM achieved an impressive mean Average Precision (mAP) of 48.7. This score shows that the model is precise in understanding, recognizing, and locating objects in real-world images, and that the combined models work together efficiently.
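For readers unfamiliar with the metric behind the 48.7 figure, Average Precision ranks detections by confidence, marks each as a true or false positive, and integrates the precision-recall curve; mAP averages this over categories. A toy computation:

```python
# Toy Average Precision: area under the precision-recall curve,
# accumulated as recall increases down the ranked detection list.

def average_precision(hits, num_gt):
    """`hits`: per-detection True/False, in descending confidence order.
    `num_gt`: number of ground-truth objects for this category."""
    ap, tp, prev_recall = 0.0, 0, 0.0
    for i, hit in enumerate(hits, start=1):
        if hit:
            tp += 1
            precision = tp / i
            recall = tp / num_gt
            ap += precision * (recall - prev_recall)
            prev_recall = recall
    return ap

# 4 ranked detections against 3 ground-truth objects: hit, miss, hit, hit.
print(round(average_precision([True, False, True, True], num_gt=3), 3))
# -> 0.806
```

A perfect detector (all hits, all ground truth found) scores 1.0; benchmark mAP like SegInW's is this value averaged over every category, so 48.7 reflects accuracy across many object types at once.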

Wrap Up!

Grounded SAM, along with its extensions, can address a wide range of image detection and segmentation tasks. Combining different expert models opens a new era of research and applications in computer vision. The researchers have released a demo of Grounded SAM, although it was not running at the time of writing.

