Segment Anything (Meta)

Cut out any object from any image

Segment Anything by Meta features the Segment Anything Model (SAM), a cutting-edge AI model from Meta AI that lets you “cut out” any object from any image with just a single click. SAM is a promptable segmentation system with zero-shot generalization to unfamiliar objects and images, without the need for additional training.

Segment Anything (Meta) Features

Training the model: SAM’s data engine

SAM’s advanced capabilities stem from being trained on millions of images and masks collected using a model-in-the-loop data engine. Researchers used SAM and its data to interactively annotate images and update the model. This process was repeated iteratively to improve both the model and the dataset.

11M images, 1B+ masks

After annotating a sufficient number of masks with the help of SAM, we were able to fully automate the annotation of new images using SAM’s advanced prompt-aware design. To do this, we provided SAM with a grid of points on the image and instructed it to segment everything at each point. Our final dataset contains over 1.1 billion segmentation masks, collected from approximately 11 million licensed and privacy-protected images.
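As a rough illustration of this grid-prompted, fully automatic annotation, here is a minimal sketch using the released segment-anything Python package. The checkpoint filename and image path are placeholders, and the default generator settings are assumed.

```python
# Minimal sketch: automatic mask generation with a grid of point prompts.
# Assumes the `segment-anything` package and a downloaded ViT-H checkpoint;
# "sam_vit_h.pth" and "example.jpg" are placeholder file names.
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
sam.to("cuda")  # the image encoder benefits from a GPU

mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # one dict per detected mask

print(len(masks), "masks; keys:", sorted(masks[0].keys()))
```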

Efficient & flexible model design

SAM is designed to be efficient enough to power its data engine. The model is decoupled into:

  1. a one-time image encoder, and
  2. a lightweight mask decoder, which can run in a web browser in just a few milliseconds per prompt (see the sketch below).
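A minimal sketch of how this split plays out in the reference Python API, assuming the segment-anything package and placeholder file names: the expensive image embedding is computed once, after which each prompt is decoded against it cheaply.

```python
# Minimal sketch: embed the image once, then decode many prompts against it.
# Checkpoint and image file names are placeholders.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth").to("cuda")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # runs the heavy image encoder once

# Each predict() call only runs the lightweight prompt encoder + mask decoder.
for point in [(120, 80), (340, 200), (500, 415)]:
    masks, scores, _ = predictor.predict(
        point_coords=np.array([point], dtype=np.float32),
        point_labels=np.array([1]),   # 1 = foreground click
        multimask_output=True,        # return several candidate masks per prompt
    )
    print(point, "best predicted IoU:", scores.max())
```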

Segment Anything (Meta) FAQs

What type of prompts are supported?

  • Foreground/background points
  • Bounding box
  • Mask
  • Text prompts are explored in our paper, but the capability is not released (point, box, and mask prompts are illustrated in the sketch below)
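For concreteness, a hedged sketch of how the released predictor API accepts each prompt type. Coordinates and file names are placeholders; the setup repeats the earlier sketch.

```python
# Sketch of the released prompt types against an already-embedded image.
# Coordinates and file names are placeholders.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth").to("cuda")
predictor = SamPredictor(sam)
predictor.set_image(cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB))

# 1) Foreground/background points: label 1 = foreground, 0 = background.
masks, scores, logits = predictor.predict(
    point_coords=np.array([[250, 187], [400, 300]], dtype=np.float32),
    point_labels=np.array([1, 0]),
    multimask_output=True,
)

# 2) Bounding box in XYXY pixel coordinates.
masks, scores, logits = predictor.predict(
    box=np.array([100, 100, 420, 380], dtype=np.float32),
    multimask_output=False,
)

# 3) Mask prompt: a low-resolution logit mask from a previous call
#    can be fed back in to refine the prediction.
best = int(np.argmax(scores))
masks, scores, _ = predictor.predict(
    point_coords=np.array([[250, 187]], dtype=np.float32),
    point_labels=np.array([1]),
    mask_input=logits[best][None, :, :],  # low-res logits, shape (1, 256, 256)
    multimask_output=False,
)
```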

What is the structure of the model?

  • A ViT-H image encoder that runs once per image and outputs an image embedding
  • A prompt encoder that embeds input prompts such as clicks or boxes
  • A lightweight transformer-based mask decoder that predicts object masks from the image embedding and prompt embeddings (see the sketch after this list)
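In the reference implementation these three components are exposed as submodules of the loaded model, so their sizes can be checked directly. The sketch below simply counts parameters; the checkpoint filename is a placeholder.

```python
# Sketch: inspect the three components of a loaded SAM model.
# Checkpoint filename is a placeholder.
from segment_anything import sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")

def count_params(module):
    return sum(p.numel() for p in module.parameters())

print("image encoder :", count_params(sam.image_encoder))   # ~632M for ViT-H
print("prompt encoder:", count_params(sam.prompt_encoder))   # small
print("mask decoder  :", count_params(sam.mask_decoder))     # ~4M together with the prompt encoder
```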

What platforms does the model use?

  • The image encoder is implemented in PyTorch and requires a GPU for efficient inference.
  • The prompt encoder and mask decoder can run directly in PyTorch or be converted to ONNX and run efficiently on CPU or GPU across a variety of platforms that support ONNX Runtime (see the sketch below).
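As one way to illustrate the ONNX path: the reference repository ships an export script for the prompt encoder + mask decoder, and the exported graph can be loaded with onnxruntime. The invocation and file names below are assumptions based on that repository; check its README for the exact, current command.

```python
# Sketch: export the prompt encoder + mask decoder to ONNX and load it on CPU.
# The export command and file names follow the reference repository and are
# assumptions; verify against the repo's README before running.
#
#   python scripts/export_onnx_model.py \
#       --checkpoint sam_vit_h.pth --model-type vit_h --output sam_decoder.onnx
#
import onnxruntime as ort

session = ort.InferenceSession("sam_decoder.onnx", providers=["CPUExecutionProvider"])

# The exported graph expects the precomputed image embedding plus the prompt
# tensors; print the declared inputs rather than hard-coding their names here.
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)
```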

How big is the model?

  • The image encoder has 632M parameters.
  • The prompt encoder and mask decoder have 4M parameters.

How long does inference take?

  • The image encoder takes ~0.15 seconds on an NVIDIA A100 GPU.
  • The prompt encoder and mask decoder take ~50ms on CPU in the browser using multithreaded SIMD execution.

What data was the model trained on?

  • The model was trained on our SA-1B dataset. See our dataset viewer.

How long does it take to train the model?

  • The model was trained for 3-5 days on 256 A100 GPUs.

Does the model produce mask labels?

  • No, the model predicts object masks only and does not generate labels.

Does the model work on videos?

  • Currently the model only supports images or individual frames from videos.

Where can I find the code?

  • Code is available on GitHub