SAM 3 Inference and Paper Explanation



SAM (Segment Anything Model) 3 is the latest iteration in the SAM family. It builds upon the success of SAM 2 with major improvements: it now supports Promptable Concept Segmentation (PCS) and can accept text prompts from users. Furthermore, SAM 3 is a unified model that includes a detector, a tracker, and a segmentation model. In this article, we will briefly cover the SAM 3 paper along with SAM 3 inference.

Figure 1. SAM 3 video inference demo.

The SAM series comprises some of the pioneering models in the world of computer vision. From healthcare to automotive tasks, these models now power a range of industries. In this article, we will see what makes SAM 3 even better.

We will cover the following topics in SAM 3:

  • Why SAM 3?
  • What changes in SAM 3 compared to SAM 2?
  • What is the architecture of SAM 3?
  • How was the dataset curated for SAM 3?
  • Inference using SAM 3.

Note: The paper explanation will not be exhaustive. We will cover some of the most important aspects briefly, making it easier to skim through.

Why SAM 3?

SAM 3 was introduced in the paper – SAM 3: Segment Anything with Concepts by authors from Meta.

Figure 2. SAM 3 – Promptable Visual Segmentation

The primary motivation behind creating SAM 3 was to move beyond “segment one object per prompt”. Given a single prompt, be it text, box, point, or a mask, SAM 3 can find and segment all instances of the object belonging to the same category. The authors term this Promptable Concept Segmentation (PCS).

In practice, SAM 3:

  • Supports open-vocabulary concept prompts, allowing users to segment all objects matching a noun phrase (e.g., “all red cars”).
  • Accepts text prompts and image exemplars, making it more flexible than point-based or box-based prompting.
  • Unifies detection, segmentation, and tracking into a single model.
  • Introduces a large, curated dataset and benchmark to train and evaluate concept-level segmentation.

Compared to previous iterations, SAM 3 significantly improves segmentation accuracy and robustness, especially in open-vocabulary and multi-instance scenarios.

What Changes in SAM 3 Compared to SAM 2?

SAM 3 brings several fundamental changes compared to SAM 2. Here are the most important ones.

Promptable Concept Segmentation (PCS)

SAM 2 focused on interactive segmentation, accepting bounding boxes, point coordinates, and masks from users. SAM 3 moves beyond, accepting text-based prompts along with interactive prompts. This enables concept-level understanding of an image. Furthermore, unlike SAM 2, SAM 3 can detect, segment, and track all objects belonging to the same category based on the prompt.

Unified Architecture

SAM 3 is not just a segmentation model anymore. It integrates a detection model (DETR-based), a segmentation model (MaskFormer-based), and a tracker.

Presence Token and Presence Head

There will be instances when the object referred to by a text prompt is not present in an image or video. In such cases, the noun phrase should never reach the detector for recognition and localization. The presence token and presence head handle this: SAM 3 adds a global presence head that predicts whether a prompted concept exists in the image at all. This decouples recognition (what the concept is) from localization (where it is) and improves robustness in difficult or negative cases.
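To make the decoupling concrete, here is a toy sketch of presence gating: a single global presence probability multiplies every per-query match score, so localization scores are suppressed when the concept is absent. This is an illustration of the idea only, not SAM 3's actual head.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_query_scores(presence_logit, query_logits):
    """Toy presence gating: one global presence probability scales every
    per-query localization score (conceptual sketch, not SAM 3's real head)."""
    presence = sigmoid(presence_logit)  # is the concept in the image at all?
    return [presence * sigmoid(q) for q in query_logits]


# Concept clearly absent: even confident queries get suppressed.
scores = gated_query_scores(presence_logit=-6.0, query_logits=[4.0, 2.0, -1.0])
print(max(scores) < 0.01)  # → True
```

With a strongly negative presence logit, every query score collapses toward zero, which is exactly the behavior that helps on hard negative prompts.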

SAM 3 Architecture

The authors put it aptly: “SAM 3 is a generalization of SAM 2, supporting the new PCS task as well as the PVS task”.

However, there are some major architectural changes in SAM 3 as well. SAM 3 contains a Perception Encoder that acts as a common backbone for detection and tracking.

Figure 3. SAM 3 architecture.

Here, we discuss the core components of the SAM 3 model.

Perception Encoder

This is a shared backbone between the detector and tracker. It also has the task of aligning the vision and language parts of the prompt. First, it encodes the images and then aligns them with the text prompts.
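The alignment step can be pictured as scoring image features against text features in a shared embedding space. Below is a toy, framework-free sketch of similarity scoring; the embeddings and values are made up, and the real encoder produces high-dimensional aligned features:

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


# Toy 3-D embeddings (purely illustrative values).
image_region = [0.9, 0.1, 0.2]   # encoded image feature
text_cat     = [0.8, 0.2, 0.1]   # encoded prompt "cat"
text_dog     = [0.1, 0.9, 0.3]   # encoded prompt "dog"

# A well-aligned encoder scores the matching prompt higher.
print(cosine(image_region, text_cat) > cosine(image_region, text_dog))  # → True
```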

Prompt-Conditioned Detector

The detector of SAM 3 follows the DETR model paradigm, using a transformer-based, query-driven decoder.

The DETR-style decoder outputs object queries that feed into MaskFormer-style instance segmentation heads, and the model also includes a binary semantic segmentation head that predicts pixel-level membership for the prompted concept.

Tracker and Memory Bank for Video

The tracker and memory bank essentially remain the same as in SAM 2.

For video inputs, SAM 3 maintains a memory of object appearances and propagates masks across frames, periodically re-aligning with detector outputs to prevent drift.
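Conceptually, the memory bank keeps a rolling window of recent frame features that condition the next frame's mask prediction. The following toy sketch illustrates only the rolling-window behavior; the class name, window size, and stored values are hypothetical, not SAM 3's actual memory encoding:

```python
from collections import deque


class MaskMemory:
    """Toy rolling memory of recent frame features (conceptual sketch only)."""

    def __init__(self, max_frames=7):
        # Oldest entries are evicted automatically once the window is full.
        self._mem = deque(maxlen=max_frames)

    def add(self, frame_idx, features):
        self._mem.append((frame_idx, features))

    def context(self):
        # Past entries used to condition the mask prediction for the next frame.
        return list(self._mem)


memory = MaskMemory(max_frames=7)
for t in range(10):
    memory.add(t, features=f"feat_{t}")

# Only the most recent 7 frames are retained.
print([idx for idx, _ in memory.context()])  # → [3, 4, 5, 6, 7, 8, 9]
```

Periodic re-alignment with detector outputs, as the article notes, keeps such a memory from drifting over long videos.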

How Was the Dataset Curated for SAM 3?

The datasets and data engine are perhaps the most important aspects of the SAM 3 ecosystem. The data engine now employs a human + model in the loop process to create the Segment Anything – Concepts (SA-Co) dataset.

Figure 4. SAM 3 data engine – with human annotators and AI annotators in the loop.

Here is a concise explanation of the data engine.

Step 1 – Initial Human Annotation

Images and videos are paired with noun phrases, and SAM 2 generates candidate masks. Humans verify mask quality and exhaustiveness.

Step 2 – AI Assisted Verification

The accepted/rejected labels from human annotators in step 1 are used to train an AI verifier. Essentially, the SAM 3 team trained a Llama 3.2 model here.

The trained model performs mask quality verification and exhaustiveness checks. The Llama model is also trained to filter hard negative noun phrases.

Step 3 – Dataset Scaling and Domain Specific Collection

Next, the dataset is expanded with domain-specific concepts using the AI models. Furthermore, the authors expand the dataset by mining concepts from an ontology based on Wikidata. This enables coverage of long-tail and fine-grained concepts.

Step 4 – Expansion to Video Dataset

The pipeline is extended to video datasets. This includes creating spatio-temporal mask annotations (masklets) using a mature SAM 3 model, with verification done by humans.

Finally, SA-Co consists of three different datasets:

  • SA-Co/HQ: millions of high-quality annotated images with millions of unique noun phrases.
  • SA-Co/SYN: large-scale synthetic annotations used to expand coverage.
  • SA-Co/VIDEO: tens of thousands of annotated videos with concept-level labels.

I highly recommend going through the paper if you wish to get a detailed understanding of SAM 3. In the next section, we will move to inference using SAM 3.

Inference Using SAM 3

In this section, we will cover a minimal codebase for inference using SAM 3 for images and videos.

We will cover:

  • Setting up the environment for SAM 3.
  • SAM 3 inference on images.
  • SAM 3 inference on videos.

Directory Structure

Let’s take a look at the project directory structure.

.
├── build_sam3
│   └── sam3
├── input
│   ├── image_1.jpg
│   ...
│   └── video_2.mp4
├── NOTES.md
├── outputs
│   ├── video_1_3sec_20251214_192510
│   ├── video_1_3sec_20251214_192755
│   └── video_2_3sec_20251214_200715
├── requirements.txt
├── requirements_win.txt
├── sam3_image.py
├── sam3_video.py
├── setup.bat
├── setup.sh
└── utils.py
  • The sam3_image.py and sam3_video.py files are the primary scripts for image and video inference, respectively. The utils.py file contains minimal helper functions.
  • The input and outputs directories contain the inference data and results, respectively.
  • We have different setup files and requirement files for both Ubuntu and Windows for easier installation of requirements.
  • The build_sam3 directory gets automatically created when we run the setup.

You can download the zip file that comes with the article if you wish to run inference locally. All the Python files, setup files, and input data are provided in the zip file.


Set Up and Requirements Installation for SAM 3 Inference

After downloading and extracting the zip file, you can run the following to set everything up.

  • Make sure that you create a virtual environment with Python 3.12. Anaconda/Miniconda is recommended.
  • Install PyTorch with CUDA support. The code for this article was tested with PyTorch 2.9.1 with CUDA 13.0 support.
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu130
  • If you are using Ubuntu OS, run the following:
sh setup.sh
  • If you are using Windows OS, run the following:
.\setup.bat

Everything should get installed automatically. You might occasionally be asked to enter your password for system-level installations.

SAM 3 Image Inference

Let’s jump into the SAM 3 image inference code.

The code is present in the sam3_image.py file.

Import Statements and Argument Parser

The following code block contains the imports and the argument parser that we need.

import matplotlib.pyplot as plt
import argparse
import os

from PIL import Image
from sam3.model_builder import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor
from utils import overlay_masks
from pathlib import Path
from datetime import datetime


def parse_arguments():
    parser = argparse.ArgumentParser(
        description='Process an image with SAM3 using a text prompt.'
    )
    parser.add_argument(
        '--image-path',
        dest='image_path',
        type=str,
        help='Path to the input image',
        default='input/image_1.jpg'
    )
    parser.add_argument(
        '--prompt',
        dest='prompt',
        type=str,
        help='Text prompt for segmentation',
        default='cat'
    )
    parser.add_argument(
        '--visualize',
        dest='visualize',
        action='store_true',
        help='Visualize the result'
    )
    return parser.parse_args()

os.makedirs('outputs', exist_ok=True)

We import the build_sam3_image_model function to load the SAM 3 model and Sam3Processor to later initialize the SAM 3 image processor.

We also import overlay_masks from the utils module that helps us overlay the segmentation mask on top of the input image.

Apart from that, we define command line arguments for the input image path, the text prompt for the object that we want to segment, and whether or not we want to visualize the final result.

Functions to Load Model, Process Image & Run Inference, and Save Results

The following code block contains three functions.

def load_model():
    """Load the SAM3 image model and processor."""
    model = build_sam3_image_model()
    processor = Sam3Processor(model)
    return model, processor


def process_image(processor, image_path, prompt):
    """
    Process an image with SAM3 using a text prompt.
    
    Args:
        processor: Sam3Processor instance
        image_path: Path to the input image
        prompt: Text prompt for segmentation
        
    Returns:
        tuple: (original_image, result_image, masks, boxes, scores)
    """
    # Load the image
    image = Image.open(image_path)
    
    # Set image and get inference state
    inference_state = processor.set_image(image)
    
    # Prompt the model with text
    output = processor.set_text_prompt(state=inference_state, prompt=prompt)
    
    # Get the masks, bounding boxes, and scores
    masks = output['masks']
    boxes = output['boxes']
    scores = output['scores']
    
    # Create overlay visualization
    result = overlay_masks(image, masks)
    
    return image, result, masks, boxes, scores


def save_result(result_image, input_image_path, output_base_dir='outputs'):
    """
    Save the result image to outputs/image_name_datetime/result.png
    
    Args:
        result_image: PIL Image to save
        input_image_path: Path to the input image (used to determine output directory name)
        output_base_dir: Base directory for outputs (default: 'outputs')
        
    Returns:
        str: Path where the result was saved
    """
    # Get the image name without extension
    image_name = Path(input_image_path).stem
    
    # Get current timestamp
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    
    # Create output directory: outputs/image_name_datetime/
    output_dir = Path(output_base_dir) / f"{image_name}_{timestamp}"
    output_dir.mkdir(parents=True, exist_ok=True)
    
    # Save result
    output_path = output_dir / "result.png"
    result_image.save(output_path)
    
    print(f"Result saved to: {output_path}")
    return str(output_path)

The load_model function loads the SAM 3 model and processor.

The process_image function accepts the processor, input image path, and text prompt as parameters.

It reads the image, sets the text prompt with the processor, and carries out the forward pass. We then extract the masks and use the overlay_masks function to obtain the final result.

The save_result function saves the resulting image in a timestamped directory.

Main Function

The main function combines everything and manages the execution flow.

def main():
    """Main execution function."""
    args = parse_arguments()
    
    # Configuration
    input_image_path = args.image_path
    prompt = args.prompt
    
    # Load model
    print("Loading SAM3 model...")
    model, processor = load_model()
    
    # Process image
    print(f"Processing image: {input_image_path}")
    print(f"Prompt: '{prompt}'")
    image, result, masks, boxes, scores = process_image(processor, input_image_path, prompt)
    
    # Save result
    output_path = save_result(result, input_image_path)
    
    # Display result    
    if args.visualize:
        plt.figure(figsize=(10, 8))
        plt.imshow(result)
        plt.axis('off')
        plt.title(f"SAM3 Segmentation - Prompt: '{prompt}'")
        plt.tight_layout()
        plt.show()
    
    print(f"Found {masks.shape[0]} masks")
    print(f"Scores: {scores}")


if __name__ == '__main__':
    main()

Before running the code for the first time, ensure that you have been granted access to SAM 3 on Hugging Face and have logged in to the Hugging Face Hub via the CLI.
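If you have not logged in before, the setup typically looks like the following (the exact CLI entry point can vary across huggingface_hub versions):

```shell
# Install the Hugging Face CLI if it is not already available.
pip install -U "huggingface_hub[cli]"

# Log in with an access token created at huggingface.co/settings/tokens.
huggingface-cli login
```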

Let’s run inference on an image.

python sam3_image.py --image-path input/image_1.jpg --prompt "cat"

The following is the result.

Figure 5. SAM 3 image inference on the image of a cat.

The result looks good when a single object is present.

Let’s try another image.

python sam3_image.py --image-path input/image_2.jpg --prompt "dogs"
Figure 6. SAM 3 image inference on multiple instances of dogs.

The model is able to successfully segment multiple objects belonging to the same category as well.

SAM 3 Video Inference

Now, let’s move on to the video inference part.

All the code for video inference using SAM 3 is present in the sam3_video.py file.

Imports and Argument Parser

import glob
import os
import cv2
import matplotlib.pyplot as plt
import argparse

from pathlib import Path
from datetime import datetime
from sam3.model_builder import build_sam3_video_predictor
from sam3.visualization_utils import (
    prepare_masks_for_visualization, 
    visualize_formatted_frame_output,
    save_masklet_video
)
from utils import propagate_in_video

def parse_arguments():
    parser = argparse.ArgumentParser(
        description='Process a video with SAM3 using a text prompt.'
    )
    parser.add_argument(
        '--video-path',
        dest='video_path',
        type=str,
        help='Path to video file (.mp4) or directory containing JPEG frames',
        default='input/video_1.mp4' 
    )
    parser.add_argument(
        '--prompt',
        dest='prompt',
        type=str,
        help='Text prompt for segmentation',
        default='cat'
    )
    parser.add_argument(
        '--frame-idx',
        dest='frame_idx',
        type=int,
        default=0,
        help='Frame index to add the prompt (default: 0)'
    )
    parser.add_argument(
        '--visualize',
        dest='visualize',
        action='store_true',
        help='Visualize the result'
    )
    return parser.parse_args()


os.makedirs('outputs', exist_ok=True)

This time, we import the build_sam3_video_predictor function to load the SAM 3 model for video inference. Also, we import functions for preparing the masks for visualization after inference, formatting the output, and saving the resulting video from visualization_utils.

The command line arguments remain almost the same, with one small change. We have an additional --frame-idx argument, which tells the pipeline which frame to select for initial segmentation.

Functions for Loading Model, Loading Video Frames, Inference, and Propagation

The following code contains all the essential functions for video inference.

def load_model():
    """Load the SAM3 video predictor model."""
    video_predictor = build_sam3_video_predictor(async_loading_frames=True)
    return video_predictor


def load_video_frames(video_path):
    """
    Load video frames for visualization.
    
    Args:
        video_path: Path to video file (.mp4) or directory containing JPEG frames
        
    Returns:
        list: List of video frames (numpy arrays in RGB format) or list of frame paths
    """
    # Load "video_frames_for_vis" for visualization purposes (they are not used by the model).
    if isinstance(video_path, str) and video_path.endswith(".mp4"):
        cap = cv2.VideoCapture(video_path)
        video_frames_for_vis = []
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            video_frames_for_vis.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        cap.release()
    else:
        video_frames_for_vis = glob.glob(os.path.join(video_path, "*.jpg"))
        try:
            # integer sort instead of string sort (so that e.g. "2.jpg" is before "11.jpg").
            video_frames_for_vis.sort(
                key=lambda p: int(os.path.splitext(os.path.basename(p))[0])
            )
        except ValueError:
            # fallback to lexicographic sort if the format is not "<frame_index>.jpg".
            print(
                f'frame names are not in "<frame_index>.jpg" format: {video_frames_for_vis[:5]=}, '
                f"falling back to lexicographic sort."
            )
            video_frames_for_vis.sort()
    
    return video_frames_for_vis


def process_video(video_predictor, video_path, prompt, frame_idx=0):
    """
    Process a video with SAM3 using a text prompt.
    
    Args:
        video_predictor: SAM3 video predictor instance
        video_path: Path to the video file or frame directory
        prompt: Text prompt for segmentation
        frame_idx: Frame index to add the prompt (default: 0)
        
    Returns:
        tuple: (session_id, initial_outputs, outputs_per_frame, video_fps)
    """
    # Read video for retrieving framerate.
    video = cv2.VideoCapture(video_path)
    video_fps = video.get(cv2.CAP_PROP_FPS)
    video.release()
    
    # Start a session.
    response = video_predictor.handle_request(
        request=dict(
            type='start_session',
            resource_path=video_path
        )
    )
    session_id = response['session_id']
    
    # Add text prompt on specified frame.
    response = video_predictor.handle_request(
        request=dict(
            type='add_prompt',
            session_id=session_id,
            frame_index=frame_idx,
            text=prompt
        )
    )
    initial_outputs = response['outputs']
    
    # Propagate the outputs from frame 0 to the end of the video.
    print(f"Propagating masks through video... {video_fps=}")
    outputs_per_frame = propagate_in_video(video_predictor, session_id)
    
    return session_id, initial_outputs, outputs_per_frame, video_fps


def save_result(video_frames, outputs_per_frame, input_video_path, 
                output_base_dir='outputs', alpha=0.5, fps=10):
    """
    Save the result video to outputs/video_name_datetime/result.mp4
    
    Args:
        video_frames: List of video frames
        outputs_per_frame: Per-frame segmentation outputs
        input_video_path: Path to the input video (used to determine output directory name)
        output_base_dir: Base directory for outputs (default: 'outputs')
        alpha: Transparency of mask overlay (default: 0.5)
        fps: Frames per second for output video (default: 10)
        
    Returns:
        str: Path where the result was saved
    """
    # Get the video name without extension.
    video_name = Path(input_video_path).stem
    
    # Get current timestamp.
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    
    # Create output directory: outputs/video_name_datetime/
    output_dir = Path(output_base_dir) / f"{video_name}_{timestamp}"
    output_dir.mkdir(parents=True, exist_ok=True)
    
    # Save result video.
    output_path = output_dir / "result.mp4"
    
    save_masklet_video(
        video_frames,
        outputs_per_frame,
        out_path=str(output_path),
        alpha=alpha,
        fps=fps,
    )
    
    print(f"Result saved to: {output_path}")
    return str(output_path)

In the above code block, the load_video_frames function loads the frames only for visualization purposes.

The processing of frames for inference happens in the process_video function using the video_predictor.handle_request method. The initial start_session request loads all the frames into GPU memory and starts the current session.

We obtain a session ID from that request and call the method again with an add_prompt request to pass the text prompt along with the frame index.

Next, we obtain the initial results and propagate them through the rest of the video.

Finally, we save the results using the save_result function. We use the save_masklet_video function from the SAM 3 visualization utilities.

The Main Function

The following is the main function.

def main():
    """Main execution function."""
    # Configuration
    args = parse_arguments()
    video_path = args.video_path
    prompt = args.prompt
    frame_idx = args.frame_idx
    fps = 10
    alpha = 0.5
    vis_frame_stride = 1
    
    # Load model.
    print("Loading SAM3 video predictor...")
    video_predictor = load_model()
    
    # Load video frames for visualization.
    print(f"Loading video frames from: {video_path}")
    video_frames = load_video_frames(video_path)
    print(f"Loaded {len(video_frames)} frames")
    
    # Process video.
    print(f"Processing video with prompt: '{prompt}' on frame {frame_idx}")
    session_id, initial_outputs, outputs_per_frame, video_fps = process_video(
        video_predictor, video_path, prompt, frame_idx
    )
    
    print(f"Generated outputs for {len(outputs_per_frame)} frames")
    
    # Save result.
    output_path = save_result(
        video_frames, 
        outputs_per_frame, 
        video_path,
        alpha=alpha,
        fps=video_fps
    )
    
    print(f"Video processing complete!")
    print(f"Session ID: {session_id}")
    
    # Optional: Visualize specific frames.
    plt.close('all')    
    if args.visualize:
        outputs_per_frame = prepare_masks_for_visualization(outputs_per_frame)
        for frame_idx in range(0, len(outputs_per_frame), vis_frame_stride):
            visualize_formatted_frame_output(
                frame_idx,
                video_frames,
                outputs_list=[outputs_per_frame],
                titles=['SAM 3 Dense Tracking outputs'],
                figsize=(6, 4),
            )


if __name__ == '__main__':
    main()

The above function combines everything and runs the video inference.

Video Inference Execution

There are a few things to keep in mind while carrying out video inference using SAM 3.

With a 10GB GPU, around 10 seconds of 480p video can be processed; with an 8GB GPU, around 3 seconds. We will also enable expandable GPU memory segments so that we can process videos without issues; otherwise, we may sometimes face an OOM error even though GPU memory is available.
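If you prefer to enable expandable segments from Python instead of the shell, a minimal sketch follows; note that it must run before PyTorch makes its first CUDA allocation, i.e., at the very top of the script:

```python
import os

# Set the allocator flag before any CUDA allocation happens.
# Equivalent to: export PYTORCH_ALLOC_CONF=expandable_segments:True
os.environ["PYTORCH_ALLOC_CONF"] = "expandable_segments:True"

print(os.environ["PYTORCH_ALLOC_CONF"])  # → expandable_segments:True
```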

The input directory contains several videos of different lengths. Here, we are experimenting with a 3-second video.

First, we enable expandable segments.

export PYTORCH_ALLOC_CONF=expandable_segments:True

Second, run the inference.

python sam3_video.py --video-path input/video_1_3sec.mp4 --prompt "cat"

The following is the result.

Video 1. SAM 3 video inference result. We are detecting, segmenting, and tracking a cat here using SAM 3.

The model is able to segment the cat accurately in all frames.

Summary and Conclusion

In this article, we covered a brief paper explanation of SAM 3, along with simple image and video inference. We can do much more: we have yet to cover visual prompts like boxes, points, and masks. In a future article, we will cover them along with a proper UI for a better user experience.

If you have any questions, thoughts, or suggestions, please leave them in the comment section. I will surely address them.

You can contact me using the Contact section. You can also find me on LinkedIn and X.
