SAM2 - Interactive Segmentation
Segment Anything Model 2 (SAM2) is Meta AI's next-generation foundation model for interactive image and video segmentation. It improves on SAM in accuracy and speed, and adds object tracking across video frames.
Overview
SAM2 provides advanced interactive segmentation with:
- Improved accuracy - Better mask quality compared to SAM
- Faster inference - Optimized architecture with better speed/quality tradeoff
- Video support - Track objects across video frames
- Image embedding - Efficient caching for multiple segmentations
- Multi-mask output - Generate multiple mask proposals
- Hiera architecture - Efficient hierarchical vision transformer
License
Apache 2.0
Pre-trained Model IDs
All SAM2 models require a Roboflow API key.
| Model Size | Model ID |
|---|---|
| Tiny | sam2/hiera_tiny |
| Small | sam2/hiera_small |
| Base+ | sam2/hiera_b_plus |
| Large | sam2/hiera_large |
Supported Backends
| Backend | Extras Required |
|---|---|
| torch | torch-cpu, torch-cu118, torch-cu124, torch-cu126, torch-cu128, torch-jp6-cu126 |
GPU Recommended
Interactive segmentation models work best on GPU. CPU inference will be significantly slower.
Roboflow Platform Compatibility
| Feature | Supported |
|---|---|
| Training | ❌ No custom training |
| Upload Weights | ❌ Not applicable |
| Serverless API (v2) | ✅ Deploy via hosted API |
| Workflows | ✅ Use in Workflows |
| Edge Deployment (Jetson) | ⚠️ Experimental (may fail on devices with limited VRAM) |
| Self-Hosting | ✅ Deploy with inference-models |
Usage Example
import cv2
import numpy as np
from inference_models import AutoModel
# Load model (requires API key)
model = AutoModel.from_pretrained("sam2/hiera_b_plus", api_key="your_api_key")
image = cv2.imread("path/to/image.jpg")
# Step 1: Embed the image (optional but recommended for multiple segmentations)
embeddings = model.embed_images(image)
# Step 2: Segment with point prompts
# Positive point (foreground) at (100, 150)
point_coords = np.array([[100, 150]])
point_labels = np.array([1]) # 1 = foreground, 0 = background
results = model.segment_images(
    embeddings=embeddings,
    point_coordinates=point_coords,
    point_labels=point_labels
)
# Access masks and scores
masks = results[0].masks # Shape: (num_masks, H, W)
scores = results[0].scores # Confidence scores for each mask
logits = results[0].logits # Low-resolution logits for refinement
# Use the best mask (highest score)
best_mask_idx = scores.argmax()
best_mask = masks[best_mask_idx]
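Once the best mask is selected, it can be applied to the image to isolate the object. The sketch below uses small synthetic NumPy arrays in place of real model output, so it runs without a model or an API key. With real results the mask tensor would first need to be converted to a NumPy array, and how to do that is left as an assumption here. `cut_out_object` is a hypothetical helper, not part of inference_models.

```python
import numpy as np


def cut_out_object(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Return a copy of `image` with every pixel outside `mask` set to zero.

    `image` has shape (H, W, 3); `mask` has shape (H, W) and is truthy on the object.
    """
    if image.shape[:2] != mask.shape:
        raise ValueError("mask and image must share the same height and width")
    # Broadcast the (H, W) mask over the channel axis.
    return np.where(mask.astype(bool)[..., None], image, 0).astype(image.dtype)


# Synthetic stand-ins for model output: 3 candidate masks on a 4x4 image.
image = np.full((4, 4, 3), 200, dtype=np.uint8)
masks = np.zeros((3, 4, 4), dtype=bool)
masks[1, 1:3, 1:3] = True           # candidate 1 covers the central 2x2 block
scores = np.array([0.20, 0.90, 0.50])

best_mask = masks[scores.argmax()]  # pick the highest-scoring candidate
cutout = cut_out_object(image, best_mask)
```

Only the central 2x2 block keeps its pixel values in `cutout`; everything else is zeroed.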
Output Format
Returns: List[SAM2Prediction] (one per image in batch)
Each SAM2Prediction contains:
- masks - Binary masks (torch.Tensor)
- scores - Confidence scores for each mask
- logits - Low-resolution logits for mask refinement
Prompting Options
Point Prompts:
# Positive points (foreground), given as (x, y) pixel coordinates
point_coords = np.array([[100, 150], [120, 160]])
point_labels = np.array([1, 1])
# Negative points (background)
point_coords = np.array([[300, 40]])
point_labels = np.array([0])
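Positive and negative points are often combined in a single prompt, for example to exclude a region that the model includes by mistake. The sketch below shows one way to assemble the two arrays. `build_point_prompt` is a hypothetical helper, not part of inference_models, and it assumes that mixed labels can be passed together in one `point_labels` array, as in the original SAM prompting scheme.

```python
import numpy as np


def build_point_prompt(positive, negative=()):
    """Combine foreground and background clicks into (coords, labels) arrays.

    Each click is an (x, y) pixel coordinate.
    Labels: 1 = foreground, 0 = background.
    """
    positive = list(positive)
    negative = list(negative)
    points = positive + negative
    if not points:
        raise ValueError("at least one point is required")
    coords = np.array(points, dtype=np.int64).reshape(-1, 2)
    labels = np.array([1] * len(positive) + [0] * len(negative), dtype=np.int64)
    return coords, labels


point_coords, point_labels = build_point_prompt(
    positive=[(100, 150), (120, 160)],
    negative=[(300, 40)],
)
```

The resulting arrays keep coordinates and labels aligned by index, so each row of `point_coords` has its label at the same position in `point_labels`.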
Box Prompts:
# Box format: [x_min, y_min, x_max, y_max]
boxes = np.array([[50, 50, 200, 200]])
results = model.segment_images(embeddings=embeddings, boxes=boxes)
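Annotation tools and datasets often store boxes as `[x, y, width, height]` rather than as corner coordinates. A minimal conversion sketch follows. `xywh_to_xyxy` is a hypothetical helper written for illustration, and its result can be wrapped in `np.array` before being passed as `boxes`.

```python
def xywh_to_xyxy(box):
    """Convert [x_min, y_min, width, height] to [x_min, y_min, x_max, y_max]."""
    x_min, y_min, width, height = box
    if width <= 0 or height <= 0:
        raise ValueError("box width and height must be positive")
    return [x_min, y_min, x_min + width, y_min + height]


# A 150x150 box whose top-left corner is at (50, 50).
box_xyxy = xywh_to_xyxy([50, 50, 150, 150])
```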
Mask Refinement:
# Use previous mask logits to refine segmentation
# previous_logits holds the logits returned by an earlier segment_images call
results = model.segment_images(
    embeddings=embeddings,
    point_coordinates=point_coords,
    point_labels=point_labels,
    mask_input=previous_logits
)
Batch Processing
# Process multiple images efficiently
images = [cv2.imread(f"image_{i}.jpg") for i in range(4)]
embeddings = model.embed_images(images)
# Segment all images with the same prompt
point_coords = [np.array([[100, 150]])] * 4
point_labels = [np.array([1])] * 4
results = model.segment_images(
    embeddings=embeddings,
    point_coordinates=point_coords,
    point_labels=point_labels
)
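The example above repeats one prompt for every image. When each image needs its own click, the per-image lists can be built from a list of coordinates, as sketched below. `prompts_for_batch` is a hypothetical helper, and the sketch assumes, as in the example above, that `segment_images` takes one array per image in the same order as the embeddings.

```python
import numpy as np


def prompts_for_batch(clicks):
    """Turn one (x, y) foreground click per image into per-image prompt lists."""
    point_coords = [np.array([[x, y]]) for x, y in clicks]
    point_labels = [np.array([1]) for _ in clicks]  # 1 = foreground
    return point_coords, point_labels


clicks = [(100, 150), (40, 60), (220, 90), (10, 300)]  # one click per image
point_coords, point_labels = prompts_for_batch(clicks)
```

Building a fresh array per image also avoids the aliasing that `[array] * 4` introduces, where all four list entries refer to the same array object.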
Performance Optimization with Caching
SAM2 supports two types of caching for significant performance improvements:
1. Image Embeddings Cache
The image embeddings cache is the most important optimization: it caches the compute-heavy image encoding step. When you segment the same image several times with different prompts, the embeddings are computed only once.
from inference_models import AutoModel
from inference_models.models.sam2.cache import Sam2ImageEmbeddingsInMemoryCache
import cv2
import numpy as np
# Create cache with size limit (number of images to cache)
embeddings_cache = Sam2ImageEmbeddingsInMemoryCache.init(
    size_limit=100,    # Cache up to 100 image embeddings
    send_to_cpu=True   # Move cached embeddings to CPU to save GPU memory
)
# Load model with cache
model = AutoModel.from_pretrained(
    "sam2/hiera_b_plus",
    api_key="your_api_key",
    sam2_image_embeddings_cache=embeddings_cache
)
image = cv2.imread("image.jpg")
# First call: computes embeddings (slow)
embeddings = model.embed_images(image, use_embeddings_cache=True)
# Refine with different prompts - embeddings reused from cache (fast!)
for point in [(100, 150), (200, 250), (300, 350)]:
    results = model.segment_images(
        embeddings=embeddings,
        point_coordinates=np.array([[point[0], point[1]]]),
        point_labels=np.array([1])
    )
    # Process results...
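To make the effect of `size_limit` concrete, the sketch below implements a tiny size-limited cache and counts how often a stand-in encoder runs. It is not the implementation of `Sam2ImageEmbeddingsInMemoryCache`: the eviction policy (least recently used), the hash-based key, and the `fake_encoder` function are all assumptions chosen for illustration.

```python
import hashlib
from collections import OrderedDict


class TinyEmbeddingsCache:
    """Size-limited cache keyed by a hash of the image bytes (illustrative only)."""

    def __init__(self, size_limit: int):
        self.size_limit = size_limit
        self._entries = OrderedDict()

    def get_or_compute(self, image_bytes: bytes, encoder):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._entries:
            self._entries.move_to_end(key)      # mark as most recently used
            return self._entries[key]
        embedding = encoder(image_bytes)        # expensive step, runs only on a miss
        self._entries[key] = embedding
        if len(self._entries) > self.size_limit:
            self._entries.popitem(last=False)   # evict the least recently used entry
        return embedding


stats = {"encoder_calls": 0}


def fake_encoder(image_bytes: bytes) -> int:
    """Stand-in for the image encoder; it only counts how often it is invoked."""
    stats["encoder_calls"] += 1
    return len(image_bytes)


cache = TinyEmbeddingsCache(size_limit=2)
for _ in range(3):                               # three prompts on the same image
    cache.get_or_compute(b"image-a", fake_encoder)
calls_for_same_image = stats["encoder_calls"]    # the encoder ran only once

cache.get_or_compute(b"image-b", fake_encoder)
cache.get_or_compute(b"image-c", fake_encoder)   # evicts image-a (size_limit=2)
cache.get_or_compute(b"image-a", fake_encoder)   # miss again, so it is recomputed
```

The same trade-off applies when choosing a real `size_limit`: a value smaller than the number of images being revisited causes repeated encoding, while a larger value costs more memory.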
2. Low-Resolution Masks Cache
Caches the low-resolution mask logits from previous segmentations for iterative refinement. SAM2's cache is more sophisticated than SAM's, storing multiple mask variants per image.
from inference_models.models.sam2.cache import Sam2LowResolutionMasksInMemoryCache
# Create mask cache
masks_cache = Sam2LowResolutionMasksInMemoryCache.init(
    size_limit=500,  # Cache up to 500 mask logits
    send_to_cpu=True
)
# Load model with both caches
model = AutoModel.from_pretrained(
    "sam2/hiera_b_plus",
    api_key="your_api_key",
    sam2_image_embeddings_cache=embeddings_cache,
    sam2_low_resolution_masks_cache=masks_cache
)
# First segmentation - mask logits cached automatically
results = model.segment_images(
    image,
    point_coordinates=np.array([[100, 150]]),
    point_labels=np.array([1]),
    use_mask_input_cache=True
)
# Refinement - uses cached mask logits as input
refined_results = model.segment_images(
    image,
    point_coordinates=np.array([[100, 150], [120, 160]]),  # Add more points
    point_labels=np.array([1, 1]),
    use_mask_input_cache=True  # Automatically uses cached logits
)
Cache Parameters
| Parameter | Description | Default |
|---|---|---|
| size_limit | Maximum number of entries to cache | Required |
| send_to_cpu | Move cached data to CPU to save GPU memory | True |
| use_embeddings_cache | Enable embeddings cache lookup/save | True |
| use_mask_input_cache | Enable mask logits cache lookup/save | True |
Performance Impact
- Without cache: Each segmentation requires full image encoding (compute-heavy)
- With embeddings cache: Subsequent segmentations on same image are significantly faster
- Major speedup for interactive annotation workflows where you refine prompts on the same image
- SAM2 is generally faster than SAM for the same quality level
Use Cases
SAM2 is ideal for:
- ✅ Interactive annotation - Quickly segment objects with minimal user input
- ✅ Data labeling - Accelerate dataset creation with point/box prompts
- ✅ Video object tracking - Track and segment objects across video frames
- ✅ Object isolation - Extract specific objects from images
- ✅ Mask refinement - Iteratively improve segmentation quality
- ✅ Zero-shot segmentation - Segment novel objects without training