Grounding DINO - Zero-Shot Object Detection

Grounding DINO is a zero-shot object detection model developed by IDEA Research that can detect objects in images using arbitrary text prompts.

Overview

Grounding DINO combines DINO, a Transformer-based object detector (DETR with Improved deNoising anchOr boxes), with grounded pre-training to enable open-vocabulary object detection. Key capabilities include:

Text-Prompted Detection - Detect objects using natural language descriptions
Zero-Shot Detection - No training required for new object classes
High Accuracy - Strong reported results on open-vocabulary detection benchmarks (the paper reports 52.5 AP on COCO zero-shot transfer)
Flexible Prompting - Accepts single words, phrases, or detailed descriptions

Best for Common Objects

Grounding DINO is most effective at identifying common objects (e.g., cars, people, dogs). It is less effective at identifying uncommon or highly specific objects (e.g., a specific type of car, a specific person).

License & Attribution

License: Apache 2.0
Source: IDEA Research
Paper: Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Pre-trained Model IDs

Pre-trained Grounding DINO weights are available through Roboflow; loading them requires a Roboflow API key.

| Model ID | Description |
| --- | --- |
| `grounding-dino` | Base model - balanced performance |

Supported Backends

| Backend | Extras Required |
| --- | --- |
| `torch` | `torch-cpu`, `torch-cu118`, `torch-cu124`, `torch-cu126`, `torch-cu128`, `torch-jp6-cu126` |

Roboflow Platform Compatibility

| Feature | Supported |
| --- | --- |
| Training | ❌ Not available for training |
| Upload Weights | ❌ Not supported |
| Serverless API (v2) | ✅ Deploy via hosted API |
| Workflows | ✅ Use in Workflows via Grounding DINO block |
| Edge Deployment (Jetson) | ❌ Not supported |
| Self-Hosting | ✅ Deploy with `inference-models` (GPU recommended) |

Usage Examples

Basic Object Detection

import cv2
import supervision as sv
from inference_models import AutoModel

# Load model (requires API key)
model = AutoModel.from_pretrained("grounding-dino", api_key="your_roboflow_api_key")

# Load image
image = cv2.imread("path/to/image.jpg")

# Detect objects with text prompts
predictions = model(image, ["person", "car", "dog"], conf_thresh=0.35)
detections = predictions[0].to_supervision()

# Annotate image
bounding_box_annotator = sv.BoxAnnotator()
annotated_image = bounding_box_annotator.annotate(image, detections)

# Save or display
cv2.imwrite("annotated.jpg", annotated_image)
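
The examples on this page pass the API key as a string literal. A minimal sketch of reading it from the environment instead is shown here. The helper `load_api_key` and the variable name `ROBOFLOW_API_KEY` are assumptions made for illustration and are not part of `inference_models`.

```python
import os


def load_api_key(env_var: str = "ROBOFLOW_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise if it is unset or blank.

    Both this helper and the default variable name are illustrative assumptions.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return api_key


# Intended use with the loader from the example above (not executed here):
# model = AutoModel.from_pretrained("grounding-dino", api_key=load_api_key())
```

Keeping the key out of source files avoids committing it to version control by accident.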

Using Phrase Descriptions

import cv2
from inference_models import AutoModel

# Load model
model = AutoModel.from_pretrained("grounding-dino", api_key="your_roboflow_api_key")
image = cv2.imread("path/to/image.jpg")

# Use detailed phrase descriptions
predictions = model(
    image,
    ["red car", "person wearing hat", "small dog"],
    conf_thresh=0.3
)

print(f"Found {len(predictions[0].xyxy)} objects")
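
To see which phrases matched, detections can be tallied per prompt. The sketch below works on plain lists and assumes that each detection carries an integer class ID that indexes into the prompt list in the order the prompts were passed. That mapping, and the helper name `count_by_prompt`, are assumptions to verify against the actual prediction object.

```python
from collections import Counter
from typing import Dict, Sequence


def count_by_prompt(class_ids: Sequence[int], prompts: Sequence[str]) -> Dict[str, int]:
    """Count detections per prompt; prompts with no detections map to 0."""
    tally = Counter(class_ids)
    return {prompt: tally.get(index, 0) for index, prompt in enumerate(prompts)}


prompts = ["red car", "person wearing hat", "small dog"]
example_class_ids = [0, 0, 2]  # made-up IDs standing in for real model output
print(count_by_prompt(example_class_ids, prompts))
```

A prompt that consistently counts zero is a candidate for rewording.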

Adjusting Detection Thresholds

import cv2
from inference_models import AutoModel

# Load model
model = AutoModel.from_pretrained("grounding-dino", api_key="your_roboflow_api_key")
image = cv2.imread("path/to/image.jpg")

# Adjust confidence threshold for more/fewer detections
predictions = model(
    image,
    ["person", "vehicle"],
    conf_thresh=0.25,  # Lower threshold = more detections
    text_thresh=0.25   # Text matching threshold
)

detections = predictions[0].to_supervision()
print(f"Detected {len(detections)} objects")
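
The effect of `conf_thresh` can be previewed offline before rerunning the model. The sketch below counts how many detections would survive each candidate threshold, given a list of confidence scores. The scores are made-up values, `sweep_thresholds` is a hypothetical helper, and it assumes a detection is kept when its score is greater than or equal to the threshold, which may differ from the model's exact comparison.

```python
from typing import Dict, Iterable, Sequence


def sweep_thresholds(scores: Sequence[float], thresholds: Iterable[float]) -> Dict[float, int]:
    """Map each threshold to the number of scores at or above it."""
    return {t: sum(score >= t for score in scores) for t in thresholds}


example_scores = [0.82, 0.41, 0.33, 0.27, 0.12]  # illustrative values only
for threshold, kept in sweep_thresholds(example_scores, [0.25, 0.35, 0.5]).items():
    print(f"conf_thresh={threshold}: {kept} detections kept")
```

One way to use this is to run the model once at a low threshold, collect the scores, and pick the cutoff from the sweep. It says nothing about `text_thresh`, which is a separate setting.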

Workflows Integration

Grounding DINO can be used in Roboflow Workflows for complex computer vision pipelines. The Grounding DINO block accepts text prompts and returns object detections that can be combined with other blocks.

Learn more: Workflows Documentation

Performance Tips

Use a GPU - CPU inference is possible with the `torch-cpu` extra, but a GPU is strongly recommended for acceptable speed
Optimize prompts - Use clear, specific descriptions for better results
Adjust thresholds - Experiment with `conf_thresh` and `text_thresh` for your use case
Batch processing - Process multiple images together when possible
Choose the right model - Only the base model (`grounding-dino`) is listed under Pre-trained Model IDs; a tiny variant, where available, trades accuracy for speed
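
For the batch-processing tip, a small chunking helper keeps batch sizes bounded. This is a sketch under stated assumptions: `chunked` is a hypothetical helper, the batch size of 4 is arbitrary, and whether the model accepts a list of images in a single call should be confirmed before relying on it.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


image_paths = [f"frame_{i:03d}.jpg" for i in range(10)]  # placeholder file names
for batch in chunked(image_paths, batch_size=4):
    # A real pipeline would load these images and pass them to the model here.
    print(len(batch), batch[0])
```

A bounded batch size helps keep GPU memory use predictable when the number of input images varies.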

Prompting Best Practices

✅ Use simple, common words: "car", "person", "dog"
✅ Be specific when needed: "red car", "person wearing hat"
✅ Use singular nouns: "car" instead of "cars"
❌ Avoid overly complex descriptions: "a blue sedan parked on the street"
❌ Avoid rare or technical terms: Use "car" instead of "automobile"
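
These guidelines can be turned into a rough pre-flight check. The sketch below is a heuristic and not part of any library: `check_prompt` is a hypothetical helper, the four-word limit is an arbitrary cutoff for "overly complex", and the plural test only looks for a trailing "s", so singular words such as "bus" will be flagged incorrectly.

```python
from typing import List

MAX_WORDS = 4  # arbitrary cutoff chosen for this sketch


def check_prompt(prompt: str) -> List[str]:
    """Return a list of warnings for a prompt; an empty list means no issues found."""
    warnings = []
    words = prompt.strip().split()
    if not words:
        return ["prompt is empty"]
    if len(words) > MAX_WORDS:
        warnings.append(f"more than {MAX_WORDS} words; consider a shorter phrase")
    last = words[-1].lower()
    # Naive plural check: flags a trailing "s" but not "ss" (e.g. "glass").
    if last.endswith("s") and not last.endswith("ss"):
        warnings.append(f"'{words[-1]}' may be plural; try the singular form")
    return warnings


for prompt in ["car", "cars", "a blue sedan parked on the street"]:
    print(repr(prompt), check_prompt(prompt))
```

Treat the warnings as prompts for a second look, not as hard rules.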