OWLv2 - Open-Vocabulary Object Detection¶
OWLv2 (Open-World Localization v2) is an open-vocabulary object detection model developed by Google Research that can detect objects using text prompts or visual examples.
Overview¶
OWLv2 is a vision transformer-based model that enables zero-shot and few-shot object detection. Key capabilities include:
- Text-Prompted Detection - Detect objects using natural language descriptions
- Visual Example Detection - Detect objects by providing visual examples (few-shot learning)
- Zero-Shot Detection - No training required for new object classes
- Open Vocabulary - Works with arbitrary object classes
Visual Examples Recommended
The current implementation in inference-models is optimized for visual example-based detection (few-shot learning). Text-only prompting may have limited support.
License & Attribution
License: Apache 2.0
Source: Google Research
Paper: Scaling Open-Vocabulary Object Detection
Pre-trained Model IDs¶
The following pre-trained OWLv2 models are available; loading them requires a Roboflow API key.
| Model ID | Description |
|---|---|
| `google/owlv2-base-patch14-ensemble` | Base ensemble model - balanced performance |
| `google/owlv2-large-patch14-ensemble` | Large ensemble model - higher accuracy, slower |
Supported Backends¶
| Backend | Extras Required |
|---|---|
| `hf` (Hugging Face) | Included in base installation |
Roboflow Platform Compatibility¶
| Feature | Supported |
|---|---|
| Training | ❌ Not available for training |
| Upload Weights | ❌ Not supported |
| Serverless API (v2) | ✅ Deploy via hosted API |
| Workflows | ✅ Use in Workflows via OWLv2 block |
| Edge Deployment (Jetson) | ❌ Not supported |
| Self-Hosting | ✅ Deploy with inference-models (GPU recommended) |
Usage Examples¶
Text-Based Detection (Open Vocabulary)¶
```python
import cv2
import supervision as sv

from inference_models import AutoModel

# Load model (requires API key)
model = AutoModel.from_pretrained("google/owlv2-base-patch14-ensemble", api_key="your_roboflow_api_key")

# Load image
image = cv2.imread("path/to/image.jpg")

# Detect objects with text prompts
predictions = model(image, classes=["dog", "person", "car"])
detections = predictions[0].to_supervision()

# Annotate image
bounding_box_annotator = sv.BoxAnnotator()
annotated_image = bounding_box_annotator.annotate(image, detections)
cv2.imwrite("annotated.jpg", annotated_image)
```
Few-Shot Detection with Visual Examples¶
```python
import cv2

from inference_models import AutoModel
from inference_models.models.owlv2.entities import ReferenceExample, ReferenceBoundingBox

# Load model
model = AutoModel.from_pretrained("google/owlv2-base-patch14-ensemble", api_key="your_roboflow_api_key")

# Load images
image = cv2.imread("path/to/image.jpg")
reference_image = cv2.imread("path/to/reference.jpg")

# Define reference examples with bounding boxes
reference_examples = [
    ReferenceExample(
        image=reference_image,
        boxes=[
            ReferenceBoundingBox(x=100, y=50, w=80, h=80, cls="logo"),
            ReferenceBoundingBox(x=300, y=200, w=120, h=150, cls="product"),
        ],
    )
]

# Detect similar objects
predictions = model.infer_with_reference_examples(
    image,
    reference_examples=reference_examples,
    confidence_threshold=0.99,
    iou_threshold=0.3,
)
detections = predictions[0].to_supervision()
print(f"Found {len(detections)} objects")
```
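The `iou_threshold` above controls how aggressively overlapping candidate boxes are merged: pairs whose intersection-over-union (IoU) exceeds the threshold are treated as duplicates. As a rough illustration of what IoU measures (plain Python, boxes given as `(x, y, w, h)` tuples matching the `ReferenceBoundingBox` convention; this is not code from `inference-models`):

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    # Overlap rectangle (zero area if the boxes are disjoint)
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

# Identical boxes overlap fully; disjoint boxes not at all
print(iou((0, 0, 10, 10), (0, 0, 10, 10)))  # 1.0
print(iou((0, 0, 10, 10), (20, 20, 5, 5)))  # 0.0
```

A lower `iou_threshold` (such as the `0.3` used above) suppresses duplicates more aggressively; raise it if distinct nearby objects are being merged into one detection.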
Using Embeddings Cache for Better Performance¶
```python
import cv2

from inference_models import AutoModel
from inference_models.models.owlv2.cache import (
    OwlV2ClassEmbeddingsCache,
    OwlV2ImageEmbeddingsCache,
)

# Create cache instances
class_embeddings_cache = OwlV2ClassEmbeddingsCache()
image_embeddings_cache = OwlV2ImageEmbeddingsCache()

# Load model with caching enabled
model = AutoModel.from_pretrained(
    "google/owlv2-base-patch14-ensemble",
    api_key="your_roboflow_api_key",
    owlv2_class_embeddings_cache=class_embeddings_cache,
    owlv2_images_embeddings_cache=image_embeddings_cache,
)

# First inference - embeddings are computed and cached
image = cv2.imread("path/to/image.jpg")
predictions = model(image, classes=["dog", "person"])

# Subsequent inferences with the same image/classes are faster
predictions = model(image, classes=["dog", "person"])  # Uses cached embeddings
```
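The caches follow a standard memoization pattern: embeddings are stored under a key derived from the input, so the expensive encoder runs once per distinct image or class list. The exact keying scheme is internal to `inference-models`; a toy stdlib-only sketch of the pattern (names and structure are illustrative, not the real cache implementation):

```python
import hashlib


class ToyEmbeddingsCache:
    """Memoization cache keyed on a hash of the input bytes.

    Illustration only - the real OwlV2*EmbeddingsCache classes live in
    inference_models.models.owlv2.cache and differ in detail.
    """

    def __init__(self):
        self._store = {}
        self.hits = 0

    def get_or_compute(self, payload: bytes, compute):
        key = hashlib.sha256(payload).hexdigest()
        if key in self._store:
            self.hits += 1  # embedding already computed for this input
        else:
            self._store[key] = compute(payload)  # expensive step runs once per key
        return self._store[key]


cache = ToyEmbeddingsCache()
embed = lambda data: [len(data)]  # stand-in for the real embedding model
cache.get_or_compute(b"image-bytes", embed)
cache.get_or_compute(b"image-bytes", embed)  # served from cache
print(cache.hits)  # 1
```

This is why repeated inference over the same image or class list is markedly faster: only the first call pays the encoding cost.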
Workflows Integration¶
OWLv2 can be used in Roboflow Workflows for complex computer vision pipelines. The OWLv2 block supports both text prompts and visual examples for flexible object detection.
Learn more: Workflows Documentation
Performance Tips¶
- Use a GPU - OWLv2 requires a GPU for acceptable inference speed
- Start with high confidence for few-shot - When using reference examples, start with `confidence_threshold=0.99` and adjust down if needed
- Provide good examples - For few-shot learning, provide clear, representative bounding box examples
- Choose the right model - Use `base-patch14-ensemble` for speed, `large-patch14-ensemble` for accuracy
- Cache embeddings - The model caches embeddings for faster repeated inference
Few-Shot Learning Best Practices¶
When using visual examples:
- ✅ Provide clear examples: Use well-lit, unoccluded objects
- ✅ Multiple examples help: Provide 2-3 examples per class when possible
- ✅ Consistent examples: Use examples similar to target objects
- ❌ Avoid poor quality: Don't use blurry or partially visible examples
- ❌ Avoid extreme variations: Keep examples consistent in appearance
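Some of these checks can be enforced programmatically before inference. The helper below is a hypothetical sketch (not part of `inference-models`) that validates `(x, y, w, h, cls)` reference boxes against the reference image's shape and warns when a class has too few examples:

```python
from collections import Counter


def check_reference_boxes(image_shape, boxes, min_examples_per_class=2):
    """Validate (x, y, w, h, cls) reference boxes for an image of shape
    (height, width). Returns a list of human-readable warning strings."""
    height, width = image_shape
    warnings = []
    counts = Counter()
    for x, y, w, h, cls in boxes:
        counts[cls] += 1
        # Partially visible examples hurt few-shot quality
        if x < 0 or y < 0 or x + w > width or y + h > height:
            warnings.append(f"box for '{cls}' extends outside the image")
    for cls, n in counts.items():
        if n < min_examples_per_class:
            warnings.append(f"class '{cls}' has only {n} example(s); 2-3 help")
    return warnings


# A logo box that spills past the right edge of a 480x640 image
print(check_reference_boxes((480, 640), [(600, 50, 80, 80, "logo")]))
```

Running such a check before calling `infer_with_reference_examples` catches out-of-bounds or under-represented examples early, rather than debugging poor detections afterwards.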
Common Use Cases¶
- Product Detection - Detect products using example images
- Logo Detection - Find logos by providing reference examples
- Custom Object Detection - Detect specialized objects without training
- Prototype Development - Quickly test detection ideas before full training