# CLIP - Embeddings Model
CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on image-text pairs that can generate embeddings for both images and text in a shared vector space, enabling powerful zero-shot classification and similarity search.
## Overview
CLIP is a versatile embeddings model with multiple capabilities:
- **Image Embeddings** - Generate vector representations of images
- **Text Embeddings** - Generate vector representations of text
- **Similarity Comparison** - Compare images and text in a shared embedding space
- **Zero-shot Classification** - Classify images without task-specific training
- **Semantic Search** - Find images based on text descriptions
**License & Attribution**

- **License:** MIT
- **Source:** OpenAI CLIP
- **Paper:** Learning Transferable Visual Models From Natural Language Supervision
## Pre-trained Model IDs

The following pre-trained CLIP models are available and do not require a Roboflow API key.
| Model ID | Description |
|---|---|
| `clip/RN50` | ResNet-50 backbone - fast inference |
| `clip/RN101` | ResNet-101 backbone - better accuracy |
| `clip/RN50x4` | ResNet-50 4x - higher capacity |
| `clip/RN50x16` | ResNet-50 16x - very high capacity |
| `clip/RN50x64` | ResNet-50 64x - highest capacity ResNet |
| `clip/ViT-B-16` | Vision Transformer Base with 16x16 patches |
| `clip/ViT-B-32` | Vision Transformer Base with 32x32 patches - balanced performance |
| `clip/ViT-L-14` | Vision Transformer Large with 14x14 patches - high accuracy |
| `clip/ViT-L-14-336px` | Vision Transformer Large with 336px input - highest accuracy |
## Supported Backends
| Backend | Extras Required |
|---|---|
| `torch` | `torch-cpu`, `torch-cu118`, `torch-cu124`, `torch-cu126`, `torch-cu128`, `torch-jp6-cu126` |
| `onnx` | `onnx-cpu`, `onnx-gpu` |
## Roboflow Platform Compatibility
| Feature | Supported |
|---|---|
| Training | ❌ Not supported |
| Upload Weights | ❌ Not supported |
| Serverless API (v2) | ✅ Available for embeddings and comparison |
| Workflows | ✅ Use in Workflows via CLIP block |
| Edge Deployment (Jetson) | ✅ Supported with appropriate backend |
| Self-Hosting | ✅ Deploy with inference-models |
## Supported Tasks
CLIP supports multiple embedding and comparison tasks:
| Task | Method | When to Use |
|---|---|---|
| Image Embeddings | `embed_image()` | Generate vector representations of images for similarity search or clustering |
| Text Embeddings | `embed_text()` | Generate vector representations of text descriptions |
| Similarity Comparison | `compare()` | Compare images with text or other images to find semantic similarity |
## Usage Examples

### Image Embeddings

```python
import cv2

from inference_models import AutoModel

# Load model
model = AutoModel.from_pretrained("clip/ViT-B-32")

image = cv2.imread("path/to/image.jpg")

# Generate image embedding
embedding = model.embed_image(image)
print(f"Embedding shape: {embedding.shape}")
```
### Text Embeddings

```python
from inference_models import AutoModel

# Load model
model = AutoModel.from_pretrained("clip/ViT-B-32")

# Generate text embedding
text_embedding = model.embed_text("a photo of a cat")
print(f"Text embedding shape: {text_embedding.shape}")
```
### Image-Text Similarity

```python
import cv2

from inference_models import AutoModel

# Load model
model = AutoModel.from_pretrained("clip/ViT-B-32")

image = cv2.imread("path/to/image.jpg")

# Compare image with text descriptions
similarity = model.compare(
    subject=image,
    prompt=["a photo of a cat", "a photo of a dog", "a photo of a bird"],
    subject_type="image",
    prompt_type="text"
)
print(f"Similarities: {similarity}")
```
### Zero-shot Classification

```python
import cv2

from inference_models import AutoModel

# Load model
model = AutoModel.from_pretrained("clip/ViT-B-32")

image = cv2.imread("path/to/image.jpg")

# Classify the image by comparing it against one text prompt per class
classes = ["cat", "dog", "bird", "car", "tree"]
prompts = [f"a photo of a {cls}" for cls in classes]

similarities = model.compare(
    subject=image,
    prompt=prompts,
    subject_type="image",
    prompt_type="text"
)

# Get the most similar class
best_match_idx = similarities.index(max(similarities))
print(f"Predicted class: {classes[best_match_idx]}")
print(f"Similarity score: {similarities[best_match_idx]:.2f}")
```
### Batch Processing

```python
import cv2

from inference_models import AutoModel

# Load model
model = AutoModel.from_pretrained("clip/ViT-B-32")

# Load multiple images
images = [cv2.imread(f"path/to/image{i}.jpg") for i in range(5)]

# Generate embeddings for all images in a single call
embeddings = model.embed_images(images)
print(f"Batch embeddings shape: {embeddings.shape}")
```
## Workflows Integration
CLIP can be used in Roboflow Workflows for complex computer vision pipelines, including zero-shot classification and semantic search.
Learn more: Workflows Documentation
## Performance Tips
- **Choose the right model** - ViT models are generally more accurate, while ResNet models are faster
- **Use ViT-B-32 for balance** - Good trade-off between speed and accuracy
- **Batch processing** - Process multiple images together for better throughput
- **Use the ONNX backend** - Often faster than PyTorch for inference
- **Normalize embeddings** - Compare embeddings with cosine similarity; see the sketch below this list
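As a concrete illustration of the last tip, a minimal cosine-similarity helper in NumPy (a sketch, assuming `embed_image()` and `embed_text()` return vectors that can be flattened to 1-D):

```python
import cv2
import numpy as np

from inference_models import AutoModel

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=np.float32).reshape(-1)
    b = np.asarray(b, dtype=np.float32).reshape(-1)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

model = AutoModel.from_pretrained("clip/ViT-B-32")

image_embedding = model.embed_image(cv2.imread("path/to/image.jpg"))
text_embedding = model.embed_text("a photo of a cat")

print(f"Image-text cosine similarity: {cosine_similarity(image_embedding, text_embedding):.3f}")
```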