Moondream2 - Vision Language Model¶
Moondream2 is a compact vision-language model designed for efficient multimodal understanding with specialized capabilities for detection, captioning, and visual question answering.
Overview¶
Moondream2 is a lightweight VLM with unique capabilities:
- Object Detection - Detect objects through natural language queries
- Visual Question Answering - Answer questions about image content
- Image Captioning - Generate captions with adjustable length
- Point Detection - Locate specific objects and return coordinates
- Efficient Design - Small model size for faster inference
Device Compatibility
Moondream2 cannot run on Apple devices with MPS due to a bug in the original implementation. Use an NVIDIA GPU or an x86 CPU instead.
License & Attribution
License: Apache 2.0
Source: vikhyatk/moondream
Pre-trained Model IDs¶
Moondream2 pre-trained models are available and do not require a Roboflow API key.
| Model ID | Description |
|---|---|
| `moondream2` | Latest version - general purpose vision-language model |
You can also use fine-tuned models from Roboflow by specifying project/version as the model ID (requires API key).
Supported Backends¶
| Backend | Extras Required |
|---|---|
| `torch` | `torch-cpu`, `torch-cu118`, `torch-cu124`, `torch-cu126`, `torch-cu128` |
MPS Not Supported
Moondream2 does not support Apple MPS acceleration. Use CPU or CUDA only.
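Assuming the package follows the usual pip extras convention (the `inference-models` package name matches the import used in the examples below, and the extra names come from the table above; the exact syntax is an assumption, not confirmed by this page), installation might look like:

```shell
# CPU-only install (assumed extras syntax)
pip install "inference-models[torch-cpu]"

# CUDA 12.4 install
pip install "inference-models[torch-cu124]"
```

Pick the `torch-cuXXX` extra matching your installed CUDA toolkit version.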
Roboflow Platform Compatibility¶
| Feature | Supported |
|---|---|
| Training | ❌ Not supported |
| Upload Weights | ❌ Not supported |
| Serverless API (v2) | ⚠️ Limited support (not yet fully stable) |
| Workflows | ✅ Use in Workflows via Moondream2 block |
| Edge Deployment (Jetson) | ❌ Not supported |
| Self-Hosting | ✅ Deploy with inference-models (GPU recommended) |
Supported Tasks¶
Moondream2 supports multiple vision-language tasks through specialized methods:
| Task | Method | When to Use |
|---|---|---|
| Object Detection | `detect()` | Detect and locate objects by specifying class names |
| Visual Question Answering | `query()` | Answer questions about image content |
| Image Captioning | `caption()` | Generate descriptive captions with adjustable length |
| Point Detection | `point()` | Locate specific objects and return their coordinates |
Usage Examples¶
Object Detection¶
```python
import cv2
from inference_models import AutoModel

# Load model
model = AutoModel.from_pretrained("moondream2")

image = cv2.imread("path/to/image.jpg")

# Detect objects by class name
detections = model.detect(
    images=image,
    classes=["person", "car", "dog"],
    max_tokens=700
)

# Access detection results
for detection in detections[0]:
    print(f"Class: {detection.class_name}")
    print(f"Confidence: {detection.confidence}")
    print(f"Box: {detection.xyxy}")
```
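Detections can be overlaid on the image for quick inspection. The sketch below is illustrative, not part of the inference-models API: it assumes each box follows the `[x1, y1, x2, y2]` pixel layout of `xyxy` shown above, and draws rectangle edges with plain NumPy slicing so it has no extra dependencies.

```python
import numpy as np

def draw_boxes(image, boxes, color=(0, 255, 0), thickness=2):
    """Draw [x1, y1, x2, y2] pixel boxes onto a copy of an HxWx3 image."""
    out = image.copy()
    for x1, y1, x2, y2 in boxes:
        x1, y1, x2, y2 = map(int, (x1, y1, x2, y2))
        out[y1:y1 + thickness, x1:x2] = color  # top edge
        out[y2 - thickness:y2, x1:x2] = color  # bottom edge
        out[y1:y2, x1:x1 + thickness] = color  # left edge
        out[y1:y2, x2 - thickness:x2] = color  # right edge
    return out

# Demonstration on a blank canvas with one hypothetical detection box
canvas = np.zeros((100, 100, 3), dtype=np.uint8)
annotated = draw_boxes(canvas, [(10, 10, 50, 50)])
```

In practice you would pass `detection.xyxy` values collected from the loop above instead of the hypothetical box.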
Visual Question Answering¶
```python
import cv2
from inference_models import AutoModel

# Load model
model = AutoModel.from_pretrained("moondream2")

image = cv2.imread("path/to/image.jpg")

# Ask a question
answers = model.query(
    images=image,
    question="What is the person doing?",
    max_tokens=700
)

print(f"Answer: {answers[0]}")
```
Image Captioning¶
```python
import cv2
from inference_models import AutoModel

# Load model
model = AutoModel.from_pretrained("moondream2")

image = cv2.imread("path/to/image.jpg")

# Generate caption with adjustable length
captions = model.caption(
    images=image,
    length="normal",  # Options: "short", "normal", "long"
    max_tokens=700
)

print(f"Caption: {captions[0]}")
```
Point Detection¶
```python
import cv2
from inference_models import AutoModel

# Load model
model = AutoModel.from_pretrained("moondream2")

image = cv2.imread("path/to/image.jpg")

# Find specific object location
points = model.point(
    images=image,
    object="the red car",
    max_tokens=700
)

# Access point coordinates
for point in points[0]:
    print(f"Location: x={point.x}, y={point.y}")
    print(f"Confidence: {point.confidence}")
```
Using Fine-tuned Models¶
```python
import cv2
from inference_models import AutoModel

# Load your fine-tuned model from Roboflow
model = AutoModel.from_pretrained(
    "your-project/version",
    api_key="your_roboflow_api_key"
)

image = cv2.imread("path/to/image.jpg")

# Use any of the methods above
answers = model.query(
    images=image,
    question="your custom question",
    max_tokens=700
)

print(f"Answer: {answers[0]}")
```
Workflows Integration¶
Moondream2 can be used in Roboflow Workflows for complex computer vision pipelines.
Learn more: Workflows Documentation
Performance Tips¶
- Use CUDA GPU - Moondream2 benefits from GPU acceleration (MPS not supported)
- Adjust max_tokens - Default is 700; increase for more detailed responses
- Use specific prompts - Clear object names in `detect()` and `point()` yield better results
- Choose caption length - Use `"short"` for speed, `"long"` for detail
- Batch processing - Process multiple images by passing a list
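The batch-processing tip can be sketched as a simple chunking loop, so that very large image lists are passed to the model in bounded groups rather than all at once. The `chunked` helper below is illustrative (not part of the inference-models API), and the model call is shown as a comment because it requires a loaded model:

```python
def chunked(items, size):
    """Yield successive fixed-size slices of a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

paths = [f"frame_{i}.jpg" for i in range(10)]
batches = list(chunked(paths, 4))  # 3 batches: 4 + 4 + 2 images

# For each batch, read the images and pass them as a list, e.g.:
#   images = [cv2.imread(p) for p in batch]
#   captions = model.caption(images=images, length="short", max_tokens=700)
```

Smaller batch sizes trade throughput for lower peak memory use; tune the size to your GPU or CPU memory budget.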