Moondream2 - Vision Language Model

Moondream2 is a compact vision-language model designed for efficient multimodal understanding with specialized capabilities for detection, captioning, and visual question answering.

Overview

Moondream2 is a lightweight VLM that provides the following capabilities:

Object Detection - Detect objects through natural language queries
Visual Question Answering - Answer questions about image content
Image Captioning - Generate captions with adjustable length
Point Detection - Locate specific objects and return coordinates
Efficient Design - Small model size for faster inference

Device Compatibility

Moondream2 cannot run on the Apple MPS backend because of a bug in the original implementation. Use an NVIDIA GPU or an x86 CPU instead.

License & Attribution

License: Apache 2.0
Source: vikhyatk/moondream

Pre-trained Model IDs

Moondream2 pre-trained models are available and do not require a Roboflow API key.

| Model ID | Description |
| --- | --- |
| `moondream2` | Latest version - general purpose vision-language model |

You can also use fine-tuned models from Roboflow by specifying project/version as the model ID (requires API key).
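
The distinction between the two kinds of model ID can be captured in a small helper. This is a minimal sketch, not part of `inference-models`: `needs_api_key`, `loader_kwargs`, and the `ROBOFLOW_API_KEY` environment variable name are assumptions made for illustration. It treats any ID outside the pre-trained list as a fine-tuned `project/version` ID.

```python
import os

# The only pre-trained ID listed in the table above.
PRETRAINED_IDS = {"moondream2"}


def needs_api_key(model_id: str) -> bool:
    """Return True for IDs that are not in the pre-trained list.

    Assumption: anything else is a Roboflow project/version ID.
    """
    return model_id not in PRETRAINED_IDS


def loader_kwargs(model_id: str, env_var: str = "ROBOFLOW_API_KEY") -> dict:
    """Build the extra keyword arguments for loading `model_id`.

    The environment variable name is a hypothetical default, not something
    this page specifies.
    """
    if not needs_api_key(model_id):
        return {}
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(
            f"{model_id!r} looks like a fine-tuned model; set {env_var} first"
        )
    return {"api_key": api_key}


print(loader_kwargs("moondream2"))
```

The result could then be unpacked into the loader call, for example `AutoModel.from_pretrained(model_id, **loader_kwargs(model_id))`, which keeps the key out of source code.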

Supported Backends

| Backend | Extras Required |
| --- | --- |
| `torch` | `torch-cpu`, `torch-cu118`, `torch-cu124`, `torch-cu126`, `torch-cu128` |

MPS Not Supported

Moondream2 does not support Apple MPS acceleration. Use CPU or CUDA only.
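
One way to honor the CPU-or-CUDA restriction is to choose the device explicitly rather than rely on automatic selection. The sketch below is illustrative only: `cuda_available` and `pick_device` are hypothetical helpers, and this page does not show how a device is passed to the loader, so the example stops at choosing the device name.

```python
def cuda_available() -> bool:
    """Report whether PyTorch can see a CUDA device.

    Returns False when torch is not installed at all.
    """
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


def pick_device(has_cuda: bool) -> str:
    """Prefer CUDA, otherwise fall back to CPU.

    "mps" is deliberately never returned, even on Apple hardware.
    """
    return "cuda" if has_cuda else "cpu"


device = pick_device(cuda_available())
print(f"Selected device: {device}")
```

On an Apple Silicon machine this selects `cpu`, which matches the restriction above.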

Roboflow Platform Compatibility

| Feature | Supported |
| --- | --- |
| Training | ❌ Not supported |
| Upload Weights | ❌ Not supported |
| Serverless API (v2) | ⚠️ Limited support (not yet fully stable) |
| Workflows | ✅ Use in Workflows via Moondream2 block |
| Edge Deployment (Jetson) | ❌ Not supported |
| Self-Hosting | ✅ Deploy with `inference-models` (GPU recommended) |

Supported Tasks

Moondream2 supports multiple vision-language tasks through specialized methods:

| Task | Method | When to Use |
| --- | --- | --- |
| Object Detection | `detect()` | Detect and locate objects by specifying class names |
| Visual Question Answering | `query()` | Answer questions about image content |
| Image Captioning | `caption()` | Generate descriptive captions with adjustable length |
| Point Detection | `point()` | Locate specific objects and return their coordinates |

Usage Examples

Object Detection

import cv2
from inference_models import AutoModel

# Load model
model = AutoModel.from_pretrained("moondream2")
image = cv2.imread("path/to/image.jpg")

# Detect objects by class name
detections = model.detect(
    images=image,
    classes=["person", "car", "dog"],
    max_tokens=700
)

# Access detection results
for detection in detections[0]:
    print(f"Class: {detection.class_name}")
    print(f"Confidence: {detection.confidence}")
    print(f"Box: {detection.xyxy}")
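
Detection results often need filtering before use. The following is a minimal sketch assuming each detection exposes the `class_name`, `confidence`, and `xyxy` attributes read in the loop above. `StubDetection`, `keep_confident`, and `box_area` are hypothetical names introduced for illustration, and the stub stands in for real model output so the snippet runs without loading the model.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class StubDetection:
    """Hypothetical stand-in with the three attributes used above."""
    class_name: str
    confidence: float
    xyxy: Tuple[float, float, float, float]


def keep_confident(detections: Iterable, min_confidence: float = 0.5) -> List:
    """Keep detections whose confidence is at least `min_confidence`."""
    return [d for d in detections if float(d.confidence) >= min_confidence]


def box_area(xyxy) -> float:
    """Area of an (x1, y1, x2, y2) box; degenerate boxes give 0."""
    x1, y1, x2, y2 = (float(v) for v in xyxy)
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


sample = [
    StubDetection("person", 0.9, (10, 20, 110, 220)),
    StubDetection("dog", 0.3, (0, 0, 50, 40)),
]
for det in keep_confident(sample):
    print(det.class_name, box_area(det.xyxy))
```

The 0.5 threshold is an arbitrary illustrative default; a suitable value depends on the images and classes involved.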

Visual Question Answering

import cv2
from inference_models import AutoModel

# Load model
model = AutoModel.from_pretrained("moondream2")
image = cv2.imread("path/to/image.jpg")

# Ask a question
answers = model.query(
    images=image,
    question="What is the person doing?",
    max_tokens=700
)
print(f"Answer: {answers[0]}")

Image Captioning

import cv2
from inference_models import AutoModel

# Load model
model = AutoModel.from_pretrained("moondream2")
image = cv2.imread("path/to/image.jpg")

# Generate caption with adjustable length
captions = model.caption(
    images=image,
    length="normal",  # Options: "short", "normal", "long"
    max_tokens=700
)
print(f"Caption: {captions[0]}")

Point Detection

import cv2
from inference_models import AutoModel

# Load model
model = AutoModel.from_pretrained("moondream2")
image = cv2.imread("path/to/image.jpg")

# Find specific object location
points = model.point(
    images=image,
    object="the red car",
    max_tokens=700
)

# Access point coordinates
for point in points[0]:
    print(f"Location: x={point.x}, y={point.y}")
    print(f"Confidence: {point.confidence}")
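
When several points come back for one query, a caller may want only the most confident one. This is a sketch under the assumption that each point exposes the `x`, `y`, and `confidence` attributes printed above. `StubPoint` and `best_point` are hypothetical names, and the stub replaces real model output so the snippet is self-contained.

```python
from typing import Iterable, NamedTuple, Optional


class StubPoint(NamedTuple):
    """Hypothetical stand-in with the attributes printed above."""
    x: float
    y: float
    confidence: float


def best_point(points: Iterable) -> Optional[object]:
    """Return the point with the highest confidence, or None if empty."""
    candidates = list(points)
    if not candidates:
        return None
    return max(candidates, key=lambda p: float(p.confidence))


sample_points = [StubPoint(120.0, 80.0, 0.42), StubPoint(310.0, 95.0, 0.87)]
top = best_point(sample_points)
if top is not None:
    print(f"Best location: x={top.x}, y={top.y}")
```

Returning `None` for an empty result keeps the "object not found" case explicit instead of raising an error.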

Using Fine-tuned Models

import cv2
from inference_models import AutoModel

# Load your fine-tuned model from Roboflow
model = AutoModel.from_pretrained(
    "your-project/version",
    api_key="your_roboflow_api_key"
)

image = cv2.imread("path/to/image.jpg")

# Use any of the methods above
answers = model.query(
    images=image,
    question="your custom question",
    max_tokens=700
)
print(f"Answer: {answers[0]}")

Workflows Integration

Moondream2 can be used in Roboflow Workflows for complex computer vision pipelines.

Learn more: Workflows Documentation

Performance Tips

Use a CUDA GPU - Moondream2 benefits from GPU acceleration (MPS is not supported)
Adjust `max_tokens` - The default is 700; increase it for more detailed responses
Use specific prompts - Clear object names in `detect()` and `point()` yield better results
Choose caption length - Use `"short"` for speed, `"long"` for detail
Batch processing - Process multiple images by passing a list
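
For the batch-processing tip, a long list of images is usually split into fixed-size groups so that each call receives a manageable list. The sketch below shows only that grouping step. `chunked` is a hypothetical helper, the file names are placeholders, and the batch size of 2 is arbitrary; this page does not state a recommended or maximum batch size.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive groups of at most `size` items, in order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Placeholder file names; real code would list actual image files.
paths = [f"frame_{i}.jpg" for i in range(5)]
for batch in chunked(paths, 2):
    print(batch)
```

Each group of paths would then be read with `cv2.imread` and the resulting list passed as the `images` argument of `detect()`, `query()`, `caption()`, or `point()`.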