SmolVLM2 - Vision Language Model¶
SmolVLM2 is a compact and efficient vision-language model developed by HuggingFace that provides strong multimodal understanding capabilities in a small package.
Overview¶
SmolVLM2 is a lightweight VLM designed for efficiency while maintaining strong performance:
- Visual Question Answering - Answer questions about image content
- Image Captioning - Generate descriptive captions for images
- Efficient Inference - Smaller model size enables faster inference
- Multi-image Support - Process multiple images in a single prompt
- General Understanding - Handle diverse vision-language tasks
GPU Recommended
SmolVLM2 works best with GPU acceleration. CPU inference may be slow.
License & Attribution
License: Apache 2.0
Source: HuggingFace
Pre-trained Model IDs¶
SmolVLM2 pre-trained models are available and do not require a Roboflow API key.
| Model ID | Description |
|---|---|
smolvlm2 |
2B parameter instruction-tuned model for general vision-language tasks |
You can also use fine-tuned models from Roboflow by specifying project/version as the model ID (requires API key).
Supported Backends¶
| Backend | Extras Required |
|---|---|
torch |
torch-cpu, torch-cu118, torch-cu124, torch-cu126, torch-cu128, torch-jp6-cu126 |
Roboflow Platform Compatibility¶
| Feature | Supported |
|---|---|
| Training | ✅ LoRA fine-tuning only |
| Upload Weights | ✅ Upload fine-tuned models |
| Serverless API (v2) | ⚠️ Limited support (not yet fully stable) |
| Workflows | ✅ Use in Workflows via SmolVLM2 block |
| Edge Deployment (Jetson) | ❌ Not supported |
| Self-Hosting | ✅ Deploy with inference-models (GPU recommended) |
Usage Examples¶
Visual Question Answering¶
import cv2
from inference_models import AutoModel
# Load model
model = AutoModel.from_pretrained("smolvlm2")
image = cv2.imread("path/to/image.jpg")
# Ask a question
answers = model.prompt(
images=image,
prompt="What is in this image?",
max_new_tokens=400
)
print(f"Answer: {answers[0]}")
Image Captioning¶
import cv2
from inference_models import AutoModel
# Load model
model = AutoModel.from_pretrained("smolvlm2")
image = cv2.imread("path/to/image.jpg")
# Generate caption (uses default prompt if none provided)
captions = model.prompt(
images=image,
prompt="Describe what's in this image.",
max_new_tokens=400
)
print(f"Caption: {captions[0]}")
Multi-image Understanding¶
import cv2
from inference_models import AutoModel
# Load model
model = AutoModel.from_pretrained("smolvlm2")
# Load multiple images
images = [
cv2.imread("path/to/image1.jpg"),
cv2.imread("path/to/image2.jpg"),
cv2.imread("path/to/image3.jpg")
]
# Analyze multiple images together in a single prompt
answers = model.prompt(
images=images,
prompt="What are the differences between these images?",
images_to_single_prompt=True,
max_new_tokens=400
)
print(f"Answer: {answers[0]}")
Using Fine-tuned Models¶
import cv2
from inference_models import AutoModel
# Load your fine-tuned model from Roboflow
model = AutoModel.from_pretrained(
"your-project/version",
api_key="your_roboflow_api_key"
)
image = cv2.imread("path/to/image.jpg")
# Use with custom prompt
answers = model.prompt(
images=image,
prompt="your custom question",
max_new_tokens=400
)
print(f"Answer: {answers[0]}")
Workflows Integration¶
SmolVLM2 can be used in Roboflow Workflows for complex computer vision pipelines.
Learn more: Workflows Documentation
Performance Tips¶
- Use GPU - SmolVLM2 benefits from GPU acceleration
- Optimize prompts - Clear, specific prompts yield better results
- Adjust max_new_tokens - Default is 400; increase for longer responses
- Multi-image prompts - Use
images_to_single_prompt=Trueto analyze multiple images together in a single prompt