Florence-2 - Vision Language Model¶
Florence-2 is a versatile vision-language model developed by Microsoft Research that can perform a wide range of vision tasks through natural language prompts.
Overview¶
Florence-2 is a unified, prompt-based model capable of handling diverse computer vision and vision-language tasks:
- Object Detection - Detect and locate objects in images
- Instance Segmentation - Segment individual object instances
- Image Captioning - Generate descriptive captions for images
- Optical Character Recognition (OCR) - Extract text from images
- Phrase Grounding - Locate objects based on text descriptions
- Region Captioning - Generate captions for specific image regions
- Open Vocabulary Detection - Detect objects from custom class lists
GPU Recommended
Florence-2 works best with GPU acceleration. CPU inference may be very slow or may not work properly for larger models.
License & Attribution
License: MIT
Source: Microsoft Research
Paper: Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
Pre-trained Model IDs¶
Florence-2 pre-trained models are available and do not require a Roboflow API key.
| Model ID | Description |
|---|---|
| `florence-2-base` | Base model (0.23B parameters) - faster inference |
| `florence-2-large` | Large model (0.77B parameters) - better accuracy |
You can also use fine-tuned models from Roboflow by specifying project/version as the model ID (requires API key).
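As a rough illustration of the rule above, the check below decides whether a model ID needs a Roboflow API key. `PRETRAINED_IDS` and `needs_api_key` are hypothetical names for this sketch, not part of `inference_models`, and the sketch assumes that any ID other than the two pre-trained ones refers to a fine-tuned Roboflow model.

```python
# Hypothetical helper for illustration; not part of inference_models.
PRETRAINED_IDS = frozenset({"florence-2-base", "florence-2-large"})


def needs_api_key(model_id: str) -> bool:
    """Return True when the ID is not one of the public pre-trained models."""
    return model_id.strip().lower() not in PRETRAINED_IDS


for model_id in ("florence-2-base", "florence-2-large", "project/version"):
    print(model_id, "-> API key required:", needs_api_key(model_id))
```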
Supported Backends¶
| Backend | Extras Required |
|---|---|
| `torch` | `torch-cpu`, `torch-cu118`, `torch-cu124`, `torch-cu126`, `torch-cu128`, `torch-jp6-cu126` |
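One way to choose among these extras is to key them by CUDA version. The sketch below assumes the names follow the usual PyTorch wheel convention (`cu124` meaning CUDA 12.4), which this page does not state explicitly. `pick_extra` is a hypothetical helper, and the JetPack-specific `torch-jp6-cu126` extra is left out on purpose.

```python
from typing import Optional

# Extras from the table above, keyed by the CUDA version their names appear
# to encode (an assumption based on the PyTorch wheel naming convention).
CUDA_EXTRAS = {
    "11.8": "torch-cu118",
    "12.4": "torch-cu124",
    "12.6": "torch-cu126",
    "12.8": "torch-cu128",
}


def pick_extra(cuda_version: Optional[str]) -> str:
    """Map a CUDA version such as "12.4" to an extra; None means no GPU."""
    if cuda_version is None:
        return "torch-cpu"
    try:
        return CUDA_EXTRAS[cuda_version]
    except KeyError:
        raise ValueError(f"no extra listed for CUDA {cuda_version}") from None


print(pick_extra(None))
print(pick_extra("12.4"))
```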
Roboflow Platform Compatibility¶
| Feature | Supported |
|---|---|
| Training | ✅ LoRA fine-tuning only (Guide) |
| Upload Weights | ✅ Upload fine-tuned models |
| Serverless API (v2) | ⚠️ Limited support (not yet fully stable) |
| Workflows | ✅ Use in Workflows via Florence-2 block |
| Edge Deployment (Jetson) | ❌ Not supported |
| Self-Hosting | ✅ Deploy with inference-models (GPU recommended) |
Training & Fine-tuning¶
On the Roboflow platform, Florence-2 supports LoRA (Low-Rank Adaptation) fine-tuning only; full fine-tuning is not offered. LoRA adapts the model to your use case by training a small set of added low-rank weights instead of the entire model.
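To give a sense of why LoRA is cheaper than training the whole model: for a frozen weight matrix of shape `d_out x d_in`, LoRA trains two small factors, `B` (`d_out x r`) and `A` (`r x d_in`). The arithmetic below is a sketch with illustrative sizes; the dimensions and rank are assumptions, not Florence-2's actual layer shapes or the rank Roboflow uses.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters in the frozen weight matrix W (d_out x d_in)."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


# Illustrative sizes only (assumed for this example).
d_out, d_in, rank = 1024, 1024, 8
full = full_params(d_out, d_in)
lora = lora_trainable_params(d_out, d_in, rank)
print(f"full: {full:,}  lora: {lora:,}  fraction trained: {lora / full:.4f}")
```

With these sizes the adapter has 16,384 trainable parameters against 1,048,576 in the frozen matrix.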
When to Fine-tune Florence-2¶
Fine-tuning Florence-2 is beneficial when you need:
- Domain-specific object detection - Detect specialized objects in your industry (medical, industrial, etc.)
- Custom captioning style - Generate captions that match your specific terminology or format
- Specialized OCR - Improve text recognition for domain-specific fonts or layouts
- Task-specific performance - Optimize for a particular vision task with your data
Recommended Use Cases for Fine-tuning¶
- ✅ Medical imaging - Detect anatomical structures or abnormalities
- ✅ Industrial inspection - Identify defects or components in manufacturing
- ✅ Document analysis - Extract structured data from forms or receipts
- ✅ Retail - Detect products or analyze shelf layouts
- ✅ Agriculture - Identify crops, pests, or plant diseases
Learn more: How to Fine-tune Florence-2
Supported Tasks¶
Florence-2 supports multiple vision tasks. Use the high-level API methods for each task:
| Task | Method | When to Use |
|---|---|---|
| Image Captioning | `caption_image()` | Generate natural language descriptions of images at different levels of detail (normal, detailed, very detailed) |
| Object Detection | `detect_objects()` | Detect and locate all objects in an image, or detect specific classes using open vocabulary detection |
| OCR | `ocr_image()` | Extract all text from images without location information |
| OCR with Detection | `parse_document()` | Extract text with bounding box locations - ideal for document parsing and structured text extraction |
| Phrase Grounding | `ground_phrase()` | Locate objects in an image based on a text description (e.g., "red car", "person wearing hat") |
| Instance Segmentation | `segment_phrase()` | Generate segmentation masks for objects matching a text description |
| Region Segmentation | `segment_region()` | Generate segmentation masks for objects within specified bounding boxes |
| Region Classification | `classify_image_region()` | Classify what object is in a specific region of the image |
| Region Captioning | `caption_image_region()` | Generate captions for specific regions of the image |
| Region OCR | `ocr_image_region()` | Extract text from specific regions of the image |
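If the task is chosen at runtime (for example from a config file), the table can be turned into a small dispatch map. This is a sketch only: the method names come from the table, while the task keys, `TASK_METHODS`, and `run_task` are hypothetical names introduced here. The keyword arguments are passed through unchanged, so they must match what each method accepts.

```python
from typing import Any

# Task key (made up for this sketch) -> method name from the table above.
TASK_METHODS = {
    "caption": "caption_image",
    "detect": "detect_objects",
    "ocr": "ocr_image",
    "ocr_with_detection": "parse_document",
    "ground": "ground_phrase",
    "segment_phrase": "segment_phrase",
    "segment_region": "segment_region",
    "classify_region": "classify_image_region",
    "caption_region": "caption_image_region",
    "ocr_region": "ocr_image_region",
}


def run_task(model: Any, task: str, **kwargs: Any) -> Any:
    """Look up the method for `task` on `model` and call it with kwargs."""
    try:
        method_name = TASK_METHODS[task]
    except KeyError:
        known = ", ".join(sorted(TASK_METHODS))
        raise ValueError(f"unknown task {task!r}; expected one of: {known}") from None
    return getattr(model, method_name)(**kwargs)
```

With a loaded model, a call would look like `run_task(model, "ground", images=image, phrase="red car")`.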
Usage Examples¶
Image Captioning¶
import cv2
from inference_models import AutoModel
# Load model
model = AutoModel.from_pretrained("florence-2-base")
# Load image
image = cv2.imread("path/to/image.jpg")
# Generate caption (normal, detailed, or very_detailed)
captions = model.caption_image(images=image, granularity="detailed")
print(f"Caption: {captions[0]}")
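To compare the three levels of detail on the same image, the call above can be repeated once per granularity. The sketch below assumes the values `"normal"`, `"detailed"`, and `"very_detailed"` from the comment in the example, and that `caption_image` returns one caption per input image; `caption_all_levels` is a hypothetical helper name.

```python
from typing import Any, Dict

# Granularity values taken from the comment in the example above.
GRANULARITIES = ("normal", "detailed", "very_detailed")


def caption_all_levels(model: Any, image: Any) -> Dict[str, Any]:
    """Call caption_image once per granularity and keep the first result."""
    return {
        level: model.caption_image(images=image, granularity=level)[0]
        for level in GRANULARITIES
    }
```

With a loaded model and image, `caption_all_levels(model, image)` returns a dict keyed by granularity.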
Object Detection¶
import cv2
from inference_models import AutoModel
# Load model
model = AutoModel.from_pretrained("florence-2-base")
image = cv2.imread("path/to/image.jpg")
# Detect all objects with class labels
detections = model.detect_objects(images=image, labels_mode="classes")
print(detections[0])
# Detect specific classes (open vocabulary)
detections = model.detect_objects(
    images=image,
    classes=["person", "car", "dog"]
)
print(detections[0])
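The examples on this page pass the result of `cv2.imread` straight to the model. `cv2.imread` returns `None` rather than raising when a file is missing or unreadable, so a small guard can make that failure easier to diagnose. `require_image` is a hypothetical helper for this sketch, not part of `inference_models` or OpenCV.

```python
from typing import Optional, TypeVar

T = TypeVar("T")


def require_image(image: Optional[T], path: str) -> T:
    """Raise instead of passing a failed cv2.imread result to the model."""
    if image is None:
        raise FileNotFoundError(f"could not read an image from {path!r}")
    return image
```

In the examples, this would wrap the load step: `image = require_image(cv2.imread(path), path)`.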
OCR (Text Extraction)¶
import cv2
from inference_models import AutoModel
# Load model
model = AutoModel.from_pretrained("florence-2-base")
image = cv2.imread("path/to/document.jpg")
# Extract text only
text = model.ocr_image(images=image)
print(f"Text: {text[0]}")
# Extract text with bounding boxes
detections = model.parse_document(images=image)
print(detections[0]) # Detections with text in bboxes_metadata
Phrase Grounding¶
import cv2
from inference_models import AutoModel
# Load model
model = AutoModel.from_pretrained("florence-2-base")
image = cv2.imread("path/to/image.jpg")
# Find objects matching a phrase
detections = model.ground_phrase(images=image, phrase="red car")
print(detections[0])
Instance Segmentation¶
import cv2
from inference_models import AutoModel
# Load model
model = AutoModel.from_pretrained("florence-2-base")
image = cv2.imread("path/to/image.jpg")
# Segment objects matching a phrase
instance_detections = model.segment_phrase(images=image, phrase="person")
print(instance_detections[0]) # InstanceDetections with masks
Using Fine-tuned Models¶
import cv2
from inference_models import AutoModel
# Load your fine-tuned model from Roboflow
model = AutoModel.from_pretrained(
    "your-workspace/your-model/version",
    api_key="your_roboflow_api_key"
)
image = cv2.imread("path/to/image.jpg")
# Use the model (API depends on what you fine-tuned for)
detections = model.detect_objects(images=image)
print(detections[0])
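Rather than hard-coding the key as in the example above, it can be read from the environment. This is a minimal sketch: the variable name `ROBOFLOW_API_KEY` and the helper `read_api_key` are assumptions made for illustration, and this page does not say whether `inference_models` reads any environment variable on its own.

```python
import os
from typing import Mapping, Optional

# Assumed variable name; use whatever your deployment defines.
API_KEY_ENV_VAR = "ROBOFLOW_API_KEY"


def read_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key from the environment or fail with a clear message."""
    env = os.environ if env is None else env
    key = env.get(API_KEY_ENV_VAR, "").strip()
    if not key:
        raise RuntimeError(f"set {API_KEY_ENV_VAR} before loading a fine-tuned model")
    return key
```

The result can then be passed as `api_key=read_api_key()` in the `AutoModel.from_pretrained` call.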
Workflows Integration¶
Florence-2 can be used in Roboflow Workflows for complex computer vision pipelines. The Florence-2 block supports all task types and can be combined with other blocks for advanced processing.
Learn more: Workflows Documentation
Performance Tips¶
- Use a GPU - CPU inference may be very slow or may not work properly for the larger model
- Choose the right model size - Use `base` for speed, `large` for accuracy
- Optimize prompts - Use specific task prompts for better results
- Fine-tune for your domain - LoRA fine-tuning significantly improves task-specific performance
- Batch processing - Process multiple images together when possible
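For the batch processing tip, a list of images can be split into fixed-size groups before inference. The sketch below shows only the chunking step; `chunked` is a hypothetical helper, and passing a list to `images=` is an assumption suggested by the indexed return values (`captions[0]`) in the examples, not something this page confirms.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


paths = ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"]
for batch in chunked(paths, 2):
    # Assumed usage: load each path, then call e.g. model.caption_image(images=...)
    print(batch)
```

A suitable batch size depends on GPU memory and image resolution, so it is worth starting small.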
Key Differences from Other VLMs¶
| Feature | Florence-2 | PaliGemma | Qwen2.5-VL |
|---|---|---|---|
| Model Size | 0.23B - 0.77B | 3B - 10B | 3B - 72B |
| Speed | Fast | Medium | Slower |
| Task Versatility | Very High | High | Very High |
| OCR Quality | Excellent | Good | Excellent |
| Fine-tuning | LoRA | LoRA | LoRA |
| License | MIT | Gemma License | Apache 2.0 |