PaliGemma / PaliGemma2 - Vision Language Model¶
PaliGemma and PaliGemma2 are versatile vision-language models developed by Google Research that combine the SigLIP vision encoder with the Gemma language model for multimodal understanding.
Overview¶
PaliGemma is a powerful VLM capable of handling diverse vision-language tasks:
- Visual Question Answering - Answer questions about image content
- Image Captioning - Generate descriptive captions for images
- Object Detection - Detect and locate objects through text prompts
- OCR - Extract and recognize text from images
- Document Understanding - Parse and analyze document content
GPU Recommended
PaliGemma works best with GPU acceleration. CPU inference may be very slow or may not work properly.
License & Attribution
License: Gemma Terms of Use
Source: Google Research
Paper: PaliGemma: A versatile 3B VLM for transfer
Terms: By using PaliGemma you agree to the Gemma Terms of Use
Pre-trained Model IDs¶
PaliGemma and PaliGemma2 pre-trained models are available and do not require a Roboflow API key.
PaliGemma Models¶
| Model ID | Description |
|---|---|
paligemma-3b-mix-224 |
3B model with 224x224 resolution - general purpose |
paligemma-3b-mix-448 |
3B model with 448x448 resolution - higher quality |
paligemma-3b-ft-cococap-224 |
Fine-tuned for image captioning (224px) |
paligemma-3b-ft-cococap-448 |
Fine-tuned for image captioning (448px) |
paligemma-3b-ft-vqav2-224 |
Fine-tuned for visual question answering (224px) |
paligemma-3b-ft-vqav2-448 |
Fine-tuned for visual question answering (448px) |
paligemma-3b-ft-docvqa-224 |
Fine-tuned for document VQA (224px) |
paligemma-3b-ft-docvqa-448 |
Fine-tuned for document VQA (448px) |
paligemma-3b-ft-ocrvqa-224 |
Fine-tuned for OCR VQA (224px) |
paligemma-3b-ft-ocrvqa-448 |
Fine-tuned for OCR VQA (448px) |
paligemma-3b-ft-screen2words-224 |
Fine-tuned for UI understanding (224px) |
paligemma-3b-ft-screen2words-448 |
Fine-tuned for UI understanding (448px) |
paligemma-3b-ft-tallyqa-224 |
Fine-tuned for counting tasks (224px) |
paligemma-3b-ft-tallyqa-448 |
Fine-tuned for counting tasks (448px) |
PaliGemma2 Models¶
| Model ID | Description |
|---|---|
paligemma2-3b-pt-224 |
PaliGemma2 3B pre-trained model with 224x224 resolution |
You can also use fine-tuned models from Roboflow by specifying project/version as the model ID (requires API key).
Supported Backends¶
| Backend | Extras Required |
|---|---|
torch |
torch-cpu, torch-cu118, torch-cu124, torch-cu126, torch-cu128, torch-jp6-cu126 |
Roboflow Platform Compatibility¶
| Feature | Supported |
|---|---|
| Training | ✅ LoRA fine-tuning only (Guide) |
| Upload Weights | ✅ Upload fine-tuned models |
| Serverless API (v2) | ⚠️ Limited support (not yet fully stable) |
| Workflows | ✅ Use in Workflows via PaliGemma block |
| Edge Deployment (Jetson) | ❌ Not supported |
| Self-Hosting | ✅ Deploy with inference-models (GPU recommended) |
Training & Fine-tuning¶
PaliGemma supports LoRA (Low-Rank Adaptation) fine-tuning only on the Roboflow platform. This allows you to adapt the model to your specific use case without training the entire model.
When to Fine-tune PaliGemma¶
Fine-tuning PaliGemma is beneficial when you need:
- Domain-specific VQA - Answer questions specific to your industry or use case
- Custom captioning - Generate captions with domain-specific terminology
- Specialized document understanding - Parse forms, receipts, or technical documents
- Task-specific performance - Optimize for particular vision-language tasks
Recommended Use Cases for Fine-tuning¶
- ✅ Medical imaging - Answer questions about medical scans or reports
- ✅ Document processing - Extract information from invoices, forms, or contracts
- ✅ E-commerce - Describe products or answer product-related questions
- ✅ Education - Answer questions about diagrams, charts, or educational content
- ✅ Accessibility - Generate detailed descriptions for visually impaired users
Learn more: PaliGemma Multimodal Vision Guide
Supported Tasks¶
PaliGemma supports multiple vision-language tasks through natural language prompts:
| Task | When to Use |
|---|---|
| Visual Question Answering | Answer any question about image content - most versatile task for general queries |
| Image Captioning | Generate descriptive captions by using prompts like "caption" or "describe this image" |
| Object Detection | Detect objects by asking "detect [object]" or similar prompts |
| OCR | Extract text by using prompts like "read the text" or "what does it say" |
| Document VQA | Ask questions about document content, forms, or structured data |
| Counting | Count objects by asking "how many [objects]" |
Usage Examples¶
Visual Question Answering¶
import cv2
from inference_models import AutoModel
# Load model
model = AutoModel.from_pretrained("paligemma-3b-mix-448")
image = cv2.imread("path/to/image.jpg")
# Ask a question
answers = model.prompt(images=image, prompt="What is in this image?")
print(f"Answer: {answers[0]}")
Image Captioning¶
import cv2
from inference_models import AutoModel
# Load model
model = AutoModel.from_pretrained("paligemma-3b-ft-cococap-448")
image = cv2.imread("path/to/image.jpg")
# Generate caption
captions = model.prompt(images=image, prompt="caption")
print(f"Caption: {captions[0]}")
Document Understanding¶
import cv2
from inference_models import AutoModel
# Load model fine-tuned for document VQA
model = AutoModel.from_pretrained("paligemma-3b-ft-docvqa-448")
image = cv2.imread("path/to/document.jpg")
# Ask about document content
answers = model.prompt(images=image, prompt="What is the total amount?")
print(f"Answer: {answers[0]}")
Using Fine-tuned Models¶
import cv2
from inference_models import AutoModel
# Load your fine-tuned model from Roboflow
model = AutoModel.from_pretrained(
"your-project/version",
api_key="your_roboflow_api_key"
)
image = cv2.imread("path/to/image.jpg")
# Use with custom prompt for your use case
answers = model.prompt(images=image, prompt="your custom question")
print(f"Answer: {answers[0]}")
Workflows Integration¶
PaliGemma can be used in Roboflow Workflows for complex computer vision pipelines. The PaliGemma block supports all task types and can be combined with other blocks for advanced processing.
Learn more: Workflows Documentation
Performance Tips¶
- Use GPU - PaliGemma requires GPU for acceptable performance
- Choose the right resolution - Use 224px for speed, 448px for accuracy
- Use task-specific models - Fine-tuned models (e.g.,
ft-docvqa) perform better on specific tasks - Optimize prompts - Clear, specific prompts yield better results
- Fine-tune for your domain - LoRA fine-tuning significantly improves task-specific performance