# DocTR - Optical Character Recognition
DocTR (Document Text Recognition) is a comprehensive OCR solution developed by Mindee. It combines text detection and recognition to extract text from documents and images.
## Overview
Resources: GitHub Repository | Documentation
DocTR provides end-to-end document text recognition with both detection and recognition stages. Key features include:
- Two-stage pipeline - Separate detection and recognition models for optimal performance
- Document-focused - Optimized for document layouts and structured text
- Multiple architectures - Various detection and recognition model combinations
- Production-ready - Battle-tested on real-world documents
- Flexible deployment - Multiple model size options for different use cases
## OCR Type: Structured OCR
Structured OCR detects text regions in an image and recognizes the text within each region, returning both the text content and bounding box coordinates for each detected text block.
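For illustration, structured OCR conceptually pairs each recognized string with the region it was read from. A minimal sketch (the field names and pixel values below are hypothetical, not the actual DocTR API):

```python
# Hypothetical illustration of structured OCR output: one entry per detected text block,
# pairing the recognized string with its bounding box in pixel coordinates.
structured_result = [
    {"text": "INVOICE #1024", "box": (52, 40, 340, 88)},   # (x_min, y_min, x_max, y_max)
    {"text": "Total: $42.00", "box": (52, 612, 268, 648)},
]
```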
## When to Use DocTR
- ✅ Documents and forms - Scanned documents, invoices, receipts
- ✅ Multi-line text - Paragraphs and structured layouts
- ✅ Text localization needed - When you need to know where text appears
- ✅ Mixed content - Documents with text in various locations
## When to Use Other OCR Models
- EasyOCR: Better for scene text (signs, labels) and multi-language support
- TrOCR: Better for single-line, pre-cropped text (serial numbers, labels)
## License
DocTR is licensed under Apache 2.0, making it free to use for both commercial and non-commercial purposes.
Learn more: Apache 2.0 License
## Pre-trained Model IDs
Pre-trained DocTR models are available via the Roboflow API and require a Roboflow API key.
**Getting a Roboflow API Key:** To use DocTR models, you'll need a Roboflow account (free) and an API key.
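A minimal sketch of one common setup pattern (the environment variable name is an assumption, not something mandated by this page): keep the key out of source code and pass it explicitly when loading a model.

```python
import os

# Assumed convention: store the key in an environment variable and read it at runtime.
api_key = os.environ["ROBOFLOW_API_KEY"]  # hypothetical variable name
```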
DocTR model IDs combine a detection model and a recognition model using the format: `doctr-{detection}/{recognition}`
### Detection Models
Detection models locate text regions in the image and output bounding boxes.
| Model | ID Chunk | Speed | Accuracy | Description |
|---|---|---|---|---|
| FAST Tiny | `fast-t` | Very Fast | Low | Lightweight FAST architecture |
| FAST Small | `fast-s` | Fast | Medium | Balanced FAST variant |
| FAST Base | `fast-b` | Medium | Good | Standard FAST model |
| DB ResNet50 | `dbnet-rn50` | Medium | High | Differentiable Binarization with ResNet50 backbone |
| DB ResNet34 | `dbnet-rn34` | Medium | High | DB with ResNet34 backbone |
| DB MobileNet V3 Large | `db-net-mobilenet-v3-l` | Fast | Medium | DB with efficient MobileNet backbone |
| LinkNet ResNet18 | `linknet-rn18` | Fast | Medium | LinkNet segmentation with ResNet18 |
| LinkNet ResNet34 | `linknet-rn34` | Medium | Good | LinkNet with ResNet34 |
| LinkNet ResNet50 | `linknet-rn50` | Medium | High | LinkNet with ResNet50 |
### Recognition Models
Recognition models read the text content from detected regions.
| Model | ID Chunk | Speed | Accuracy | Description |
|---|---|---|---|---|
| CRNN VGG16 | `crnn-vgg16` | Medium | Good | CNN-RNN hybrid with VGG16 encoder |
| CRNN MobileNet V3 Small | `crnn-mobilenet-v3-small` | Very Fast | Medium | Efficient CRNN with small MobileNet |
| CRNN MobileNet V3 Large | `crnn-mobilenet-v3-large` | Fast | Good | CRNN with larger MobileNet |
| SAR ResNet31 | `sar-rn31` | Slow | High | Show, Attend and Read - attention-based |
| MASTER | `master` | Slow | High | Multi-Aspect Non-local Network |
| ViTSTR Small | `vitstr-s` | Medium | Good | Vision Transformer for text recognition |
| PARSeq | `parseq` | Medium | High | Permutation Language Modeling - state-of-the-art |
### Example Model IDs
Combine any detection model with any recognition model:
- `doctr-dbnet-rn50/crnn-vgg16` - DB ResNet50 detection + CRNN VGG16 recognition
- `doctr-fast-b/parseq` - FAST Base detection + PARSeq recognition
- `doctr-db-net-mobilenet-v3-l/crnn-mobilenet-v3-small` - MobileNet detection + MobileNet recognition (fastest)
- `doctr-dbnet-rn50/sar-rn31` - DB ResNet50 detection + SAR ResNet31 recognition (highest accuracy)
Total combinations: 9 detection models × 7 recognition models = 63 possible model configurations
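As a quick sanity check of that count, the full set of model IDs can be enumerated directly from the ID chunks listed in the two tables above (a minimal sketch; only the chunk strings are taken from this page):

```python
from itertools import product

detection_chunks = [
    "fast-t", "fast-s", "fast-b",
    "dbnet-rn50", "dbnet-rn34", "db-net-mobilenet-v3-l",
    "linknet-rn18", "linknet-rn34", "linknet-rn50",
]
recognition_chunks = [
    "crnn-vgg16", "crnn-mobilenet-v3-small", "crnn-mobilenet-v3-large",
    "sar-rn31", "master", "vitstr-s", "parseq",
]

# Build every doctr-{detection}/{recognition} model ID.
model_ids = [f"doctr-{d}/{r}" for d, r in product(detection_chunks, recognition_chunks)]
print(len(model_ids))   # 63
print(model_ids[0])     # doctr-fast-t/crnn-vgg16
```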
## Supported Backends
| Backend | Extras Required |
|---|---|
| `torch` | `torch-cpu`, `torch-cu118`, `torch-cu124`, `torch-cu126`, `torch-cu128`, `torch-jp6-cu126` |
## Roboflow Platform Compatibility
| Feature | Supported |
|---|---|
| Training | ❌ Not available for training |
| Upload Weights | ❌ Not supported |
| Serverless API (v2) | ✅ Deploy via hosted API |
| Workflows | ✅ Use in Workflows via OCR block |
| Edge Deployment (Jetson) | ✅ Deploy on NVIDIA Jetson devices |
| Self-Hosting | ✅ Deploy with inference-models |
## Installation
Install with PyTorch extras:

- PyTorch: `torch-cpu`, `torch-cu118`, `torch-cu124`, `torch-cu126`, `torch-cu128`, `torch-jp6-cu126`
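A minimal install sketch, assuming the package is published as `inference-models` and that the extras above can be selected at install time (pick the extra that matches your hardware; adjust if your environment uses a different package name):

```bash
pip install "inference-models[torch-cpu]"     # CPU-only, for example
pip install "inference-models[torch-cu124]"   # or a CUDA 12.4 build
```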
## Usage Example
```python
import cv2
import supervision as sv
from inference_models import AutoModel

# Load DocTR model (DB ResNet50 detection + PARSeq recognition)
model = AutoModel.from_pretrained(
    "doctr-dbnet-rn50/parseq",
    api_key="your_roboflow_api_key",
)

# Load image
image = cv2.imread("path/to/document.jpg")

# Run inference - returns (texts, detections)
texts, detections = model(image)

# Print detected text
for text in texts:
    print(f"Detected: {text}")

# Visualize results with supervision
box_annotator = sv.BoxAnnotator()
label_annotator = sv.LabelAnnotator()

# Annotate image with bounding boxes and text labels
annotated_image = box_annotator.annotate(scene=image.copy(), detections=detections[0])
annotated_image = label_annotator.annotate(scene=annotated_image, detections=detections[0], labels=texts)

# Save or display
cv2.imwrite("output.jpg", annotated_image)
```
## Output Format
DocTR returns a tuple of `(List[str], List[Detections])`:
- `texts`: list of recognized text strings, one per detected text region
- `detections`: list of `Detections` objects with bounding boxes and metadata
This structured output allows you to know both what text was detected and where it appears in the image.
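A minimal sketch of consuming this output for a single input image, assuming the `Detections` objects follow the `supervision` convention of exposing boxes as an `xyxy` array:

```python
# texts: List[str], detections: List[sv.Detections], as described above.
texts, detections = model(image)

# Pair each recognized string with the bounding box it was read from.
for text, (x_min, y_min, x_max, y_max) in zip(texts, detections[0].xyxy):
    print(f"{text!r} at ({x_min:.0f}, {y_min:.0f}, {x_max:.0f}, {y_max:.0f})")
```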