DocTR - Optical Character Recognition

DocTR (Document Text Recognition) is a comprehensive OCR solution developed by Mindee. It combines text detection and recognition to extract text from documents and images.

Overview

Resources: GitHub Repository | Documentation

DocTR provides end-to-end document text recognition with both detection and recognition stages. Key features include:

  • Two-stage pipeline - Separate detection and recognition models for optimal performance
  • Document-focused - Optimized for document layouts and structured text
  • Multiple architectures - Various detection and recognition model combinations
  • Production-ready - Battle-tested on real-world documents
  • Flexible deployment - Multiple model size options for different use cases

OCR Type: Structured OCR

Structured OCR detects text regions in an image and recognizes the text within each region, returning both the text content and bounding box coordinates for each detected text block.

When to Use DocTR

  • Documents and forms - Scanned documents, invoices, receipts
  • Multi-line text - Paragraphs and structured layouts
  • Text localization needed - When you need to know where text appears
  • Mixed content - Documents with text in various locations

When to Use Other OCR Models

  • EasyOCR: Better for scene text (signs, labels) and multi-language support
  • TrOCR: Better for single-line, pre-cropped text (serial numbers, labels)

License

Apache 2.0

Open Source License

DocTR is licensed under Apache 2.0, a permissive open-source license that permits both commercial and non-commercial use.

Learn more: Apache 2.0 License

Pre-trained Model IDs

Pre-trained DocTR models are available via the Roboflow API and require a Roboflow API key.

Getting a Roboflow API Key

To use DocTR models, you'll need a Roboflow account (free) and API key.

DocTR model IDs combine a detection model and a recognition model using the format: doctr-{detection}/{recognition}

Detection Models

Detection models locate text regions in the image and output bounding boxes.

| Model | ID Chunk | Speed | Accuracy | Description |
|---|---|---|---|---|
| FAST Tiny | fast-t | Very Fast | Low | Lightweight FAST architecture |
| FAST Small | fast-s | Fast | Medium | Balanced FAST variant |
| FAST Base | fast-b | Medium | Good | Standard FAST model |
| DB ResNet50 | dbnet-rn50 | Medium | High | Differentiable Binarization with ResNet50 backbone |
| DB ResNet34 | dbnet-rn34 | Medium | High | DB with ResNet34 backbone |
| DB MobileNet V3 Large | db-net-mobilenet-v3-l | Fast | Medium | DB with efficient MobileNet backbone |
| LinkNet ResNet18 | linknet-rn18 | Fast | Medium | LinkNet segmentation with ResNet18 |
| LinkNet ResNet34 | linknet-rn34 | Medium | Good | LinkNet with ResNet34 |
| LinkNet ResNet50 | linknet-rn50 | Medium | High | LinkNet with ResNet50 |

Recognition Models

Recognition models read the text content from detected regions.

| Model | ID Chunk | Speed | Accuracy | Description |
|---|---|---|---|---|
| CRNN VGG16 | crnn-vgg16 | Medium | Good | CNN-RNN hybrid with VGG16 encoder |
| CRNN MobileNet V3 Small | crnn-mobilenet-v3-small | Very Fast | Medium | Efficient CRNN with small MobileNet |
| CRNN MobileNet V3 Large | crnn-mobilenet-v3-large | Fast | Good | CRNN with larger MobileNet |
| SAR ResNet31 | sar-rn31 | Slow | High | Show, Attend and Read - attention-based |
| MASTER | master | Slow | High | Multi-Aspect Non-local Network |
| ViTSTR Small | vitstr-s | Medium | Good | Vision Transformer for text recognition |
| PARSeq | parseq | Medium | High | Permutation Language Modeling - state-of-the-art |

Example Model IDs

Combine any detection model with any recognition model:

  • doctr-dbnet-rn50/crnn-vgg16 - DB ResNet50 detection + CRNN VGG16 recognition
  • doctr-fast-b/parseq - FAST Base detection + PARSeq recognition
  • doctr-db-net-mobilenet-v3-l/crnn-mobilenet-v3-small - MobileNet detection + MobileNet recognition (fastest)
  • doctr-dbnet-rn50/sar-rn31 - DB ResNet50 detection + SAR ResNet31 recognition (highest accuracy)

Total combinations: 9 detection models × 7 recognition models = 63 possible model configurations
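
The snippet below is a quick sketch that enumerates every model ID from the ID chunks listed in the tables above and confirms the count:

from itertools import product

detection = [
    "fast-t", "fast-s", "fast-b",
    "dbnet-rn50", "dbnet-rn34", "db-net-mobilenet-v3-l",
    "linknet-rn18", "linknet-rn34", "linknet-rn50",
]
recognition = [
    "crnn-vgg16", "crnn-mobilenet-v3-small", "crnn-mobilenet-v3-large",
    "sar-rn31", "master", "vitstr-s", "parseq",
]

# Build every model ID following the doctr-{detection}/{recognition} format
model_ids = [f"doctr-{d}/{r}" for d, r in product(detection, recognition)]
print(len(model_ids))  # 63
print(model_ids[0])    # doctr-fast-t/crnn-vgg16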

Supported Backends

| Backend | Extras Required |
|---|---|
| torch | torch-cpu, torch-cu118, torch-cu124, torch-cu126, torch-cu128, torch-jp6-cu126 |

Roboflow Platform Compatibility

| Feature | Supported |
|---|---|
| Training | ❌ Not available for training |
| Upload Weights | ❌ Not supported |
| Serverless API (v2) | ✅ Deploy via hosted API |
| Workflows | ✅ Use in Workflows via OCR block |
| Edge Deployment (Jetson) | ✅ Deploy on NVIDIA Jetson devices |
| Self-Hosting | ✅ Deploy with inference-models |

Installation

Install with PyTorch extras:

  • PyTorch: torch-cpu, torch-cu118, torch-cu124, torch-cu126, torch-cu128, torch-jp6-cu126
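
For example, to install with CUDA 12.6 support (a sketch: the package name inference-models follows the Self-Hosting row above, and the exact extras syntax is assumed to follow standard pip conventions; swap the extra to match your hardware):

pip install "inference-models[torch-cu126]"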

Usage Example

import cv2
import supervision as sv
from inference_models import AutoModel

# Load DocTR model (DB ResNet50 detection + PARSeq recognition)
model = AutoModel.from_pretrained(
    "doctr-dbnet-rn50/parseq",
    api_key="your_roboflow_api_key"
)

# Load image
image = cv2.imread("path/to/document.jpg")

# Run inference - returns (texts, detections)
# texts: recognized strings; detections: list of sv.Detections (one per image)
texts, detections = model(image)

# Print detected text
for text in texts:
    print(f"Detected: {text}")

# Visualize results with supervision
box_annotator = sv.BoxAnnotator()
label_annotator = sv.LabelAnnotator()

# Annotate image with bounding boxes and text labels
annotated_image = box_annotator.annotate(scene=image.copy(), detections=detections[0])
annotated_image = label_annotator.annotate(scene=annotated_image, detections=detections[0], labels=texts)

# Save or display
cv2.imwrite("output.jpg", annotated_image)

Output Format

DocTR returns a tuple of (List[str], List[Detections]):

  • texts: List of recognized text strings, one per detected text region
  • detections: List of Detections objects with bounding boxes and metadata

This structured output allows you to know both what text was detected and where it appears in the image.
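
As a minimal sketch of consuming this output for a single image (assuming, as in the Usage Example above, that detections[0] is a supervision Detections object whose xyxy boxes are index-aligned with texts):

# Pair each recognized string with its bounding box
# (sketch; assumes texts and detections[0].xyxy are index-aligned)
for text, (x1, y1, x2, y2) in zip(texts, detections[0].xyxy):
    print(f"'{text}' at ({x1:.0f}, {y1:.0f}) -> ({x2:.0f}, {y2:.0f})")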