TrOCR - Transformer-based Optical Character Recognition¶
TrOCR is a transformer-based OCR model developed by Microsoft that excels at recognizing text from pre-cropped image regions. It uses a vision encoder-decoder architecture for end-to-end text recognition.
Overview¶
Resources: Research Paper | Hugging Face Models
TrOCR provides state-of-the-art text recognition for single-line text regions. Key features include:
- Transformer architecture - Modern encoder-decoder design for superior accuracy
- Pre-cropped text - Optimized for single text regions (no detection stage)
- High accuracy - Excellent recognition quality on clean text
- Multiple variants - Small, base, and large models for different accuracy/speed tradeoffs
- Handwriting support - Specialized models for handwritten text (can be added upon request)
OCR Type: Unstructured OCR¶
Unstructured OCR recognizes text from a pre-cropped image region containing a single line of text. It returns only the recognized text string without bounding box information.
When to Use TrOCR¶
- ✅ Pre-cropped text - When you already have isolated text regions
- ✅ Single-line text - Serial numbers, labels, captions, single words
- ✅ High accuracy needed - When recognition quality is critical
- ✅ Handwritten text - Use handwriting-specific models
- ✅ After object detection - Recognize text in detected bounding boxes
When to Use Other OCR Models¶
- DocTR: Better for full documents where you need to detect text locations
- EasyOCR: Better for scene text with detection and multi-language support
License¶
MIT License
Open Source License
TrOCR is licensed under MIT, making it free for both commercial and non-commercial use without restrictions.
Learn more: MIT License
Pre-trained Model IDs¶
Pre-trained TrOCR models are available via the Roboflow API and require a Roboflow API key.
Getting a Roboflow API Key
To use TrOCR models, you'll need a Roboflow account (free) and API key.
| Model Variant | Model ID | Use Case | Size |
|---|---|---|---|
| Small (Printed) | microsoft/trocr-small-printed |
Fast printed text recognition | Small |
| Base (Printed) | microsoft/trocr-base-printed |
Balanced printed text recognition | Base |
| Large (Printed) | microsoft/trocr-large-printed |
High-accuracy printed text | Large |
Recommendation: Start with small or base models for faster inference. Use large models when accuracy is critical.
Supported Backends¶
| Backend | Extras Required |
|---|---|
torch |
torch-cpu, torch-cu118, torch-cu124, torch-cu126, torch-cu128, torch-jp6-cu126 |
Roboflow Platform Compatibility¶
| Feature | Supported |
|---|---|
| Training | ❌ Not available for training |
| Upload Weights | ❌ Not supported |
| Serverless API (v2) | ✅ Deploy via hosted API |
| Workflows | ✅ Use in Workflows via OCR block |
| Edge Deployment (Jetson) | ✅ Deploy on NVIDIA Jetson devices |
| Self-Hosting | ✅ Deploy with inference-models |
Installation¶
Install with PyTorch extras:
- PyTorch:
torch-cpu,torch-cu118,torch-cu124,torch-cu126,torch-cu128,torch-jp6-cu126
Usage Example¶
import cv2
from inference_models import AutoModel
# Load TrOCR model for printed text
model = AutoModel.from_pretrained(
"microsoft/trocr-base-printed",
api_key="your_roboflow_api_key"
)
# Load pre-cropped images containing single lines of text
images = [
cv2.imread("path/to/cropped_text1.jpg"),
cv2.imread("path/to/cropped_text2.jpg"),
]
# Run inference - returns list of recognized text strings
texts = model(images)
# Print results
for i, text in enumerate(texts):
print(f"Image {i+1}: {text}")
Combining with Object Detection¶
TrOCR works great for recognizing text in detected regions (e.g., after object detection). Instead of manually combining models in code, we recommend using Roboflow Workflows to easily create pipelines that:
- Detect text regions with object detection or DocTR/EasyOCR
- Crop detected regions
- Run TrOCR on each cropped region
- Return structured results
Learn more: Roboflow Workflows Documentation
Output Format¶
TrOCR returns a List[str] containing the recognized text from the input images.
- Single image input: Returns a list with one string
- Batch input: Returns a list of strings, one per image
Important: Each input image should contain a single line of text. For multi-line text or text detection, use DocTR or EasyOCR instead.