GLM-OCR - Vision-Language OCR¶
GLM-OCR is a vision-language model developed by Zhipu AI (ZAI) that excels at optical character recognition using an image-text-to-text architecture. It combines visual understanding with text generation for accurate text recognition.
Overview¶
Resources: Hugging Face Model
GLM-OCR provides high-quality text recognition using a modern vision-language approach. Key features include:
- Vision-language architecture - Uses AutoModelForImageTextToText for end-to-end OCR
- Prompt-based - Customizable prompts for different recognition tasks
- High accuracy - Strong performance on diverse text types
- Flash Attention support - Automatic flash attention on supported GPUs (Ampere+)
OCR Type: Unstructured OCR¶
Unstructured OCR recognizes text from an image using a vision-language model. It returns the recognized text string based on the given prompt.
When to Use GLM-OCR¶
- Serial numbers & labels - Recognizing text from product labels, serial numbers
- Scene text - Text in natural images and photos
- Document text - Recognizing text from document images
- Custom prompts - When you need to guide the model with specific instructions
When to Use Other OCR Models¶
- DocTR: Better for full document layout analysis with text detection and bounding boxes
- EasyOCR: Better for multi-language scene text with detection and localization
- TrOCR: Lighter alternative for simple single-line pre-cropped text
License¶
MIT License
Open Source License
GLM-OCR is licensed under MIT by Zhipu AI (ZAI), making it free for both commercial and non-commercial use without restrictions.
Learn more: MIT License
Pre-trained Model IDs¶
Pre-trained GLM-OCR models are available via the Roboflow API and require a Roboflow API key.
Getting a Roboflow API Key
To use GLM-OCR models, you'll need a Roboflow account (free) and API key.
| Model ID | Description |
|---|---|
glm-ocr |
GLM-OCR vision-language model for text recognition |
GPU Required
GLM-OCR uses bfloat16 precision and requires GPU acceleration. CPU inference is not supported.
Supported Backends¶
| Backend | Extras Required |
|---|---|
torch |
torch-cu118, torch-cu124, torch-cu126, torch-cu128, torch-jp6-cu126 |
Roboflow Platform Compatibility¶
| Feature | Supported |
|---|---|
| Training | ❌ Not available for training |
| Upload Weights | ❌ Not supported |
| Serverless API (v2) | ✅ Deploy via hosted API |
| Workflows | ✅ Use in Workflows via LMM block |
| Edge Deployment (Jetson) | ❌ Not supported |
| Self-Hosting | ✅ Deploy with inference-models (GPU required) |
Installation¶
Install with PyTorch GPU extras:
- PyTorch:
torch-cu118,torch-cu124,torch-cu126,torch-cu128,torch-jp6-cu126
Recognition Methods¶
GLM-OCR provides three convenience methods for common OCR tasks, plus a general prompt() method for custom prompts:
recognize_text(images)- General text recognition (uses"Text Recognition:"prompt)recognize_formula(images)- Mathematical formula recognition (uses"Formula Recognition:"prompt)recognize_table(images)- Table structure recognition (uses"Table Recognition:"prompt)prompt(images, prompt="...")- Custom prompt for any recognition task
All methods return List[str] and accept the same optional parameters: max_new_tokens, do_sample, skip_special_tokens.
Usage Examples¶
Text Recognition¶
import cv2
from inference_models import AutoModel
model = AutoModel.from_pretrained(
"glm-ocr",
api_key="your_roboflow_api_key"
)
image = cv2.imread("path/to/image.jpg")
# Using the convenience method
results = model.recognize_text(images=image)
print(f"Recognized text: {results[0]}")
Formula Recognition¶
image = cv2.imread("path/to/equation.png")
results = model.recognize_formula(images=image)
print(f"Formula: {results[0]}")
Table Recognition¶
image = cv2.imread("path/to/table.png")
results = model.recognize_table(images=image)
print(f"Table: {results[0]}")
Custom Prompt¶
image = cv2.imread("path/to/serial_number.png")
results = model.prompt(
images=image,
prompt="Read the serial number in this image:",
max_new_tokens=100
)
print(f"Serial number: {results[0]}")
Output Format¶
GLM-OCR returns a List[str] containing the recognized text from the input images.
- Single image input: Returns a list with one string
- The default prompt is
"Text Recognition:"if none is provided
Performance Tips¶
- Use GPU - GLM-OCR requires GPU with bfloat16 support
- Flash Attention - Automatically enabled on Ampere+ GPUs (compute capability >= 8.0) for faster inference
- Adjust max_new_tokens - Increase for longer text passages, decrease for short labels
- Use convenience methods -
recognize_text(),recognize_formula(),recognize_table()use optimized prompts for each task