GLM-OCR - Vision-Language OCR¶

GLM-OCR is a vision-language model developed by Zhipu AI (ZAI) that excels at optical character recognition using an image-text-to-text architecture. It combines visual understanding with text generation for accurate text recognition.

Overview¶

Resources: Hugging Face Model

GLM-OCR provides high-quality text recognition using a modern vision-language approach. Key features include:

Vision-language architecture - Uses AutoModelForImageTextToText for end-to-end OCR
Prompt-based - Customizable prompts for different recognition tasks
High accuracy - Strong performance on diverse text types
Flash Attention support - Automatic flash attention on supported GPUs (Ampere+)

OCR Type: Unstructured OCR¶

Unstructured OCR recognizes text from an image using a vision-language model. It returns the recognized text as a plain string, guided by the given prompt, without bounding boxes.

When to Use GLM-OCR¶

Serial numbers & labels - Recognizing text from product labels, serial numbers
Scene text - Text in natural images and photos
Document text - Recognizing text from document images
Custom prompts - When you need to guide the model with specific instructions

When to Use Other OCR Models¶

DocTR: Better for full document layout analysis with text detection and bounding boxes
EasyOCR: Better for multi-language scene text with detection and localization
TrOCR: Lighter alternative for simple single-line pre-cropped text

License¶

MIT License

Open Source License

GLM-OCR is released by Zhipu AI (ZAI) under the MIT License, which permits both commercial and non-commercial use provided the copyright and license notice are retained.

Learn more: MIT License

Pre-trained Model IDs¶

Pre-trained GLM-OCR models are available via the Roboflow API and require a Roboflow API key.

Getting a Roboflow API Key

To use GLM-OCR models, you'll need a Roboflow account (free) and API key.

Model ID	Description
`glm-ocr`	GLM-OCR vision-language model for text recognition

GPU Required

GLM-OCR uses bfloat16 precision and requires GPU acceleration. CPU inference is not supported.
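
Because CPU inference is not supported, it can help to check the environment before loading the model. The following is a minimal pre-flight sketch and not part of `inference-models`: the helper name `check_gpu_ready` is hypothetical, and it relies only on PyTorch's `torch.cuda.is_available()` and `torch.cuda.is_bf16_supported()`.

```python
from typing import Tuple


def check_gpu_ready() -> Tuple[bool, str]:
    """Hypothetical helper: report whether a bfloat16-capable CUDA GPU is visible."""
    try:
        import torch  # imported lazily so the check also runs where PyTorch is absent
    except ImportError:
        return False, "PyTorch is not installed"

    if not torch.cuda.is_available():
        return False, "no CUDA device is visible to PyTorch"
    if not torch.cuda.is_bf16_supported():
        return False, "the visible GPU does not report bfloat16 support"
    # All checks passed: return the device name as the detail string
    return True, torch.cuda.get_device_name(0)


ok, detail = check_gpu_ready()
print("GPU ready:" if ok else "GPU not ready:", detail)
```

A passing check only indicates that PyTorch sees a suitable device; it does not guarantee the model fits in GPU memory.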

Supported Backends¶

Backend	Extras Required
`torch`	`torch-cu118`, `torch-cu124`, `torch-cu126`, `torch-cu128`, `torch-jp6-cu126`

Roboflow Platform Compatibility¶

Feature	Supported
Training	❌ Not available for training
Upload Weights	❌ Not supported
Serverless API (v2)	✅ Deploy via hosted API
Workflows	✅ Use in Workflows via LMM block
Edge Deployment (Jetson)	❌ Not supported
Self-Hosting	✅ Deploy with `inference-models` (GPU required)

Installation¶

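Install `inference-models` with the PyTorch GPU extra that matches your CUDA version, using pip's extras syntax (for example, `pip install "inference-models[torch-cu128]"`):

PyTorch: torch-cu118, torch-cu124, torch-cu126, torch-cu128, torch-jp6-cu126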

Recognition Methods¶

GLM-OCR provides three convenience methods for common OCR tasks, plus a general prompt() method for custom prompts:

recognize_text(images) - General text recognition (uses "Text Recognition:" prompt)
recognize_formula(images) - Mathematical formula recognition (uses "Formula Recognition:" prompt)
recognize_table(images) - Table structure recognition (uses "Table Recognition:" prompt)
prompt(images, prompt="...") - Custom prompt for any recognition task

All methods return List[str] and accept the same optional parameters: max_new_tokens, do_sample, skip_special_tokens.
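
The relationship between the convenience methods and `prompt()` described above can be sketched as a prompt lookup. This is an illustration under assumptions, not the library's implementation: `TASK_PROMPTS`, `run_task`, and the `EchoModel` stand-in are hypothetical names, and only the prompt strings and the `prompt(images=..., prompt=...)` call shape come from this page.

```python
from typing import Any, List

# Prompt strings as listed for the three convenience methods
TASK_PROMPTS = {
    "text": "Text Recognition:",
    "formula": "Formula Recognition:",
    "table": "Table Recognition:",
}


def run_task(model: Any, image: Any, task: str, **generation_kwargs: Any) -> List[str]:
    """Hypothetical helper: route a task name to prompt() with the matching prompt."""
    if task not in TASK_PROMPTS:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(TASK_PROMPTS)}")
    return model.prompt(images=image, prompt=TASK_PROMPTS[task], **generation_kwargs)


class EchoModel:
    """Stand-in for the real model so the sketch runs without a GPU."""

    def prompt(self, images: Any, prompt: str, **kwargs: Any) -> List[str]:
        # Echo the prompt and the number of generation options received
        return [f"{prompt} <{len(kwargs)} generation option(s)>"]


print(run_task(EchoModel(), image=None, task="formula", max_new_tokens=256))
```

With the real model, passing a task name chosen at runtime this way avoids branching over three method names; for fixed tasks the convenience methods remain the simpler choice.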

Usage Examples¶

Text Recognition¶

import cv2
from inference_models import AutoModel

model = AutoModel.from_pretrained(
    "glm-ocr",
    api_key="your_roboflow_api_key"
)

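image = cv2.imread("path/to/image.jpg")
if image is None:  # cv2.imread returns None rather than raising when the file cannot be read
    raise FileNotFoundError("Could not read path/to/image.jpg")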

# Using the convenience method
results = model.recognize_text(images=image)
print(f"Recognized text: {results[0]}")

Formula Recognition¶

# Reuses `cv2` and `model` from the Text Recognition example
image = cv2.imread("path/to/equation.png")

results = model.recognize_formula(images=image)
print(f"Formula: {results[0]}")

Table Recognition¶

# Reuses `cv2` and `model` from the Text Recognition example
image = cv2.imread("path/to/table.png")

results = model.recognize_table(images=image)
print(f"Table: {results[0]}")

Custom Prompt¶

# Reuses `cv2` and `model` from the Text Recognition example
image = cv2.imread("path/to/serial_number.png")

results = model.prompt(
    images=image,
    prompt="Read the serial number in this image:",
    max_new_tokens=100
)
print(f"Serial number: {results[0]}")
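
A prompted vision-language model may wrap the value in extra words, so a custom prompt is often paired with a validation step. The sketch below assumes a hypothetical serial format (uppercase alphanumeric groups joined by hyphens); `SERIAL_PATTERN` and `extract_serial` are illustrative names, and the pattern should be adapted to the labels actually in use.

```python
import re
from typing import Optional

# Hypothetical format: uppercase alphanumeric groups joined by hyphens, e.g. "SN-48A1-9921"
SERIAL_PATTERN = re.compile(r"\b[A-Z0-9]+(?:-[A-Z0-9]+)+\b")


def extract_serial(text: str) -> Optional[str]:
    """Return the first substring matching the assumed serial format, or None."""
    match = SERIAL_PATTERN.search(text)
    return match.group(0) if match else None


# Sample string standing in for results[0] from model.prompt(...)
sample_output = "The serial number is SN-48A1-9921."
print(extract_serial(sample_output))
```

Returning `None` when nothing matches makes it straightforward to flag images for manual review instead of storing a malformed value.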

Output Format¶

GLM-OCR returns a List[str] containing the recognized text from the input images.

Single image input: Returns a list with one string
The default prompt is "Text Recognition:" if none is provided
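
Since a single image yields a one-element list, calling code usually unwraps it. A minimal sketch, assuming whitespace trimming is acceptable for the use case; `unwrap_single` is a hypothetical helper, not a library function.

```python
from typing import List


def unwrap_single(results: List[str]) -> str:
    """Hypothetical helper: return the only string in a one-element result list."""
    if len(results) != 1:
        raise ValueError(f"expected exactly one result, got {len(results)}")
    # Trimming surrounding whitespace is a choice made for this sketch
    return results[0].strip()


# Sample list standing in for the return value of model.recognize_text(images=image)
print(unwrap_single(["  LOT 2291-B\n"]))
```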

Performance Tips¶

Use GPU - GLM-OCR requires a GPU with bfloat16 support
Flash Attention - Automatically enabled on Ampere+ GPUs (compute capability >= 8.0) for faster inference
Adjust max_new_tokens - Increase for longer text passages, decrease for short labels
Use convenience methods - recognize_text(), recognize_formula(), recognize_table() use optimized prompts for each task
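
The Flash Attention threshold above can be expressed as a comparison on the device's compute capability. This sketch only mirrors the documented rule (compute capability >= 8.0); how the library itself makes the decision is not shown here, and `flash_attention_eligible` is a hypothetical name. On a real machine, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple to pass in.

```python
from typing import Tuple


def flash_attention_eligible(capability: Tuple[int, int]) -> bool:
    """Hypothetical check: True when (major, minor) is at least 8.0 (Ampere or newer)."""
    return capability >= (8, 0)


# Example capabilities for three GPU generations
for name, capability in [
    ("Turing (7.5)", (7, 5)),
    ("Ampere (8.0)", (8, 0)),
    ("Ada Lovelace (8.9)", (8, 9)),
]:
    print(name, flash_attention_eligible(capability))
```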