ViT - Classification

Vision Transformer (ViT) applies the transformer architecture directly to sequences of image patches and performs strongly on image classification tasks.

Overview

ViT for classification brings the power of transformers to computer vision. Key features include:

Transformer architecture - Self-attention mechanisms for global context
Patch-based processing - Images divided into patches and processed as sequences
Scalable design - Performance improves with model size and data
Strong transfer learning - Excellent feature representations for fine-tuning
Multiple variants - Different sizes and patch configurations available
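
The patch-based processing listed above can be illustrated with a minimal sketch. It assumes a 224x224 RGB input and a 16-pixel patch size, which match a common ViT configuration but are not necessarily what a given Roboflow-trained model uses. The helper name `image_to_patches` is hypothetical and is not part of `inference-models`. A real ViT would then pass each flattened patch through a learned linear projection and add position embeddings; both steps are omitted here.

```python
import numpy as np


def image_to_patches(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened square patches.

    Hypothetical helper for illustration only; patch_size=16 is an assumption.
    """
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")

    rows, cols = height // patch_size, width // patch_size
    # Split both spatial axes: (rows, patch, cols, patch, C).
    grid = image.reshape(rows, patch_size, cols, patch_size, channels)
    # Bring the two grid axes together: (rows, cols, patch, patch, C).
    grid = grid.transpose(0, 2, 1, 3, 4)
    # One row per patch, each flattened to patch_size * patch_size * C values.
    return grid.reshape(rows * cols, patch_size * patch_size * channels)


image = np.zeros((224, 224, 3), dtype=np.float32)
patches = image_to_patches(image)
print(patches.shape)  # (196, 768)
```

Each of the 196 rows is one patch, so the transformer sees the image as a sequence of 196 tokens rather than a 2D grid.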

License

Apache 2.0

Open Source License

Vision Transformer is licensed under Apache 2.0, which permits both commercial and non-commercial use, provided the license's conditions (such as retaining the copyright and license notices) are met.

Learn more: Apache 2.0 License

Pre-trained Model IDs

ViT models must be trained on Roboflow or uploaded as custom weights. There are no pre-trained public model IDs available for classification tasks.

Custom model ID format: project-url/version (e.g., my-project-abc123/2)
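
As a rough illustration of the `project-url/version` format, the sketch below validates and splits a model ID before it is passed to the loader. The helper `parse_model_id` and the exact character rules in the regular expression are assumptions made for this example; the source only states the overall format and gives one sample ID.

```python
import re
from typing import NamedTuple


class ModelId(NamedTuple):
    project: str
    version: int


# Assumed shape: a URL-style project slug, a slash, then a positive integer version.
_MODEL_ID_PATTERN = re.compile(r"([A-Za-z0-9][A-Za-z0-9_-]*)/([1-9][0-9]*)")


def parse_model_id(model_id: str) -> ModelId:
    """Hypothetical helper: split 'project-url/version' into its two parts."""
    match = _MODEL_ID_PATTERN.fullmatch(model_id)
    if match is None:
        raise ValueError(f"expected 'project-url/version', got {model_id!r}")
    return ModelId(project=match.group(1), version=int(match.group(2)))


print(parse_model_id("my-project-abc123/2"))
# ModelId(project='my-project-abc123', version=2)
```

Checking the ID locally gives a clearer error than a failed download when, for example, the version suffix is missing.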

Supported Backends

Backend	Extras Required
`onnx`	`onnx-cpu`, `onnx-cu12`, `onnx-cu118`, `onnx-jp6-cu126`
`hugging-face`	`torch-cpu`, `torch-cu118`, `torch-cu124`, `torch-cu126`, `torch-cu128`, `torch-jp6-cu126`
`trt`	`trt10`

Roboflow Platform Compatibility

Feature	Supported
Training	✅ Train custom models on Roboflow
Upload Weights	✅ Upload pre-trained weights (guide)
Serverless API (v2)	✅ Deploy via hosted API
Workflows	✅ Use in Workflows via Classification block
Edge Deployment (Jetson)	✅ Deploy on NVIDIA Jetson devices
Self-Hosting	✅ Deploy with `inference-models`

Installation

Install with one of the following extras depending on your backend:

ONNX: onnx-cpu, onnx-cu12, onnx-cu118, onnx-jp6-cu126
TensorRT: trt10 (requires CUDA 12.x)
Hugging Face (PyTorch): torch-cpu, torch-cu118, torch-cu124, torch-cu126, torch-cu128, torch-jp6-cu126
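
The sketch below shows one way to turn a backend and extra into an install command. It assumes the package is installable with pip under the name `inference-models` and that the extras use standard pip extras syntax; neither is stated explicitly on this page. The function `pip_install_command` is hypothetical, and the extras are copied from the Supported Backends table.

```python
# Extras per backend, copied from the Supported Backends table on this page.
BACKEND_EXTRAS = {
    "onnx": ["onnx-cpu", "onnx-cu12", "onnx-cu118", "onnx-jp6-cu126"],
    "hugging-face": [
        "torch-cpu", "torch-cu118", "torch-cu124",
        "torch-cu126", "torch-cu128", "torch-jp6-cu126",
    ],
    "trt": ["trt10"],
}


def pip_install_command(backend: str, extra: str,
                        package: str = "inference-models") -> str:
    """Hypothetical helper: build a pip command for one backend extra.

    The default package name is an assumption, not confirmed by this page.
    """
    if extra not in BACKEND_EXTRAS.get(backend, []):
        raise ValueError(f"{extra!r} is not a listed extra for backend {backend!r}")
    # Quote the requirement so shells do not interpret the square brackets.
    return f'pip install "{package}[{extra}]"'


print(pip_install_command("onnx", "onnx-cpu"))
# pip install "inference-models[onnx-cpu]"
```

Pick the extra that matches your hardware: a `cpu` extra for machines without a GPU, or the `cu` variant matching the installed CUDA version.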

Usage Example

import cv2
from inference_models import AutoModel

# Load your custom model (requires Roboflow API key)
model = AutoModel.from_pretrained(
    "my-project-abc123/2",
    api_key="your_roboflow_api_key"
)
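
# Load the image as a BGR NumPy array; cv2.imread returns None if the file cannot be read
image = cv2.imread("path/to/image.jpg")
if image is None:
    raise FileNotFoundError("Could not read image at path/to/image.jpg")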

# Run inference
prediction = model(image)

# Get top prediction
top_class_id = prediction.class_id[0].item()
top_class = model.class_names[top_class_id]
confidence = prediction.confidence[0][top_class_id].item()

print(f"Class: {top_class}")
print(f"Confidence: {confidence:.2f}")

# Get all class confidences
for idx, class_name in enumerate(model.class_names):
    conf = prediction.confidence[0][idx].item()
    print(f"{class_name}: {conf:.3f}")
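
Building on the loop over all class confidences, here is a small sketch for reporting only the top-k classes. It works on plain Python floats, so it assumes the per-class scores have first been converted to a list, for instance with `prediction.confidence[0].tolist()`. That conversion is an assumption based on the `.item()` calls in the example and is not confirmed by this page. The helper `top_k_classes` is hypothetical.

```python
from typing import List, Sequence, Tuple


def top_k_classes(confidences: Sequence[float],
                  class_names: Sequence[str],
                  k: int = 3) -> List[Tuple[str, float]]:
    """Hypothetical helper: return the k (class_name, confidence) pairs with the highest scores."""
    if len(confidences) != len(class_names):
        raise ValueError("confidences and class_names must have the same length")
    # Sort by confidence, highest first, then keep the first k entries.
    ranked = sorted(zip(class_names, confidences),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Stand-in values; with a real model these would come from the prediction.
scores = [0.05, 0.80, 0.15]
names = ["cat", "dog", "bird"]
print(top_k_classes(scores, names, k=2))
# [('dog', 0.8), ('bird', 0.15)]
```

Showing a short ranked list is often more useful than the single top class when several classes have similar confidence.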