TensorRT Compilation¶
TensorRT (TRT) is NVIDIA's high-performance deep learning inference optimizer and runtime library. It accelerates model inference on NVIDIA GPUs through layer fusion, precision calibration, and kernel auto-tuning.
Why TensorRT?¶
TensorRT can provide 2-10x faster inference compared to standard frameworks like PyTorch or ONNX Runtime, with lower latency and higher throughput. However, TensorRT compilation is complex:
- Hardware-specific: Engines must be compiled for a specific GPU architecture
- Time-consuming: Compilation can take 10-60 minutes per model
- Technical expertise: Requires understanding of precision modes, batch sizes, and optimization profiles
- Version-sensitive: TensorRT and CUDA versions must match between compilation and runtime
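The hardware and version constraints above can be sketched as a simple pre-deployment check. This is a minimal illustration, not part of any Roboflow or NVIDIA API: `EngineEnvironment` and `find_mismatches` are hypothetical names, the version strings are placeholders, and strict equality is a deliberately conservative assumption about what counts as compatible.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EngineEnvironment:
    """Hypothetical record of the environment an engine was built or run in."""
    compute_capability: Tuple[int, int]  # e.g. (8, 9) for an NVIDIA L4, (7, 5) for a T4
    tensorrt_version: str                # placeholder format, e.g. "10.3.0"
    cuda_version: str                    # placeholder format, e.g. "12.4"


def find_mismatches(build: EngineEnvironment, runtime: EngineEnvironment) -> List[str]:
    """Return a readable description of every field that differs."""
    mismatches = []
    for field in ("compute_capability", "tensorrt_version", "cuda_version"):
        built, current = getattr(build, field), getattr(runtime, field)
        if built != current:
            mismatches.append(f"{field}: built with {built}, running on {current}")
    return mismatches


# Illustrative values only: an engine built on an L4, loaded on a T4.
build_env = EngineEnvironment((8, 9), "10.3.0", "12.4")
runtime_env = EngineEnvironment((7, 5), "10.3.0", "12.4")
for problem in find_mismatches(build_env, runtime_env):
    print(problem)
```

An empty result means the two environments agree on all three fields; any entry in the list is a reason to recompile for the target machine.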
How Roboflow Helps¶
Roboflow's platform and inference-models ecosystem simplify TensorRT deployment with three compilation options:
| Feature | Automatic Compilation (RF Cloud) | On-Demand Compilation (RF Cloud) | Local Compilation |
|---|---|---|---|
| Availability | Paid plans (new models) | Paid plans (existing models) | Early access program |
| When to Use | New models trained on the platform | Models created before auto-compilation | Any model, any GPU architecture |
| GPU Support | Roboflow platform GPUs | L4, T4, L40S | Any NVIDIA GPU |
| Setup Required | None (automatic) | CLI + workspace whitelisting | Early access enrollment |
| Compilation Location | Roboflow cloud (automatic) | Roboflow cloud (on-demand) | Your own hardware |
| Best For | New training workflows | Retroactive compilation | Custom GPU architectures |
1. Automatic Compilation After Training (Paid Plans)¶
For new models on paid Roboflow plans, TensorRT compilation happens automatically after training completes:
- ✅ Automatic optimization - Models are compiled immediately after training
- ✅ GPU-specific engines - Compiled for the GPU devices available on the Roboflow platform
- ✅ Zero configuration - No manual setup or compilation required
- ✅ Production-ready - Optimized engines ready for deployment
New Models Only
Automatic compilation is enabled for new models trained on paid plans. For older models, use the on-demand compilation option below.
All models with a TensorRT backend implementation are supported. See the Models Overview page for the complete list of models with TRT backend support.
2. On-Demand Compilation on Roboflow Platform (Experimental)¶
For models created before automatic compilation was available, the inference-cli provides an on-demand compilation command that triggers TensorRT compilation jobs on Roboflow's cloud infrastructure.
- 🔄 Compile existing models - Retroactively compile models trained before auto-compilation
- ☁️ Cloud-based - Runs on Roboflow's infrastructure
- ⚠️ Limited GPU support - Restricted to GPU types available in Roboflow's cloud (L4, T4, L40S)
Limited GPU Support
On-demand compilation is limited to the GPU types available in Roboflow's cloud infrastructure (currently NVIDIA L4, T4, and L40S).
This feature requires:
- ✅ Paid Roboflow account
- ✅ Workspace whitelisting - Contact Roboflow support to enable this feature for your workspace
See the On-Demand Compilation section below for detailed usage instructions.
3. Local Compilation CLI (Early Access)¶
For customers who need to compile models for any NVIDIA GPU or Jetson device, Roboflow offers a local compilation CLI that enables:
- 🔧 Compile on any NVIDIA GPU - Use your own hardware for TensorRT compilation
- 🎯 Any GPU architecture - Not limited to cloud-available devices
- ☁️ Automatic artifact registration - Compiled engines are uploaded to the Roboflow platform
- 🚀 Seamless deployment - `inference-models` automatically downloads and uses the compiled engines
- 🔒 Compile once, deploy everywhere - Share compiled models across your infrastructure
```bash
pip install inference-cli

inference enterprise inference-compiler compile-model \
    --model-id <project-id>/<version> \
    --api-key <your_api_key>
```
On-Demand Compilation on Roboflow Platform¶
This section gives detailed instructions for using the on-demand cloud compilation feature.
Installation¶
Install the Roboflow CLI with `pip install inference-cli`.
CLI included with inference
If you have inference installed, the CLI is already available.
Basic Usage¶
Compile a model for specific GPU devices:
```bash
inference rf-cloud batch-processing trt-compile \
    --model-id <project-id>/<version> \
    --device nvidia-l4 \
    --device nvidia-t4
```
Command Options¶
| Option | Description | Required |
|---|---|---|
| `--model-id`, `-m` | Model ID to compile (format: workspace/project/version) | ✅ Yes |
| `--device`, `-d` | Target GPU device(s) for compilation | ✅ Yes |
| `--job-id`, `-j` | Custom job identifier (auto-generated if not provided) | ❌ No |
| `--notifications-url` | Webhook URL for job completion notifications | ❌ No |
| `--api-key` | Roboflow API key (uses `ROBOFLOW_API_KEY` env var if not provided) | ❌ No |
Supported Devices¶
Currently supported compilation targets:
- `nvidia-l4` - NVIDIA L4 GPU
- `nvidia-t4` - NVIDIA T4 GPU
- `nvidia-l40s` - NVIDIA L40S GPU
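If you trigger compilation from a script, the options and device identifiers above can be assembled programmatically. The sketch below is one possible approach: `build_trt_compile_command` is a hypothetical helper, not part of `inference-cli`, and it uses only the flags and device names listed in this section. The API key is assumed to come from the `ROBOFLOW_API_KEY` environment variable.

```python
# Device identifiers listed under "Supported Devices".
SUPPORTED_DEVICES = ("nvidia-l4", "nvidia-t4", "nvidia-l40s")


def build_trt_compile_command(model_id, devices, job_id=None, notifications_url=None):
    """Assemble the argv list for `inference rf-cloud batch-processing trt-compile`."""
    if not devices:
        raise ValueError("at least one --device is required")
    unsupported = [d for d in devices if d not in SUPPORTED_DEVICES]
    if unsupported:
        raise ValueError(f"unsupported device(s): {', '.join(unsupported)}")

    command = [
        "inference", "rf-cloud", "batch-processing", "trt-compile",
        "--model-id", model_id,
    ]
    for device in devices:
        command += ["--device", device]  # the flag is repeated once per target
    if job_id is not None:
        command += ["--job-id", job_id]
    if notifications_url is not None:
        command += ["--notifications-url", notifications_url]
    return command


argv = build_trt_compile_command("<project-id>/<version>", ["nvidia-l4", "nvidia-t4"])
print(" ".join(argv))
```

Passing the resulting list to `subprocess.run(argv, check=True)` would submit the job, provided `inference-cli` is installed and the workspace has access to the feature.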
Example: Compile for Multiple Devices¶
```bash
# Set your API key
export ROBOFLOW_API_KEY="your_api_key_here"

# Compile model for L4, T4, and L40S GPUs
inference rf-cloud batch-processing trt-compile \
    --model-id <project-id>/<version> \
    --device nvidia-l4 \
    --device nvidia-t4 \
    --device nvidia-l40s \
    --job-id my-trt-compilation-job
```
Monitoring Compilation Jobs¶
Check the status of your compilation job:
```bash
# List all batch jobs
inference rf-cloud batch-processing list-jobs

# Get details of a specific job
inference rf-cloud batch-processing job-details --job-id my-trt-compilation-job

# View job logs
inference rf-cloud batch-processing logs --job-id my-trt-compilation-job
```
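Because compilation can take a while, you may want to poll the job rather than check it by hand. The following is a rough sketch under stated assumptions: `job_details_command` and `wait_until` are hypothetical helpers, and the output format of `job-details` is not described here, so deciding whether a job has finished is left to a callback you supply.

```python
import time


def job_details_command(job_id):
    """argv for the `job-details` subcommand shown above."""
    return ["inference", "rf-cloud", "batch-processing", "job-details", "--job-id", job_id]


def wait_until(is_finished, interval_s=30.0, timeout_s=3600.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call `is_finished()` every `interval_s` seconds until it returns True.

    Returns True if the check succeeded, or False if waiting another
    interval would exceed `timeout_s`.
    """
    deadline = clock() + timeout_s
    while True:
        if is_finished():
            return True
        if clock() + interval_s > deadline:
            return False
        sleep(interval_s)


print(" ".join(job_details_command("my-trt-compilation-job")))
```

A typical `is_finished` callback would run `job_details_command(...)` with `subprocess.run(..., capture_output=True, text=True)` and inspect the captured text for whatever completion marker your CLI version prints.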
Using Compiled Models¶
Once compilation completes, the TensorRT engine is automatically available when loading your model:
```python
from inference_models import AutoModel

# TRT backend is used automatically when a compiled engine is available
model = AutoModel.from_pretrained(
    "<project-id>/<version>",
    api_key="your_api_key"
)

# Runs on the optimized TensorRT engine
# (`image` is an input image you have already loaded)
results = model(image)
```
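To see whether the compiled engine actually improves latency on your hardware, you can time the model call before and after compilation. The helper below is a generic sketch: `measure_latency_ms` is a hypothetical name, it works with any callable, and it assumes the call returns only after inference has finished. If your backend returns before GPU work completes, the numbers will be misleading.

```python
import statistics
import time


def measure_latency_ms(predict, sample, warmup=5, runs=50):
    """Time `predict(sample)` and return (median, worst) latency in milliseconds."""
    for _ in range(warmup):  # warm-up calls are excluded from the timings
        predict(sample)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings), max(timings)


# Stand-in callable so the sketch runs without a GPU; replace with your model.
median_ms, worst_ms = measure_latency_ms(lambda x: sum(range(1000)), sample=None)
print(f"median {median_ms:.3f} ms, worst {worst_ms:.3f} ms")
```

With a loaded model, the call would be `measure_latency_ms(model, image)`, run once against the baseline backend and once with the TensorRT engine available.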
Getting Access¶
To use TRT compilation via the CLI (options 2 and 3):
- Upgrade to a paid plan - Visit Roboflow Pricing
- Contact support - Email support@roboflow.com to request access for your workspace
- Provide your workspace ID - Include your workspace name in the request
Best Practices¶
- Compile on your deployment hardware - TRT engines are not portable across GPU architectures, so compile on the same hardware (or same compute capability) you will use in production
- Validate accuracy - Verify that the TRT model's output matches your ONNX or PyTorch baseline before deploying
- Plan for compilation time - Large models can take 30-60 minutes to compile
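The accuracy check recommended above can be as simple as comparing numeric outputs from both backends on the same inputs. This is a minimal sketch: `max_abs_difference` and `outputs_match` are hypothetical helpers, outputs are assumed to be flattened into sequences of floats (for example confidences or box coordinates), and the default tolerance is an arbitrary starting point that you should tune for your model and precision mode.

```python
def max_abs_difference(baseline, candidate):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(baseline) != len(candidate):
        raise ValueError(
            f"length mismatch: baseline has {len(baseline)} values, "
            f"candidate has {len(candidate)}"
        )
    return max((abs(b - c) for b, c in zip(baseline, candidate)), default=0.0)


def outputs_match(baseline, candidate, abs_tol=1e-2):
    """True if every candidate value is within `abs_tol` of the baseline."""
    return max_abs_difference(baseline, candidate) <= abs_tol


# Illustrative numbers only, not real model outputs.
baseline_scores = [0.91, 0.10, 0.55]
trt_scores = [0.912, 0.101, 0.548]
print(outputs_match(baseline_scores, trt_scores))
```

Running the comparison over a representative set of images, rather than a single one, gives a more reliable picture of any drift introduced by reduced precision.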