TensorRT Compilation¶
TensorRT (TRT) is NVIDIA's high-performance deep learning inference optimizer and runtime library. It dramatically accelerates model inference on NVIDIA GPUs through layer fusion, precision calibration, and kernel auto-tuning.
Why TensorRT?¶
TensorRT can provide 2-10x faster inference compared to standard frameworks like PyTorch or ONNX Runtime, with lower latency and higher throughput. However, TensorRT compilation is complex:
- Hardware-specific: Engines must be compiled for specific GPU architectures
- Time-consuming: Compilation can take 10-60 minutes per model
- Technical expertise: Requires understanding of precision modes, batch sizes, and optimization profiles
- Version compatibility: TensorRT versions and CUDA versions must align
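The version-alignment constraint can be made concrete: a serialized TensorRT engine is in general only safe to load with the TensorRT release that built it. A minimal guard might look like the sketch below (the major.minor check is a simplified rule of thumb, not NVIDIA's exact compatibility policy, which varies by TRT release):

```python
def engines_compatible(build_version: str, runtime_version: str) -> bool:
    """Conservative guard: treat a serialized TensorRT engine as loadable
    only when the runtime matches the builder's major.minor release.
    (Simplified assumption; real compatibility rules vary by TRT version.)
    """
    def major_minor(version: str) -> tuple:
        parts = version.split(".")
        return int(parts[0]), int(parts[1])

    return major_minor(build_version) == major_minor(runtime_version)

print(engines_compatible("10.0.1", "10.0.3"))  # True: same 10.0 release line
print(engines_compatible("8.6.1", "10.0.1"))   # False: engine must be rebuilt
```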
How Roboflow Helps¶
Roboflow's platform and inference-models ecosystem simplify TensorRT deployment with three compilation options:
| Feature | Automatic Compilation (RF Cloud) | On-Demand Compilation (RF Cloud) | Local Compilation |
|---|---|---|---|
| Availability | Paid plans (new models) | Paid plans (existing models) | Early access program |
| When to Use | New models trained on platform | Models created before auto-compilation | Any model, any GPU architecture |
| GPU Support | Roboflow platform GPUs | Limited (L4, T4 only) | Any NVIDIA GPU |
| Setup Required | None (automatic) | CLI + workspace whitelisting | Early access enrollment |
| Compilation Location | Roboflow cloud (automatic) | Roboflow cloud (on-demand) | Your own hardware |
| Best For | New training workflows | Retroactive compilation | Custom GPU architectures |
1. Automatic Compilation After Training (Paid Plans)¶
For new models on paid Roboflow plans, TensorRT compilation happens automatically after training completes:
- ✅ Automatic optimization - Models are compiled immediately after training
- ✅ GPU-specific engines - Compiled for GPU devices used on the Roboflow platform
- ✅ Zero configuration - No manual setup or compilation required
- ✅ Production-ready - Optimized engines ready for deployment
New Models Only
Automatic compilation is enabled for new models trained on paid plans. For older models, use the on-demand compilation option below.
All models with TensorRT backend implementation are supported. See the Models Overview page for the complete list of models with TRT backend support.
2. On-Demand Compilation on Roboflow Platform (Experimental)¶
For models created before automatic compilation was enabled, the inference-cli provides an on-demand compilation command that triggers TensorRT compilation jobs on Roboflow's cloud infrastructure.
- 🔄 Compile existing models - Retroactively compile models trained before auto-compilation
- ☁️ Cloud-based - Runs on Roboflow's infrastructure
- ⚠️ Limited GPU support - Only for GPU devices available in Roboflow's cloud (L4, T4)
Limited GPU Support
On-demand compilation only works for limited types of GPU devices available in Roboflow's cloud infrastructure (currently NVIDIA L4 and T4).
This feature requires:
- ✅ Paid Roboflow account
- ✅ Workspace whitelisting - Contact Roboflow support to enable this feature for your workspace
See the On-Demand Compilation section below for detailed usage instructions.
3. Local Compilation CLI (Early Access)¶
For customers who need to compile models for any NVIDIA GPU, Roboflow offers a local compilation CLI that enables:
- 🔧 Compile on any NVIDIA GPU - Use your own hardware for TensorRT compilation
- 🎯 Any GPU architecture - Not limited to cloud-available devices
- ☁️ Automatic artifact registration - Compiled engines are uploaded to Roboflow platform
- 🚀 Seamless deployment - `inference-models` automatically downloads and uses registered engines
- 🔒 Compile once, use everywhere - Share compiled models across your infrastructure
Early Access Program
The local compilation CLI is currently closed-source and available through our early access program.
Interested? Contact support@roboflow.com to join the early access program and get access to local compilation tools.
On-Demand Compilation on Roboflow Platform¶
This section provides detailed instructions for using the on-demand compilation feature described in option 2 above.
Installation¶
Install the Roboflow CLI:
CLI included with inference
If you have inference installed, the CLI is already available.
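Assuming the standard Roboflow distribution channel on PyPI, the standalone CLI can be installed with pip:

```shell
# Install the standalone Roboflow CLI
# (skip this if the inference package is already installed)
pip install inference-cli
```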
Basic Usage¶
Compile a model for specific GPU devices:
```bash
inference rf-cloud batch-processing trt-compile \
    --model-id <project-id>/<version> \
    --device nvidia-l4 \
    --device nvidia-t4
```
Command Options¶
| Option | Description | Required |
|---|---|---|
| `--model-id`, `-m` | Model ID to compile (format: `workspace/project/version`) | ✅ Yes |
| `--device`, `-d` | Target GPU device(s) for compilation | ✅ Yes |
| `--job-id`, `-j` | Custom job identifier (auto-generated if not provided) | ❌ No |
| `--notifications-url` | Webhook URL for job completion notifications | ❌ No |
| `--api-key` | Roboflow API key (uses `ROBOFLOW_API_KEY` env var if not provided) | ❌ No |
Supported Devices¶
Currently supported compilation targets:
- `nvidia-l4` - NVIDIA L4 GPU
- `nvidia-t4` - NVIDIA T4 GPU
Example: Compile for Multiple Devices¶
```bash
# Set your API key
export ROBOFLOW_API_KEY="your_api_key_here"

# Compile model for both L4 and T4 GPUs
inference rf-cloud batch-processing trt-compile \
    --model-id <project-id>/<version> \
    --device nvidia-l4 \
    --device nvidia-t4 \
    --job-id my-trt-compilation-job
```
Monitoring Compilation Jobs¶
Check the status of your compilation job:
```bash
# List all batch jobs
inference rf-cloud batch-processing list-jobs

# Get details of a specific job
inference rf-cloud batch-processing job-details --job-id my-trt-compilation-job

# View job logs
inference rf-cloud batch-processing logs --job-id my-trt-compilation-job
```
Using Compiled Models¶
Once compilation completes, the TensorRT engines are automatically available when deploying your model:
```python
from inference_models import AutoModel

# Load model - the TRT backend is used automatically if a compiled engine exists
model = AutoModel.from_pretrained(
    "<project-id>/<version>",
    api_key="your_api_key",
)

# Inference runs on the optimized TensorRT engine
results = model(image)
```
Getting Access¶
To use TRT compilation via CLI (options 2 and 3):
- Upgrade to a paid plan - Visit Roboflow Pricing
- Contact support - Email support@roboflow.com to request workspace whitelisting
- Provide workspace details - Include your workspace ID or name in the request
Best Practices¶
- Compile for your deployment GPU - Ensure you compile for the same GPU architecture you'll use in production
- Test before production - Validate TRT model accuracy matches your ONNX/PyTorch baseline
- Monitor compilation time - Large models can take 30-60 minutes to compile
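The second practice, validating TRT accuracy against a baseline, can be sketched as a numeric comparison of raw model outputs. Below is a minimal illustration using NumPy and hypothetical detection tensors (the tensor shape and tolerance values are assumptions; FP16 engines typically drift slightly from an FP32 baseline, so tolerances are looser than exact equality):

```python
import numpy as np

def outputs_match(baseline: np.ndarray, trt: np.ndarray,
                  atol: float = 1e-2, rtol: float = 1e-2) -> bool:
    """Check element-wise closeness between baseline and TRT outputs.

    FP16 TensorRT engines usually deviate slightly from FP32 baselines,
    so we compare within a tolerance rather than requiring exact equality.
    """
    return bool(np.allclose(baseline, trt, atol=atol, rtol=rtol))

# Hypothetical raw detection tensors: (batch, boxes, [x, y, w, h, conf]).
baseline_out = np.array([[[0.50, 0.40, 0.20, 0.30, 0.91]]])
trt_out = np.array([[[0.50, 0.40, 0.20, 0.30, 0.90]]])

print(outputs_match(baseline_out, trt_out))  # small FP16-style drift passes
```

For detection models, a stricter validation would also compare post-processed boxes (e.g. via IoU) and end-to-end metrics such as mAP on a held-out set.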