TensorRT Compilation¶

TensorRT (TRT) is NVIDIA's high-performance deep learning inference optimizer and runtime library. It dramatically accelerates model inference on NVIDIA GPUs through layer fusion, precision calibration, and kernel auto-tuning.

Why TensorRT?¶

TensorRT can provide 2-10x faster inference compared to standard frameworks like PyTorch or ONNX Runtime, with lower latency and higher throughput. However, TensorRT compilation is complex:

Hardware-specific: Engines must be compiled for a specific GPU architecture
Time-consuming: Compilation can take 10-60 minutes per model
Technical expertise: Requires understanding of precision modes, batch sizes, and optimization profiles
Version-sensitive: TensorRT and CUDA versions must match between compilation and runtime

How Roboflow Helps¶

Roboflow's platform and inference-models ecosystem simplify TensorRT deployment with three compilation options:

Feature	Automatic Compilation (RF Cloud)	On-Demand Compilation (RF Cloud)	Local Compilation
Availability	Paid plans (new models)	Paid plans (existing models)	Early access program
When to Use	New models trained on the platform	Models created before auto-compilation	Any model, any GPU architecture
GPU Support	Roboflow platform GPUs	L4, T4, L40S	Any NVIDIA GPU
Setup Required	None (automatic)	CLI + workspace whitelisting	Early access enrollment
Compilation Location	Roboflow cloud (automatic)	Roboflow cloud (on-demand)	Your own hardware
Best For	New training workflows	Retroactive compilation	Custom GPU architectures

1. Automatic Compilation After Training (Paid Plans)¶

For new models on paid Roboflow plans, TensorRT compilation happens automatically after training completes:

✅ Automatic optimization - Models are compiled immediately after training
✅ GPU-specific engines - Compiled for the GPU devices available on the Roboflow platform
✅ Zero configuration - No manual setup or compilation required
✅ Production-ready - Optimized engines ready for deployment

New Models Only

Automatic compilation is enabled for new models trained on paid plans. For older models, use the on-demand compilation option below.

All models with TensorRT backend implementation are supported. See the Models Overview page for the complete list of models with TRT backend support.

2. On-Demand Compilation on Roboflow Platform (Experimental)¶

For models created before automatic compilation was available, the inference-cli provides an on-demand compilation command that triggers TensorRT compilation jobs on Roboflow's cloud infrastructure.

🔄 Compile existing models - Retroactively compile models trained before auto-compilation
☁️ Cloud-based - Runs on Roboflow's infrastructure
⚠️ Limited GPU support - Restricted to GPU types available in Roboflow's cloud (L4, T4, L40S)

Limited GPU Support

On-demand compilation is limited to the GPU types available in Roboflow's cloud infrastructure (currently NVIDIA L4, T4, and L40S).

This feature requires:

✅ Paid Roboflow account
✅ Workspace whitelisting - Contact Roboflow support to enable this feature for your workspace

See the On-Demand Compilation section below for detailed usage instructions.

3. Local Compilation CLI (Early Access)¶

For customers who need to compile models for any NVIDIA GPU or Jetson device, Roboflow offers a local compilation CLI that enables:

🔧 Compile on any NVIDIA GPU - Use your own hardware for TensorRT compilation
🎯 Any GPU architecture - Not limited to cloud-available devices
☁️ Automatic artifact registration - Compiled engines are uploaded to the Roboflow platform
🚀 Seamless deployment - inference-models automatically downloads and uses the compiled engines
🔒 Compile once, deploy everywhere - Share compiled models across your infrastructure

pip install inference-cli
inference enterprise inference-compiler compile-model \
    --model-id <project-id>/<version> \
    --api-key <your_api_key>

On-Demand Compilation on Roboflow Platform¶

Detailed instructions for using the on-demand cloud compilation feature.

Installation¶

Install the Roboflow CLI:

uvpip

uv pip install inference-cli

pip install inference-cli

CLI included with inference

If you have inference installed, the CLI is already available.

Basic Usage¶

Compile a model for specific GPU devices:

inference rf-cloud batch-processing trt-compile \
    --model-id <project-id>/<version> \
    --device nvidia-l4 \
    --device nvidia-t4

Command Options¶

Option	Description	Required
`--model-id`, `-m`	Model ID to compile (format: `workspace/project/version`)	✅ Yes
`--device`, `-d`	Target GPU device(s) for compilation	✅ Yes
`--job-id`, `-j`	Custom job identifier (auto-generated if not provided)	❌ No
`--notifications-url`	Webhook URL for job completion notifications	❌ No
`--api-key`	Roboflow API key (uses `ROBOFLOW_API_KEY` env var if not provided)	❌ No

Supported Devices¶

Currently supported compilation targets:

nvidia-l4 - NVIDIA L4 GPU
nvidia-t4 - NVIDIA T4 GPU
nvidia-l40s - NVIDIA L40S GPU

Example: Compile for Multiple Devices¶

# Set your API key
export ROBOFLOW_API_KEY="your_api_key_here"

# Compile model for L4, T4, and L40S GPUs
inference rf-cloud batch-processing trt-compile \
    --model-id <project-id>/<version> \
    --device nvidia-l4 \
    --device nvidia-t4 \
    --device nvidia-l40s \
    --job-id my-trt-compilation-job

Monitoring Compilation Jobs¶

Check the status of your compilation job:

# List all batch jobs
inference rf-cloud batch-processing list-jobs

# Get details of a specific job
inference rf-cloud batch-processing job-details --job-id my-trt-compilation-job

# View job logs
inference rf-cloud batch-processing logs --job-id my-trt-compilation-job

Using Compiled Models¶

Once compilation completes, the TensorRT engine is automatically available when loading your model:

from inference_models import AutoModel

# TRT backend is used automatically when a compiled engine is available
model = AutoModel.from_pretrained(
    "<project-id>/<version>",
    api_key="your_api_key"
)

# Runs on the optimized TensorRT engine
results = model(image)

Getting Access¶

To use TRT compilation via the CLI (options 2 and 3):

Upgrade to a paid plan - Visit Roboflow Pricing
Contact support - Email support@roboflow.com to request access for your workspace
Provide your workspace ID - Include your workspace name in the request

Best Practices¶

Compile on your deployment hardware - TRT engines are not portable across GPU architectures, so compile on the same hardware (or same compute capability) you will use in production
Validate accuracy - Verify that the TRT model's output matches your ONNX or PyTorch baseline before deploying
Plan for compilation time - Large models can take 30-60 minutes to compile