TensorRT Compilation

TensorRT (TRT) is NVIDIA's high-performance deep learning inference optimizer and runtime library. It dramatically accelerates model inference on NVIDIA GPUs through layer fusion, precision calibration, and kernel auto-tuning.

Why TensorRT?

TensorRT can provide 2-10x faster inference compared to standard frameworks like PyTorch or ONNX Runtime, with lower latency and higher throughput. However, TensorRT compilation is complex:

  • Hardware-specific: Engines must be compiled for specific GPU architectures
  • Time-consuming: Compilation can take 10-60 minutes per model
  • Technical expertise: Requires understanding of precision modes, batch sizes, and optimization profiles
  • Version compatibility: TensorRT versions and CUDA versions must align (see the version check sketched after this list)
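
A quick way to confirm version alignment before compiling is to print the relevant versions from Python. This is a minimal sketch; it assumes the tensorrt and torch packages are installed in your environment:

# Minimal environment check: the TensorRT build must match your CUDA toolkit.
# Assumes the tensorrt and torch packages are installed locally.
import tensorrt
import torch

print("TensorRT version:", tensorrt.__version__)
print("CUDA version (PyTorch build):", torch.version.cuda)
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none detected")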

How Roboflow Helps

Roboflow's platform and inference-models ecosystem simplify TensorRT deployment with three compilation options:

| Feature | Automatic Compilation (RF Cloud) | On-Demand Compilation (RF Cloud) | Local Compilation |
| --- | --- | --- | --- |
| Availability | Paid plans (new models) | Paid plans (existing models) | Early access program |
| When to Use | New models trained on platform | Models created before auto-compilation | Any model, any GPU architecture |
| GPU Support | Roboflow platform GPUs | Limited (L4, T4 only) | Any NVIDIA GPU |
| Setup Required | None (automatic) | CLI + workspace whitelisting | Early access enrollment |
| Compilation Location | Roboflow cloud (automatic) | Roboflow cloud (on-demand) | Your own hardware |
| Best For | New training workflows | Retroactive compilation | Custom GPU architectures |

1. Automatic Compilation After Training (Paid Plans)

For new models on paid Roboflow plans, TensorRT compilation happens automatically after training completes:

  • Automatic optimization - Models are compiled immediately after training
  • GPU-specific engines - Compiled for GPU devices used on the Roboflow platform
  • Zero configuration - No manual setup or compilation required
  • Production-ready - Optimized engines ready for deployment

New Models Only

Automatic compilation is enabled for new models trained on paid plans. For older models, use the on-demand compilation option below.

All models with TensorRT backend implementation are supported. See the Models Overview page for the complete list of models with TRT backend support.

2. On-Demand Compilation on Roboflow Platform (Experimental)

For models created before automatic compilation was enabled, the inference-cli provides an on-demand compilation command that triggers TensorRT compilation jobs on Roboflow's cloud infrastructure.

  • 🔄 Compile existing models - Retroactively compile models trained before auto-compilation
  • ☁️ Cloud-based - Runs on Roboflow's infrastructure
  • ⚠️ Limited GPU support - Only for GPU devices available in Roboflow's cloud (L4, T4)

Limited GPU Support

On-demand compilation only works for the limited set of GPU devices available in Roboflow's cloud infrastructure (currently NVIDIA L4 and T4).

This feature requires:

  • Paid Roboflow account
  • Workspace whitelisting - Contact Roboflow support to enable this feature for your workspace

See the On-Demand Compilation section below for detailed usage instructions.

3. Local Compilation CLI (Early Access)

For customers who need to compile models for any NVIDIA GPU, Roboflow offers a local compilation CLI that enables:

  • 🔧 Compile on any NVIDIA GPU - Use your own hardware for TensorRT compilation
  • 🎯 Any GPU architecture - Not limited to cloud-available devices
  • ☁️ Automatic artifact registration - Compiled engines are uploaded to Roboflow platform
  • 🚀 Seamless deployment - inference-models automatically downloads and uses registered engines
  • 🔒 Compile once, use everywhere - Share compiled models across your infrastructure

Early Access Program

The local compilation CLI is currently closed-source and available through our early access program.

Interested? Contact support@roboflow.com to join the early access program and get access to local compilation tools.


On-Demand Compilation on Roboflow Platform

This section provides detailed instructions for using the on-demand compilation feature described in option 2 above.

Installation

Install the Roboflow CLI:

# With uv
uv pip install inference-cli

# Or with pip
pip install inference-cli

CLI included with inference

If you have inference installed, the CLI is already available.
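
To confirm the CLI is available on your PATH, list its commands with the built-in help (the inference entry point is provided by inference-cli):

inference --help
inference rf-cloud batch-processing --help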

Basic Usage

Compile a model for specific GPU devices:

inference rf-cloud batch-processing trt-compile \
    --model-id <project-id>/<version> \
    --device nvidia-l4 \
    --device nvidia-t4

Command Options

| Option | Description | Required |
| --- | --- | --- |
| --model-id, -m | Model ID to compile (format: workspace/project/version) | ✅ Yes |
| --device, -d | Target GPU device(s) for compilation | ✅ Yes |
| --job-id, -j | Custom job identifier (auto-generated if not provided) | ❌ No |
| --notifications-url | Webhook URL for job completion notifications (a receiver sketch follows this table) | ❌ No |
| --api-key | Roboflow API key (uses the ROBOFLOW_API_KEY env var if not provided) | ❌ No |
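
If you pass --notifications-url, the compilation job sends a notification (typically an HTTP POST) to that URL when it completes. Below is a minimal sketch of a receiver using only the Python standard library; the payload shape is not documented here, so it is treated as opaque JSON and simply printed:

# Minimal webhook receiver for --notifications-url (sketch).
# The notification payload structure is an assumption; inspect what
# your jobs actually send before depending on specific fields.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class NotificationHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        print("Compilation job notification:", payload)
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), NotificationHandler).serve_forever()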

Supported Devices

Currently supported compilation targets:

  • nvidia-l4 - NVIDIA L4 GPU
  • nvidia-t4 - NVIDIA T4 GPU

Example: Compile for Multiple Devices

# Set your API key
export ROBOFLOW_API_KEY="your_api_key_here"

# Compile model for both L4 and T4 GPUs
inference rf-cloud batch-processing trt-compile \
    --model-id <project-id>/<version> \
    --device nvidia-l4 \
    --device nvidia-t4 \
    --job-id my-trt-compilation-job

Monitoring Compilation Jobs

Check the status of your compilation job:

# List all batch jobs
inference rf-cloud batch-processing list-jobs

# Get details of a specific job
inference rf-cloud batch-processing job-details --job-id my-trt-compilation-job

# View job logs
inference rf-cloud batch-processing logs --job-id my-trt-compilation-job
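
If you prefer to wait in the terminal until the job finishes, a simple shell loop over the job-details command shown above works; the 60-second interval is an arbitrary choice:

# Re-check job status every 60 seconds (stop with Ctrl+C when done)
while true; do
    inference rf-cloud batch-processing job-details --job-id my-trt-compilation-job
    sleep 60
done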

Using Compiled Models

Once compilation completes, the TensorRT engines are automatically available when deploying your model:

import cv2

from inference_models import AutoModel

# Load the model - the TensorRT backend is used automatically if a
# compiled engine is available for this model and GPU
model = AutoModel.from_pretrained(
    "<project-id>/<version>",
    api_key="your_api_key",
)

# Read an input image and run inference on the optimized TensorRT engine
image = cv2.imread("path/to/image.jpg")
results = model(image)
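
To sanity-check the speedup on your own hardware, you can time repeated inference calls, for example before and after the TensorRT engine becomes available. A rough sketch, reusing the model and image objects from the snippet above; it ignores pre- and post-processing differences:

import time

# Warm-up: the first calls can include engine loading and memory allocation
for _ in range(10):
    model(image)

# Time repeated inference calls and report the average latency
n = 100
start = time.perf_counter()
for _ in range(n):
    model(image)
elapsed = time.perf_counter() - start
print(f"Average latency: {1000 * elapsed / n:.2f} ms ({n / elapsed:.1f} FPS)")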

Getting Access

To use TRT compilation via CLI (options 2 and 3):

  1. Upgrade to a paid plan - Visit Roboflow Pricing
  2. Contact support - Email support@roboflow.com to request workspace whitelisting
  3. Provide workspace ID - Include your workspace name in the request

Best Practices

  1. Compile for your deployment GPU - Ensure you compile for the same GPU architecture you'll use in production
  2. Test before production - Validate TRT model accuracy matches your ONNX/PyTorch baseline
  3. Monitor compilation time - Large models can take 30-60 minutes to compile
