establish_trt_cuda_graph_cache

inference_models.models.common.trt.establish_trt_cuda_graph_cache

establish_trt_cuda_graph_cache(default_cuda_graph_cache_size, cuda_graph_cache=None)

Establish a CUDA graph cache for TensorRT inference acceleration.

Resolves which CUDA graph cache to use for a TRT model. If the caller provides a cache instance, it is returned as-is. Otherwise, the function checks the ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND environment variable to decide whether to create a new cache automatically. When the environment variable is disabled (the default), no cache is created and CUDA graphs are not used.

This function is typically called inside from_pretrained() of TRT model classes. End users who want explicit control should create a TRTCudaGraphCache themselves and pass it to AutoModel.from_pretrained.
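The resolution order described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the stand-in TRTCudaGraphCache class with a `capacity` constructor argument and the exact environment-variable parsing are assumptions.

```python
import os
from typing import Optional


class TRTCudaGraphCache:
    """Stand-in for the real cache class (hypothetical constructor signature)."""

    def __init__(self, capacity: int):
        self.capacity = capacity


def establish_trt_cuda_graph_cache(
    default_cuda_graph_cache_size: int,
    cuda_graph_cache: Optional[TRTCudaGraphCache] = None,
) -> Optional[TRTCudaGraphCache]:
    # A caller-provided cache always wins; the env var is not consulted.
    if cuda_graph_cache is not None:
        return cuda_graph_cache
    # Otherwise the env var decides; it defaults to disabled (assumed parsing).
    flag = os.environ.get("ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND", "False")
    if flag.lower() == "true":
        return TRTCudaGraphCache(capacity=default_cuda_graph_cache_size)
    # Disabled: the model falls back to plain TensorRT execution.
    return None
```

The key design point is that an explicitly passed cache short-circuits the environment check, which lets several models share one cache with a caller-chosen capacity.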

Parameters:

  • default_cuda_graph_cache_size

    (int) –

    Maximum number of CUDA graphs to cache when a new cache is created automatically. Each entry holds a dedicated TensorRT execution context and GPU memory buffers, so higher values increase VRAM usage.

  • cuda_graph_cache

    (Optional[TRTCudaGraphCache], default: None) –

    Optional pre-existing cache instance. When provided, it is returned directly and the environment variable is ignored. This allows callers to share a single cache across multiple models or to configure capacity explicitly.

Returns:

  • Optional[TRTCudaGraphCache] –

    A TRTCudaGraphCache instance if CUDA graphs should be used, or None if they are disabled. When None is returned, the model falls back to standard TensorRT execution without graph capture.

Examples:

Automatic cache creation via environment variable:

>>> import os
>>> os.environ["ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND"] = "True"
>>>
>>> from inference_models.developer_tools import (
...     establish_trt_cuda_graph_cache,
... )
>>>
>>> cache = establish_trt_cuda_graph_cache(default_cuda_graph_cache_size=8)
>>> print(type(cache))  # <class 'TRTCudaGraphCache'>

Caller-provided cache takes priority:

>>> from inference_models.models.common.trt import (
...     TRTCudaGraphCache,
...     establish_trt_cuda_graph_cache,
... )
>>>
>>> my_cache = TRTCudaGraphCache(capacity=32)
>>> result = establish_trt_cuda_graph_cache(
...     default_cuda_graph_cache_size=8,
...     cuda_graph_cache=my_cache,
... )
>>> assert result is my_cache  # returned as-is

Typical usage inside a model's from_pretrained:

>>> cache = establish_trt_cuda_graph_cache(
...     default_cuda_graph_cache_size=8,
...     cuda_graph_cache=None,  # let env var decide
... )
>>> # cache is None when env var is disabled (default)

Note
  • The environment variable ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND defaults to False
  • When a caller-provided cache is given, the environment variable is not checked
  • CUDA graphs require TensorRT and a CUDA-capable GPU
  • Each cached graph consumes VRAM proportional to the model's execution context size
See Also
  • TRTCudaGraphCache: The LRU cache class for CUDA graph state
  • infer_from_trt_engine(): Uses the cache during TRT inference