establish_trt_cuda_graph_cache

inference_models.models.common.trt.establish_trt_cuda_graph_cache

establish_trt_cuda_graph_cache(default_cuda_graph_cache_size, cuda_graph_cache=None)

Establish a CUDA graph cache for TensorRT inference acceleration.

Resolves which CUDA graph cache to use for a TRT model. If the caller provides a cache instance, it is returned as-is. Otherwise, the function checks the ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND environment variable to decide whether to create a new cache automatically. When the environment variable is disabled (the default), no cache is created and CUDA graphs are not used.

This function is typically called inside from_pretrained() of TRT model classes. End users who want explicit control should create a TRTCudaGraphCache themselves and pass it to AutoModel.from_pretrained.
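The resolution order described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the stand-in TRTCudaGraphCache class with a `capacity` constructor argument and the exact environment-variable parsing are assumptions.

```python
import os
from typing import Optional


class TRTCudaGraphCache:
    """Stand-in for the real cache class (hypothetical constructor signature)."""

    def __init__(self, capacity: int):
        self.capacity = capacity


def establish_trt_cuda_graph_cache(
    default_cuda_graph_cache_size: int,
    cuda_graph_cache: Optional[TRTCudaGraphCache] = None,
) -> Optional[TRTCudaGraphCache]:
    # A caller-provided cache always wins; the env var is not consulted.
    if cuda_graph_cache is not None:
        return cuda_graph_cache
    # Otherwise the env var decides; it defaults to disabled (assumed parsing).
    flag = os.environ.get("ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND", "False")
    if flag.lower() == "true":
        return TRTCudaGraphCache(capacity=default_cuda_graph_cache_size)
    # Disabled: the model falls back to plain TensorRT execution.
    return None
```

The key design point is that an explicitly passed cache short-circuits the environment check, which lets several models share one cache with a caller-chosen capacity.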

Parameters:

  • default_cuda_graph_cache_size

    (int) –

    Maximum number of CUDA graphs to cache when a new cache is created automatically. Each entry holds a dedicated TensorRT execution context and GPU memory buffers, so higher values increase VRAM usage.

  • cuda_graph_cache

    (Optional[TRTCudaGraphCache], default: None) –

    Optional pre-existing cache instance. When provided, it is returned directly and the environment variable is ignored. This allows callers to share a single cache across multiple models or to configure capacity explicitly.

Returns:

  • Optional[TRTCudaGraphCache] –

    A TRTCudaGraphCache instance if CUDA graphs should be used, or None if they are disabled. When None is returned, the model falls back to standard TensorRT execution without graph capture.

Examples:

Automatic cache creation via environment variable:

>>> import os
>>> os.environ["ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND"] = "True"
>>>
>>> from inference_models.developer_tools import (
...     establish_trt_cuda_graph_cache,
... )
>>>
>>> cache = establish_trt_cuda_graph_cache(default_cuda_graph_cache_size=8)
>>> print(type(cache))  # <class 'TRTCudaGraphCache'>

Caller-provided cache takes priority:

>>> from inference_models.models.common.trt import (
...     TRTCudaGraphCache,
...     establish_trt_cuda_graph_cache,
... )
>>>
>>> my_cache = TRTCudaGraphCache(capacity=32)
>>> result = establish_trt_cuda_graph_cache(
...     default_cuda_graph_cache_size=8,
...     cuda_graph_cache=my_cache,
... )
>>> assert result is my_cache  # returned as-is

Typical usage inside a model's from_pretrained:

>>> cache = establish_trt_cuda_graph_cache(
...     default_cuda_graph_cache_size=8,
...     cuda_graph_cache=None,  # let env var decide
... )
>>> # cache is None when env var is disabled (default)

Note
  • The environment variable ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND defaults to False
  • When a caller-provided cache is given, the environment variable is not checked
  • CUDA graphs require TensorRT and a CUDA-capable GPU
  • Each cached graph consumes VRAM proportional to the model's execution context size
See Also
  • TRTCudaGraphCache: The LRU cache class for CUDA graph state
  • infer_from_trt_engine(): Uses the cache during TRT inference