establish_trt_cuda_graph_cache

inference_models.models.common.trt.establish_trt_cuda_graph_cache
Establish a CUDA graph cache for TensorRT inference acceleration.
Resolves which CUDA graph cache to use for a TRT model. If the caller
provides a cache instance, it is returned as-is. Otherwise, the function
checks the ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND environment variable
to decide whether to create a new cache automatically. When the environment
variable is disabled (the default), no cache is created and CUDA graphs
are not used.
This function is typically called inside from_pretrained() of TRT model
classes. End users who want explicit control should create a
TRTCudaGraphCache themselves and pass it to AutoModel.from_pretrained.
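The resolution order described above (caller-provided cache first, then the environment variable, then no cache) can be sketched as follows. This is an illustrative approximation, not the library's actual implementation; the stand-in `TRTCudaGraphCache` class and the accepted truthy values for the environment variable are assumptions.

```python
import os
from typing import Optional


class TRTCudaGraphCache:
    """Illustrative stand-in for the real LRU cache class."""

    def __init__(self, capacity: int):
        self.capacity = capacity


def establish_trt_cuda_graph_cache(
    default_cuda_graph_cache_size: int,
    cuda_graph_cache: Optional[TRTCudaGraphCache] = None,
) -> Optional[TRTCudaGraphCache]:
    # A caller-provided cache always wins; the env var is not consulted.
    if cuda_graph_cache is not None:
        return cuda_graph_cache
    # Otherwise the environment variable decides; it defaults to disabled.
    flag = os.environ.get("ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND", "False")
    if flag.lower() in ("true", "1"):
        return TRTCudaGraphCache(capacity=default_cuda_graph_cache_size)
    # Disabled: the model falls back to plain TensorRT execution.
    return None
```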
Parameters:

- default_cuda_graph_cache_size (int) – Maximum number of CUDA graphs to cache when a new cache is created automatically. Each entry holds a dedicated TensorRT execution context and GPU memory buffers, so higher values increase VRAM usage.
- cuda_graph_cache (Optional[TRTCudaGraphCache], default: None) – Optional pre-existing cache instance. When provided, it is returned directly and the environment variable is ignored. This allows callers to share a single cache across multiple models or to configure capacity explicitly.
Returns:

- Optional[TRTCudaGraphCache] – A TRTCudaGraphCache instance if CUDA graphs should be used, or None if they are disabled. When None is returned, the model falls back to standard TensorRT execution without graph capture.
Examples:
Automatic cache creation via environment variable:
>>> import os
>>> os.environ["ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND"] = "True"
>>>
>>> from inference_models.developer_tools import (
... establish_trt_cuda_graph_cache,
... )
>>>
>>> cache = establish_trt_cuda_graph_cache(default_cuda_graph_cache_size=8)
>>> print(type(cache)) # <class 'TRTCudaGraphCache'>
Caller-provided cache takes priority:
>>> from inference_models.models.common.trt import (
... TRTCudaGraphCache,
... establish_trt_cuda_graph_cache,
... )
>>>
>>> my_cache = TRTCudaGraphCache(capacity=32)
>>> result = establish_trt_cuda_graph_cache(
... default_cuda_graph_cache_size=8,
... cuda_graph_cache=my_cache,
... )
>>> assert result is my_cache # returned as-is
Typical usage inside a model's from_pretrained:
>>> cache = establish_trt_cuda_graph_cache(
... default_cuda_graph_cache_size=8,
... cuda_graph_cache=None, # let env var decide
... )
>>> # cache is None when env var is disabled (default)
Note
- The environment variable ENABLE_AUTO_CUDA_GRAPHS_FOR_TRT_BACKEND defaults to False
- When a caller-provided cache is given, the environment variable is not checked
- CUDA graphs require TensorRT and a CUDA-capable GPU
- Each cached graph consumes VRAM proportional to the model's execution context size
See Also
- TRTCudaGraphCache: The LRU cache class for CUDA graph state
- infer_from_trt_engine(): Uses the cache during TRT inference