run_onnx_session_via_iobinding
inference_models.models.common.onnx.run_onnx_session_via_iobinding
Run ONNX inference session using IO binding for optimal GPU performance.
Executes an ONNX model using ONNX Runtime's IO binding API, which provides better performance on CUDA devices by avoiding unnecessary memory copies between CPU and GPU. For CPU inference, the function falls back to standard execution.
IO binding allows direct binding of GPU tensors to ONNX Runtime, eliminating the need to copy data to CPU and back. This is particularly beneficial for large models and high-throughput scenarios.
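The CUDA-versus-CPU dispatch described above can be sketched as a small pure-Python predicate. Note that `choose_execution_path` is a hypothetical helper written for illustration, not part of the library's API:

```python
def choose_execution_path(input_devices, output_shape_mapping=None):
    """Sketch of the dispatch logic: IO binding is only worthwhile when
    every input already lives on a CUDA device; otherwise fall back to
    standard execution.

    input_devices: iterable of device strings such as "cuda:0" or "cpu".
    output_shape_mapping: optional dict of output name -> shape; outputs
    can only be pre-allocated when all shapes are fully static (ints).
    """
    devices = list(input_devices)
    if devices and all(d.startswith("cuda") for d in devices):
        # Pre-allocation only helps when static output shapes are known.
        preallocate = bool(output_shape_mapping) and all(
            all(isinstance(dim, int) for dim in shape)
            for shape in output_shape_mapping.values()
        )
        return "iobinding", preallocate
    return "standard", False
```

For example, all-CUDA inputs with static output shapes select IO binding with pre-allocated outputs, while any CPU input selects the standard path.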
Parameters:
- session (InferenceSession) – ONNX Runtime inference session.
- inputs (Dict[str, Tensor]) – Dictionary mapping input names to PyTorch tensors. Tensors can be on CPU or CUDA devices.
- output_shape_mapping (Optional[Dict[str, tuple]], default: None) – Optional dictionary mapping output names to their expected shapes. Used for pre-allocating output buffers on GPU, which improves performance. If not provided, or if an output has a dynamic shape, outputs are allocated dynamically.
Returns:
- List[Tensor] – List of output tensors from the ONNX model, in the order defined by the model's output specification. Tensors are on the same device as the inputs.
Examples:
Run inference with IO binding on GPU:
>>> from inference_models.developer_tools import run_onnx_session_via_iobinding
>>> import onnxruntime as ort
>>> import torch
>>>
>>> session = ort.InferenceSession(
... "model.onnx",
... providers=["CUDAExecutionProvider"]
... )
>>>
>>> inputs = {
... "images": torch.randn(1, 3, 640, 640, device="cuda:0")
... }
>>>
>>> outputs = run_onnx_session_via_iobinding(
... session=session,
... inputs=inputs
... )
>>> # Returns list of tensors on cuda:0
Pre-allocate outputs for better performance:
>>> output_shapes = {
... "output0": (1, 84, 8400), # Detection output shape
... }
>>>
>>> outputs = run_onnx_session_via_iobinding(
... session=session,
... inputs=inputs,
... output_shape_mapping=output_shapes
... )
>>> # Outputs are pre-allocated, avoiding dynamic allocation overhead
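One way to build `output_shape_mapping` is from the session's own output metadata, where ONNX Runtime reports dynamic dimensions as strings (e.g. `"batch"`) or `None`. A minimal sketch, assuming `static_output_shapes` is a hypothetical helper fed with `(name, shape)` pairs as returned by `session.get_outputs()`:

```python
def static_output_shapes(output_specs):
    """Keep only fully static output shapes, suitable for passing as
    output_shape_mapping. Dynamic dimensions (strings or None in ONNX
    Runtime metadata) disqualify an output from pre-allocation.
    """
    mapping = {}
    for name, shape in output_specs:
        if all(isinstance(dim, int) for dim in shape):
            mapping[name] = tuple(shape)
    return mapping
```

In practice the pairs would come from `[(o.name, o.shape) for o in session.get_outputs()]`; outputs omitted from the mapping are simply allocated dynamically.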
CPU inference (automatic fallback):
>>> inputs_cpu = {
... "images": torch.randn(1, 3, 640, 640) # CPU tensor
... }
>>>
>>> outputs = run_onnx_session_via_iobinding(
... session=session,
... inputs=inputs_cpu
... )
>>> # Automatically uses standard execution for CPU
Note
- Automatically casts input types to match model requirements
- Uses IO binding for CUDA devices, standard execution for CPU
- Requires PyCUDA for CUDA execution
- Pre-allocating outputs via output_shape_mapping improves performance
- Handles both static and dynamic output shapes
See Also
- run_onnx_session_with_batch_size_limit(): Higher-level function with batching
- set_onnx_execution_provider_defaults(): Configure execution providers