run_onnx_session_with_batch_size_limit

inference_models.models.common.onnx.run_onnx_session_with_batch_size_limit

run_onnx_session_with_batch_size_limit(session, inputs, output_shape_mapping=None, max_batch_size=None, min_batch_size=None)
Run ONNX inference session with automatic batch splitting.
Executes an ONNX model with automatic batching when the input batch size exceeds the maximum supported batch size. Splits large batches into smaller chunks, runs inference on each chunk, and concatenates the results.
This is useful for models with static batch size constraints or to avoid GPU memory issues with large batches.
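The split–run–concatenate pattern described above can be sketched as follows. This is an illustrative standalone sketch, not the library's implementation: NumPy arrays stand in for PyTorch tensors, and `run_fn` is a hypothetical stand-in for the ONNX session call.

```python
import numpy as np

def split_run_concat(inputs, run_fn, max_batch_size=None):
    # Illustrative sketch only: split inputs along the batch (first)
    # dimension, run inference on each chunk, concatenate the results.
    batch_size = next(iter(inputs.values())).shape[0]
    if max_batch_size is None or batch_size <= max_batch_size:
        return run_fn(inputs)
    chunk_results = []
    for start in range(0, batch_size, max_batch_size):
        chunk = {k: v[start:start + max_batch_size] for k, v in inputs.items()}
        chunk_results.append(run_fn(chunk))
    # zip(*...) groups the i-th output of every chunk together so each
    # output position is concatenated across chunks.
    return [np.concatenate(parts, axis=0) for parts in zip(*chunk_results)]

# Stand-in "model": doubles its single input and returns outputs as a list
outputs = split_run_concat(
    {"input": np.ones((10, 3))},
    run_fn=lambda inp: [inp["input"] * 2],
    max_batch_size=4,
)
print(outputs[0].shape)  # (10, 3)
```

A batch of 10 with `max_batch_size=4` yields chunks of 4, 4, and 2; the per-chunk outputs are concatenated back to the full batch size.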
Parameters:

- session (InferenceSession) – ONNX Runtime inference session.
- inputs (Dict[str, Tensor]) – Dictionary mapping input names to PyTorch tensors. All tensors must have the same batch size (first dimension).
- output_shape_mapping (Optional[Dict[str, tuple]], default: None) – Optional dictionary mapping output names to their expected shapes. Used for pre-allocating output buffers. If None, outputs are dynamically allocated.
- max_batch_size (Optional[int], default: None) – Maximum batch size to process at once. If None, or if the input batch size is smaller, the entire batch is processed at once.
- min_batch_size (Optional[int], default: None) – Minimum batch size for the model. If the last chunk is smaller, it is padded to this size. Useful for models with static batch size requirements.
Returns:

- List[Tensor] – List of output tensors from the ONNX model, in the order defined by the model's output specification.
Raises:

- ModelRuntimeError – If input tensors have different batch sizes.
Examples:
Run inference with batch size limit:
>>> from inference_models.developer_tools import run_onnx_session_with_batch_size_limit
>>> import onnxruntime as ort
>>> import torch
>>>
>>> session = ort.InferenceSession("model.onnx")
>>>
>>> # Large batch that exceeds model's max batch size
>>> inputs = {
... "input": torch.randn(100, 3, 640, 640, device="cuda")
... }
>>>
>>> # Process in chunks of 16
>>> outputs = run_onnx_session_with_batch_size_limit(
... session=session,
... inputs=inputs,
... max_batch_size=16
... )
>>> # Returns concatenated results from all chunks
Handle models with static batch size:
>>> # Model requires exactly batch size of 8
>>> inputs = {"input": torch.randn(20, 3, 640, 640, device="cuda")}
>>>
>>> outputs = run_onnx_session_with_batch_size_limit(
... session=session,
... inputs=inputs,
... max_batch_size=8,
... min_batch_size=8 # Pad last chunk to size 8
... )
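The padding step used for static-batch models can be sketched as below. This is an illustrative sketch, not the library's code: `pad_chunk_to_min` is a hypothetical helper showing how an undersized final chunk might be zero-padded to the required batch size, with the real size kept so outputs can be trimmed afterwards.

```python
import numpy as np

def pad_chunk_to_min(chunk, min_batch_size):
    # Illustrative: zero-pad the last chunk up to min_batch_size along
    # the batch dimension; the caller trims outputs back to `real` rows.
    real = chunk.shape[0]
    if real >= min_batch_size:
        return chunk, real
    pad = np.zeros((min_batch_size - real,) + chunk.shape[1:], dtype=chunk.dtype)
    return np.concatenate([chunk, pad], axis=0), real

# A batch of 20 split into chunks of 8 leaves a final chunk of 4
padded, real = pad_chunk_to_min(np.ones((4, 3)), min_batch_size=8)
print(padded.shape, real)  # (8, 3) 4
```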
Note

- Automatically handles batch splitting and result concatenation
- Pads the last chunk if min_batch_size is specified
- Uses run_onnx_session_via_iobinding() internally for efficiency
- All input tensors must have the same batch size
See Also

- run_onnx_session_via_iobinding(): Lower-level ONNX execution
- generate_batch_chunks(): Utility for creating batch chunks