Onnx memory management and onnx-trt optimisation #2307
Conversation
Very small batches require a separate optimisation. Optimising for batch size 1 costs too much performance for small sizes. Adding a special optimisation for very small batches won't be a simple change, so it should be left for a future change.
Force-pushed from 5b4afd0 to bb45a9f.
It would be nice to somehow have the CUDA part isolated from the pure ONNX, but I don't immediately see a good way to do it. I'll give it a thought.
#2328 tries to separate EP-specific code using a provider template type. It could probably be improved.
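For context, one way such a split might look is sketched below. It is purely illustrative; the names `CpuProvider`, `CudaProvider`, and `OnnxComputation` are hypothetical and this is not the code from #2328.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical provider types: each one owns its EP-specific details.
struct CpuProvider {
  // Plain ONNX path: tensors stay in host memory, nothing to synchronize.
  static void UploadInputs(const float* src, float* dst, size_t count) {
    std::memcpy(dst, src, count * sizeof(float));
  }
  static void Synchronize() {}
};

struct CudaProvider {
  // CUDA/TensorRT path: async copies, GPU plane expansion, and stream
  // synchronization would live here (bodies elided in this sketch).
  static void UploadInputs(const float*, float*, size_t) {}
  static void Synchronize() {}
};

// The shared computation code is written once against the provider type,
// so EP-specific branches don't leak into the pure ONNX logic.
template <typename Provider>
class OnnxComputation {
 public:
  void ComputeBlocking(const float* input, float* staging, size_t count) {
    Provider::UploadInputs(input, staging, count);
    // ... run the ONNX Runtime session on the bound tensors ...
    Provider::Synchronize();
  }
};
```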
Pull Request Overview
This pull request adds CUDA runtime optimizations for the ONNX backend, improving performance for CUDA and TensorRT execution providers. The changes introduce custom CUDA kernels for input plane expansion, implement resource pooling for InputsOutputs objects, and add proper stream synchronization for asynchronous GPU operations.
- Implements custom CUDA kernels for expanding input planes on GPU instead of on CPU
- Adds resource pooling mechanism to reuse InputsOutputs allocations across computations
- Introduces proper CUDA stream management with separate streams for upload, compute, and download operations (see the sketch after this list)
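The sketch below illustrates the pooled-buffers-plus-three-streams pattern described above. It is a rough, self-contained example with assumed names (`IoBuffers`, `IoPool`, `RunOneBatch`), not the implementation in this PR.

```cpp
#include <cuda_runtime.h>

#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical buffer bundle: pinned host staging memory plus a device
// buffer for the expanded input planes.
struct IoBuffers {
  float* host_input = nullptr;
  float* device_input = nullptr;
  size_t bytes = 0;

  explicit IoBuffers(size_t count) : bytes(count * sizeof(float)) {
    cudaHostAlloc(reinterpret_cast<void**>(&host_input), bytes,
                  cudaHostAllocDefault);
    cudaMalloc(reinterpret_cast<void**>(&device_input), bytes);
  }
  ~IoBuffers() {
    cudaFreeHost(host_input);
    cudaFree(device_input);
  }
};

// Hypothetical pool that hands buffers back out instead of reallocating
// them for every computation.
class IoPool {
 public:
  std::unique_ptr<IoBuffers> Acquire(size_t count) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (!free_.empty()) {
      auto io = std::move(free_.back());
      free_.pop_back();
      return io;
    }
    return std::make_unique<IoBuffers>(count);
  }
  void Release(std::unique_ptr<IoBuffers> io) {
    std::lock_guard<std::mutex> lock(mutex_);
    free_.push_back(std::move(io));
  }

 private:
  std::mutex mutex_;
  std::vector<std::unique_ptr<IoBuffers>> free_;
};

// One batch: upload, compute, and download each run on their own stream,
// chained together with events so work can overlap across batches.
void RunOneBatch(IoPool& pool, size_t count) {
  cudaStream_t upload, compute, download;
  cudaStreamCreate(&upload);
  cudaStreamCreate(&compute);
  cudaStreamCreate(&download);
  cudaEvent_t uploaded, computed;
  cudaEventCreate(&uploaded);
  cudaEventCreate(&computed);

  auto io = pool.Acquire(count);
  // 1. Copy the raw planes to the GPU asynchronously.
  cudaMemcpyAsync(io->device_input, io->host_input, io->bytes,
                  cudaMemcpyHostToDevice, upload);
  cudaEventRecord(uploaded, upload);
  // 2. The compute stream waits for the upload; the plane-expansion kernel
  //    and the ONNX Runtime session would be launched on `compute` here.
  cudaStreamWaitEvent(compute, uploaded, 0);
  cudaEventRecord(computed, compute);
  // 3. The download stream waits for compute, then copies results back;
  //    the CPU only blocks at the very end.
  cudaStreamWaitEvent(download, computed, 0);
  cudaStreamSynchronize(download);

  pool.Release(std::move(io));
  cudaEventDestroy(uploaded);
  cudaEventDestroy(computed);
  cudaStreamDestroy(upload);
  cudaStreamDestroy(compute);
  cudaStreamDestroy(download);
}
```

Pinned host memory is what lets `cudaMemcpyAsync` actually overlap with compute, and the events chain the three streams together without blocking the CPU between steps.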
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| src/neural/backends/network_onnx.cc | Main implementation changes including InputsOutputs struct, CUDA stream management, resource pooling, and async GPU operations |
| src/neural/backends/cuda/onnx_kernels.h | Header declaring the expandPlanesOnnx CUDA kernel template |
| src/neural/backends/cuda/onnx.cu | CUDA kernel implementation for plane expansion |
| meson.build | Build configuration to enable CUDA runtime support for ONNX backend |
Comments suppressed due to low confidence (1)
src/neural/backends/network_onnx.cc:1
- Missing parentheses around the logical OR expression. Due to operator precedence, this condition will be parsed as `(get_option('cudnn')) or (get_option('plain_cuda') and cu_dart.found() and nvcc.found())`. It should be `(get_option('cudnn') or get_option('plain_cuda')) and cu_dart.found() and nvcc.found()` to ensure both conditions require the CUDA runtime and nvcc.
(batch_size < 0 ? std::to_string(batch_size)
                : std::to_string(batch_size - batch_size_ + 1) + "-" +
                  std::to_string(batch_size)) +
Copilot AI (Oct 30, 2025)
The ternary expression is confusing. When batch_size < 0 it just converts it to a string, but when positive it creates a range string. Consider extracting this into a helper function or adding a comment explaining the logic (e.g. negative means variable batch size, positive means a fixed range), as sketched below.
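A hypothetical helper along those lines (the name `BatchSizeLabel` and the `step` parameter are illustrative, not from the PR; the reading of negative values as "variable batch size" follows the comment above) might look like:

```cpp
#include <string>

// Formats the batch-size label: a negative batch_size is printed as-is
// (variable batch size), a positive one as the fixed range
// [batch_size - step + 1, batch_size] covered by the optimisation profile.
std::string BatchSizeLabel(int batch_size, int step) {
  if (batch_size < 0) return std::to_string(batch_size);
  return std::to_string(batch_size - step + 1) + "-" +
         std::to_string(batch_size);
}
```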