
Conversation

Menkib64 (Contributor) commented Oct 5, 2025

No description provided.

borg323 (Member) commented Oct 8, 2025

While network_onnx.cc and the related cuda files should eventually move to src/network/backends/onnx, I prefer to leave that for a followup to avoid complicating ongoing development.

Very small batches require a separate optimisation: optimising for batch size 1 costs too much performance at larger sizes. Adding a special optimisation for very small batches won't be a simple change, so it should be left for a future change.
borg323 (Member) commented:

It would be nice to somehow have the CUDA part isolated from the pure ONNX, but I don't immediately see a good way to do it. I'll give it a thought.

Menkib64 (Contributor, Author) replied:

#2328 tries to separate EP-specific code using a provider template type. It could probably be improved.
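For context, the provider-template pattern might look roughly like this. This is a minimal sketch of the general idea, not the actual #2328 code; CpuProvider, OnnxComputation, and the method names are hypothetical:

```cpp
#include <vector>

// Sketch only (hypothetical names): EP-specific work is delegated to a
// Provider policy type, so the shared ONNX Runtime code stays free of
// CUDA/TensorRT details.
struct CpuProvider {
  // On CPU the "upload" is a no-op; the data is already host-visible.
  void UploadInputs(const std::vector<float>&) {}
  void DownloadOutputs(std::vector<float>&) {}
};

template <typename Provider>
class OnnxComputation {
 public:
  void ComputeBlocking() {
    provider_.UploadInputs(inputs_);      // EP-specific step.
    // ... run the ONNX Runtime session here (shared code) ...
    provider_.DownloadOutputs(outputs_);  // EP-specific step.
  }

 private:
  Provider provider_;  // e.g. CpuProvider, or a CUDA-backed provider.
  std::vector<float> inputs_;
  std::vector<float> outputs_;
};
```

A CUDA provider would implement the same two methods with cudaMemcpyAsync calls, keeping CUDA headers out of the shared translation unit.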

Copilot AI left a comment

Pull Request Overview

This pull request adds CUDA runtime optimizations for the ONNX backend, improving performance for CUDA and TensorRT execution providers. The changes introduce custom CUDA kernels for input plane expansion, implement resource pooling for InputsOutputs objects, and add proper stream synchronization for asynchronous GPU operations.

  • Implements custom CUDA kernels for expanding input planes on the GPU instead of the CPU (see the kernel sketch below)
  • Adds a resource pooling mechanism to reuse InputsOutputs allocations across computations
  • Introduces proper CUDA stream management, with separate streams for upload, compute, and download operations
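
For background on the first bullet: lc0 encodes each input plane compactly as a 64-bit occupancy mask plus one float value, and expansion turns that into a dense tensor. A minimal sketch of such a kernel, assuming that mask-plus-value layout; the actual expandPlanesOnnx template in onnx.cu may differ in name, signature, and fp16 handling:

```cuda
#include <cstdint>

// One thread per output float: bit i of a plane's 64-bit mask selects
// between the plane's value and zero.
__global__ void ExpandPlanesSketch(float* output, const uint64_t* masks,
                                   const float* values, int n_planes) {
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  if (index >= n_planes * 64) return;
  int plane = index >> 6;   // 64 squares per plane.
  int square = index & 63;  // Square within the plane.
  output[index] = (masks[plane] & (1ULL << square)) ? values[plane] : 0.0f;
}
```

Expanding on the device means only the compact masks and values cross the PCIe bus, and the kernel can run on the upload stream so the dense tensor is ready before inference starts.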

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 13 comments.

Reviewed files:
  • src/neural/backends/network_onnx.cc: main implementation changes, including the InputsOutputs struct, CUDA stream management, resource pooling (a pooling sketch follows this list), and async GPU operations
  • src/neural/backends/cuda/onnx_kernels.h: header declaring the expandPlanesOnnx CUDA kernel template
  • src/neural/backends/cuda/onnx.cu: CUDA kernel implementation for plane expansion
  • meson.build: build configuration to enable CUDA runtime support for the ONNX backend
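
The pooling mentioned above might look roughly like the following. This is a minimal sketch with hypothetical member names, assuming the motivation is that InputsOutputs allocations (e.g. pinned host and device buffers) are expensive to create per request:

```cpp
#include <memory>
#include <mutex>
#include <vector>

struct InputsOutputs { /* e.g. pinned host + device buffers */ };

// Sketch of a simple pool: Acquire() reuses a previously released
// object when one is available, otherwise allocates a fresh one.
class InputsOutputsPool {
 public:
  std::unique_ptr<InputsOutputs> Acquire() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (free_list_.empty()) return std::make_unique<InputsOutputs>();
    auto io = std::move(free_list_.back());
    free_list_.pop_back();
    return io;
  }
  void Release(std::unique_ptr<InputsOutputs> io) {
    std::lock_guard<std::mutex> lock(mutex_);
    free_list_.push_back(std::move(io));
  }

 private:
  std::mutex mutex_;
  std::vector<std::unique_ptr<InputsOutputs>> free_list_;
};
```
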
Comments suppressed due to low confidence (1)

src/neural/backends/network_onnx.cc:1

  • Missing parentheses around the logical OR expression. This condition will be parsed as (get_option('cudnn')) or (get_option('plain_cuda') and cu_dart.found() and nvcc.found()) due to operator precedence. It should be (get_option('cudnn') or get_option('plain_cuda')) and cu_dart.found() and nvcc.found() to ensure both conditions require CUDA runtime and nvcc.
(quoted context, line 1 of the file: /*)


Comment on lines +684 to +686
(batch_size < 0 ? std::to_string(batch_size)
: std::to_string(batch_size - batch_size_ + 1) + "-" +
std::to_string(batch_size)) +
Copilot AI commented Oct 30, 2025:

The ternary expression is confusing. When batch_size < 0, it just converts the value to a string, but when it is positive, it builds a range string. Consider extracting this into a helper function or adding a comment explaining the logic (e.g., negative means variable batch size, positive means a fixed range), as sketched below.
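
A sketch of the extraction Copilot suggests (the helper name is hypothetical; batch_size_ from the surrounding code is passed in as step):

```cpp
#include <string>

// Negative batch_size denotes a variable batch size and is printed
// as-is; a positive one is printed as the "min-max" range covered by
// the current step.
std::string BatchRangeString(int batch_size, int step) {
  if (batch_size < 0) return std::to_string(batch_size);
  return std::to_string(batch_size - step + 1) + "-" +
         std::to_string(batch_size);
}
```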

borg323 merged commit 3e49976 into LeelaChessZero:master on Nov 4, 2025
4 checks passed