Added PredGatherIGemm sparse conv backend. #508
Merged
blackencino merged 10 commits into openvdb:v0.4 from … on Mar 6, 2026
Conversation
Signed-off-by: Christopher Horvath <chorvath@nvidia.com>
sifakis reviewed Mar 6, 2026
sifakis left a comment
The integration of the core IGEMM code looks very clean. I left a few comments on possible optimizations (which I might give a try myself, too).
…onv for ChanOuts that are divisible by 128 Signed-off-by: Christopher Horvath <chorvath@nvidia.com>
Contributor (Author):
I fixed all of the comments from @sifakis and with the TK=128 specialization, it is MUCH faster, almost 2x in most cases.
Contributor (Author):
TK421, why aren't you at your post? TK421!?!
fwilliams pushed a commit that referenced this pull request Mar 6, 2026
# PredGatherIGemm: Alternative Sparse Convolution Backend

## Summary

This PR adds a new sparse convolution backend -- **PredGatherIGemm** -- that uses CUTLASS/CuTe implicit-GEMM (IGEMM) with predicated `cp.async` gather loads on SM80+ (Ampere and later) GPUs. It processes one output NanoVDB leaf node per CTA, using TF32 tensor-core arithmetic for the computation.
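For intuition, here is the per-leaf implicit-GEMM problem shape this design implies, assuming the standard 8x8x8-voxel NanoVDB leaf node. This is an illustrative sketch of the problem geometry only; the kernel's actual internal tiling may differ.

```python
def leaf_igemm_shape(c_in: int, c_out: int, kernel: int):
    """Implicit-GEMM problem one CTA would see for a single 8x8x8 leaf.

    M: output voxels in the leaf, N: output channels,
    K: reduction over input channels times kernel taps.
    """
    leaf_voxels = 8 ** 3  # NanoVDB leaf node = 512 voxels
    return leaf_voxels, c_out, c_in * kernel ** 3

# The benchmarked configuration (Cin=64, Cout=128, 3x3x3 kernel) gives
# a 512 x 128 GEMM with a reduction depth of 64 * 27 = 1728.
m, n, k = leaf_igemm_shape(64, 128, 3)
```

With K in the thousands even for the smallest kernel, each CTA has substantial tensor-core work per leaf, which is consistent with the backend favoring grids where most leaves are present.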
The backend is integrated into the `ConvolutionPlan` framework as a selectable backend (`expert_config={"backend": "pred_gather_igemm"}`), with the existing GatherScatterDefault backend remaining the default.
## Constraints

The PredGatherIGemm backend is intentionally limited in scope compared to the default GatherScatterDefault backend:

- **CUDA only**, requires SM80+ (Ampere or later)
- **Float32 only** (internally promoted to TF32)
- **Forward pass only** -- no transpose, no analytical backward (backward falls back to GatherScatterDefault when used via autograd)
- **Uniform kernel sizes** only: 3, 5, or 7 (x=y=z)
- **Uniform strides** only: 1 or 2 (x=y=z)
- **Channel counts** must be multiples of 32
- **Batch size 1** only

Kernel size and stride are dispatched at compile time using the project's `dispatch` framework, giving 6 total template instantiations.
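The supported domain above can be summarized in a small validation helper. This is an illustrative sketch, not fvdb's actual API -- the function name and signature are hypothetical; only the constraint values come from the PR.

```python
# Compile-time dispatch domain described above:
# kernel size in {3, 5, 7}, stride in {1, 2}, uniform across x=y=z.
VALID_KERNEL_SIZES = (3, 5, 7)
VALID_STRIDES = (1, 2)


def supports_pred_gather_igemm(kernel_size, stride, c_in, c_out,
                               batch_size, dtype="float32"):
    """Hypothetical helper: True if the config is inside the backend's domain."""
    if dtype != "float32":
        return False  # float32 only (internally promoted to TF32)
    if batch_size != 1:
        return False  # batch size 1 only
    if kernel_size not in VALID_KERNEL_SIZES:
        return False  # uniform kernel size: 3, 5, or 7
    if stride not in VALID_STRIDES:
        return False  # uniform stride: 1 or 2
    if c_in % 32 or c_out % 32:
        return False  # channel counts must be multiples of 32
    return True


# 3 kernel sizes x 2 strides = the 6 template instantiations
instantiations = [(k, s) for k in VALID_KERNEL_SIZES for s in VALID_STRIDES]
```

A config failing any of these checks would fall back to GatherScatterDefault, mirroring how the backward pass already falls back under autograd.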
## Performance Characteristics
Benchmarked on SM120 with Cin=64, Cout=128, kernel 3x3x3, stride 1:
| Scenario | PredGatherIGemm | GS + topology | GS (topology cached) |
|---|---|---|---|
| 1M dense (75% leaf occ) | **5.2 ms** | 45.8 ms | 31.2 ms |
| 2M dense (75% leaf occ) | **10.2 ms** | 89.1 ms | 64.9 ms |
| 4M sparse (25% leaf occ) | 32.0 ms | 21.1 ms | **14.8 ms** |
| 8M sparse (10% leaf occ) | 43.6 ms | 8.1 ms | **5.0 ms** |
The IGEMM backend is significantly faster for **dense or near-dense** grids (high leaf-node occupancy), where its one-leaf-per-CTA approach keeps the GPU fully occupied. At low occupancy the per-CTA work becomes sparse and the GatherScatterDefault backend -- which operates on compacted index pairs -- wins decisively.
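A caller could turn this trend into a simple backend-selection heuristic. The sketch below is not part of the PR: the 0.5 threshold is a guess, since the measured crossover lies somewhere between the 25% and 75% occupancy points in the table.

```python
# Illustrative heuristic (hypothetical, not fvdb code): pick a backend
# string from leaf occupancy, following the benchmark trend above.
def choose_backend(leaf_occupancy: float) -> str:
    """leaf_occupancy: fraction of occupied leaf nodes, in [0, 1]."""
    if not 0.0 <= leaf_occupancy <= 1.0:
        raise ValueError("leaf_occupancy must be in [0, 1]")
    if leaf_occupancy >= 0.5:
        return "pred_gather_igemm"     # dense grids: one-leaf-per-CTA wins
    return "gather_scatter_default"    # sparse grids: compacted pairs win
```

In practice the right threshold would need to be measured per architecture and channel configuration; the table above covers only one (Cin=64, Cout=128, SM120) setup.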
## Files Changed

### New files

- `src/fvdb/detail/ops/convolution/PredGatherIGemm.h` -- public header
- `src/fvdb/detail/ops/convolution/PredGatherIGemm.cu` -- CUTLASS IGEMM kernel, CuTe layouts, dispatch table, and entry point
- `src/tests/PredGatherIGemmTest.cu` -- C++ gtests: correctness validation against GatherScatterDefault across all 6 kernel/stride combinations, plus speed comparison benchmarks
- `tests/unit/test_conv_pred_gather_igemm.py` -- Python tests: forward-pass validation against dense PyTorch conv3d ground truth and cross-backend comparison with GatherScatterDefault

### Modified files

- `src/fvdb/GridBatch.h` / `src/fvdb/GridBatch.cpp` -- added static `predGatherIGemmConv` method
- `src/python/Bindings.cpp` -- pybind11 binding for `pred_gather_igemm_conv`
- `fvdb/_fvdb_cpp.pyi` -- type stub for the new binding
- `fvdb/convolution_plan.py` -- `_PredGatherIGemmBackend`, autograd wrapper (`_PredGatherIGemmConvFn`), backend selection logic in `_build_backend`
- `src/CMakeLists.txt` / `src/tests/CMakeLists.txt` -- added new source and test files to the build
## Test Plan

- `ninja PredGatherIGemmTest && ./src/tests/PredGatherIGemmTest` -- runs the C++ gtest suite (correctness + benchmarks)
- `python -m pytest tests/unit/test_conv_pred_gather_igemm.py -v` -- runs the Python test suite (forward-only, TF32-tolerant comparisons against dense ground truth and GatherScatterDefault)
---------
Signed-off-by: Christopher Horvath <chorvath@nvidia.com>