
Added PredGatherIGemm sparse conv backend. #508

Merged
blackencino merged 10 commits into openvdb:v0.4 from blackencino:feature/conv_alternative_backends
Mar 6, 2026

Conversation


@blackencino blackencino commented Mar 5, 2026

PredGatherIGemm: Alternative Sparse Convolution Backend

Summary

This PR adds a new sparse convolution backend -- PredGatherIGemm -- that uses
CUTLASS/CuTe implicit-GEMM (IGEMM) with predicated cp.async gather loads on
SM80+ (Ampere and later) GPUs. It processes one output NanoVDB leaf node per CTA,
using TF32 tensor-core arithmetic for the computation.

The backend is integrated into the ConvolutionPlan framework as a selectable
backend (expert_config={"backend": "pred_gather_igemm"}), with the existing
GatherScatterDefault backend remaining the default.
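To make the opt-in mechanism concrete, here is a minimal sketch of the selection behavior described above: the backend is chosen only when explicitly requested via `expert_config`, and everything else keeps the GatherScatterDefault default. The function and constant names below are illustrative, not the actual code in `fvdb/convolution_plan.py`.

```python
# Hypothetical sketch of the backend-selection behavior described in this PR.
# Not the real _build_backend implementation; names are illustrative only.

_KNOWN_BACKENDS = {"gather_scatter_default", "pred_gather_igemm"}


def select_backend(expert_config=None):
    """Return the backend name a ConvolutionPlan-style builder would pick."""
    config = expert_config or {}
    backend = config.get("backend", "gather_scatter_default")
    if backend not in _KNOWN_BACKENDS:
        raise ValueError(f"unknown convolution backend: {backend!r}")
    return backend
```

With no `expert_config`, callers keep the existing default; passing `{"backend": "pred_gather_igemm"}` opts in to the new backend.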

Constraints

The PredGatherIGemm backend is intentionally limited in scope compared to the
default GatherScatterDefault backend:

  • CUDA only, requires SM80+ (Ampere or later)
  • Float32 only (internally promoted to TF32)
  • Forward pass only -- no transpose, no analytical backward (backward falls
    back to GatherScatterDefault when used via autograd)
  • Uniform kernel sizes only: 3, 5, or 7 (x=y=z)
  • Uniform strides only: 1 or 2 (x=y=z)
  • Channel counts must be multiples of 32
  • Batch size 1 only

Kernel size and stride are dispatched at compile time using the project's
dispatch framework, giving 6 total template instantiations.
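The constraint list and the 6-way dispatch can be sketched together: 3 kernel sizes times 2 strides give the 6 instantiations, and inputs that fail the constraints fall back to the default backend. This is an illustrative Python model of the behavior, not the project's actual C++ `dispatch` framework; the function names are hypothetical.

```python
# Illustrative model of the constraint checks and 6-entry dispatch table
# described above. The real backend instantiates these at compile time.
from itertools import product

# One entry per template instantiation: kernel in {3, 5, 7} x stride in {1, 2}.
DISPATCH_TABLE = {
    (k, s): f"pred_gather_igemm_k{k}_s{s}"
    for k, s in product((3, 5, 7), (1, 2))
}


def eligible(kernel, stride, c_in, c_out, batch, dtype="float32"):
    """Mirror the PR's constraint list for the PredGatherIGemm backend."""
    return (
        dtype == "float32"          # float32 only (promoted to TF32)
        and batch == 1              # batch size 1 only
        and kernel in (3, 5, 7)     # uniform kernel sizes, x=y=z
        and stride in (1, 2)        # uniform strides, x=y=z
        and c_in % 32 == 0          # channel counts must be multiples of 32
        and c_out % 32 == 0
    )


def pick_instantiation(kernel, stride, c_in, c_out, batch):
    if not eligible(kernel, stride, c_in, c_out, batch):
        return None  # caller falls back to GatherScatterDefault
    return DISPATCH_TABLE[(kernel, stride)]
```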

Performance Characteristics

Benchmarked on SM120 with Cin=64, Cout=128, kernel 3x3x3, stride 1:

Scenario                    PredGatherIGemm   GS + topology   GS (topology cached)
1M dense (75% leaf occ)      5.2 ms            45.8 ms         31.2 ms
2M dense (75% leaf occ)     10.2 ms            89.1 ms         64.9 ms
4M sparse (25% leaf occ)    32.0 ms            21.1 ms         14.8 ms
8M sparse (10% leaf occ)    43.6 ms             8.1 ms          5.0 ms

The IGEMM backend is significantly faster for dense or near-dense grids
(high leaf-node occupancy), where its one-leaf-per-CTA approach keeps the GPU
fully occupied. At low occupancy the per-CTA work becomes sparse and the
GatherScatterDefault backend -- which operates on compacted index pairs -- wins
decisively.

Files Changed

New files

  • src/fvdb/detail/ops/convolution/PredGatherIGemm.h -- public header
  • src/fvdb/detail/ops/convolution/PredGatherIGemm.cu -- CUTLASS IGEMM kernel,
    CuTe layouts, dispatch table, and entry point
  • src/tests/PredGatherIGemmTest.cu -- C++ gtests: correctness validation
    against GatherScatterDefault across all 6 kernel/stride combinations, plus
    speed comparison benchmarks
  • tests/unit/test_conv_pred_gather_igemm.py -- Python tests: forward-pass
    validation against dense PyTorch conv3d ground truth and cross-backend
    comparison with GatherScatterDefault

Modified files

  • src/fvdb/GridBatch.h / src/fvdb/GridBatch.cpp -- added static
    predGatherIGemmConv method
  • src/python/Bindings.cpp -- pybind11 binding for pred_gather_igemm_conv
  • fvdb/_fvdb_cpp.pyi -- type stub for the new binding
  • fvdb/convolution_plan.py -- _PredGatherIGemmBackend, autograd wrapper
    (_PredGatherIGemmConvFn), backend selection logic in _build_backend
  • src/CMakeLists.txt / src/tests/CMakeLists.txt -- added new source and test
    files to the build

Test Plan

  • ninja PredGatherIGemmTest && ./src/tests/PredGatherIGemmTest --
    runs the C++ gtest suite (correctness + benchmarks)
  • python -m pytest tests/unit/test_conv_pred_gather_igemm.py -v --
    runs the Python test suite (forward-only, TF32-tolerant comparisons against
    dense ground truth and GatherScatterDefault)
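The "TF32-tolerant" comparisons exist because TF32 keeps only 10 explicit mantissa bits versus float32's 23, so tensor-core results legitimately differ from a float32 reference in the low bits. The sketch below (illustrative, not the test suite's code) emulates TF32 rounding by truncating a float32 mantissa to 10 bits, which motivates using a relative tolerance on the order of 2^-10 rather than float32 epsilon.

```python
# Emulate TF32 precision loss: truncate a float32 mantissa to TF32's
# 10 explicit bits. Illustrative only; not part of the test suite.
import struct


def tf32_truncate(x: float) -> float:
    """Round x to float32, then drop the low 13 of its 23 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # keep sign, exponent, and top 10 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```

Values exactly representable in 10 mantissa bits (e.g. 1.0) pass through unchanged; others pick up a relative error bounded by roughly 2^-11.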

Signed-off-by: Christopher Horvath <chorvath@nvidia.com>
@blackencino blackencino requested a review from a team as a code owner March 5, 2026 07:40
@blackencino blackencino requested review from fwilliams and harrism March 5, 2026 07:40
@blackencino blackencino requested a review from sifakis March 5, 2026 07:49
@blackencino blackencino self-assigned this Mar 5, 2026
@blackencino blackencino changed the title Feature/conv alternative backends Added PredGatherIGemm sparse conv backend. Mar 5, 2026

@sifakis sifakis left a comment


The integration of the core IGEMM code looks very clean. I left a few comments on possible optimizations (which I might give a try myself, too).

…onv for ChanOuts that are divisible by 128

@blackencino
Contributor Author

I addressed all of the comments from @sifakis, and with the TK=128 specialization it is MUCH faster, almost 2x in most cases.

@blackencino
Contributor Author

TK421, why aren't you at your post? TK421!?!


@sifakis sifakis left a comment


This looks great @blackencino!

@blackencino blackencino enabled auto-merge (squash) March 6, 2026 18:22
@fwilliams fwilliams changed the base branch from main to v0.4 March 6, 2026 18:26
@blackencino blackencino merged commit a0a74cd into openvdb:v0.4 Mar 6, 2026
33 checks passed
fwilliams pushed a commit that referenced this pull request Mar 6, 2026