
Added PredGatherIGemm sparse conv backend. #508

Merged
blackencino merged 10 commits into openvdb:v0.4 from blackencino:feature/conv_alternative_backends
Mar 6, 2026

Conversation


@blackencino blackencino commented Mar 5, 2026

PredGatherIGemm: Alternative Sparse Convolution Backend

Summary

This PR adds a new sparse convolution backend -- PredGatherIGemm -- that uses
CUTLASS/CuTe implicit-GEMM (IGEMM) with predicated cp.async gather loads on
SM80+ (Ampere and later) GPUs. It processes one output NanoVDB leaf node per CTA,
using TF32 tensor-core arithmetic for the computation.

The backend is integrated into the ConvolutionPlan framework as a selectable
backend (expert_config={"backend": "pred_gather_igemm"}), with the existing
GatherScatterDefault backend remaining the default.
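To make the opt-in mechanism concrete, here is a minimal sketch of the selection behavior described above: the backend is chosen only when explicitly requested via `expert_config`, and everything else keeps the GatherScatterDefault default. The function and constant names below are illustrative, not the actual code in `fvdb/convolution_plan.py`.

```python
# Hypothetical sketch of the backend-selection behavior described in this PR.
# Not the real _build_backend implementation; names are illustrative only.

_KNOWN_BACKENDS = {"gather_scatter_default", "pred_gather_igemm"}


def select_backend(expert_config=None):
    """Return the backend name a ConvolutionPlan-style builder would pick."""
    config = expert_config or {}
    backend = config.get("backend", "gather_scatter_default")
    if backend not in _KNOWN_BACKENDS:
        raise ValueError(f"unknown convolution backend: {backend!r}")
    return backend
```

With no `expert_config`, callers keep the existing default; passing `{"backend": "pred_gather_igemm"}` opts in to the new backend.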

Constraints

The PredGatherIGemm backend is intentionally limited in scope compared to the
default GatherScatterDefault backend:

  • CUDA only, requires SM80+ (Ampere or later)
  • Float32 only (internally promoted to TF32)
  • Forward pass only -- no transpose, no analytical backward (backward falls
    back to GatherScatterDefault when used via autograd)
  • Uniform kernel sizes only: 3, 5, or 7 (x=y=z)
  • Uniform strides only: 1 or 2 (x=y=z)
  • Channel counts must be multiples of 32
  • Batch size 1 only

Kernel size and stride are dispatched at compile time using the project's
dispatch framework, giving 6 total template instantiations.
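The constraint list and the 6-way dispatch can be sketched together: 3 kernel sizes times 2 strides give the 6 instantiations, and inputs that fail the constraints fall back to the default backend. This is an illustrative Python model of the behavior, not the project's actual C++ `dispatch` framework; the function names are hypothetical.

```python
# Illustrative model of the constraint checks and 6-entry dispatch table
# described above. The real backend instantiates these at compile time.
from itertools import product

# One entry per template instantiation: kernel in {3, 5, 7} x stride in {1, 2}.
DISPATCH_TABLE = {
    (k, s): f"pred_gather_igemm_k{k}_s{s}"
    for k, s in product((3, 5, 7), (1, 2))
}


def eligible(kernel, stride, c_in, c_out, batch, dtype="float32"):
    """Mirror the PR's constraint list for the PredGatherIGemm backend."""
    return (
        dtype == "float32"          # float32 only (promoted to TF32)
        and batch == 1              # batch size 1 only
        and kernel in (3, 5, 7)     # uniform kernel sizes, x=y=z
        and stride in (1, 2)        # uniform strides, x=y=z
        and c_in % 32 == 0          # channel counts must be multiples of 32
        and c_out % 32 == 0
    )


def pick_instantiation(kernel, stride, c_in, c_out, batch):
    if not eligible(kernel, stride, c_in, c_out, batch):
        return None  # caller falls back to GatherScatterDefault
    return DISPATCH_TABLE[(kernel, stride)]
```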

Performance Characteristics

Benchmarked on SM120 with Cin=64, Cout=128, kernel 3x3x3, stride 1:

Scenario                    PredGatherIGemm   GS + topology   GS (topology cached)
1M dense (75% leaf occ)      5.2 ms            45.8 ms         31.2 ms
2M dense (75% leaf occ)     10.2 ms            89.1 ms         64.9 ms
4M sparse (25% leaf occ)    32.0 ms            21.1 ms         14.8 ms
8M sparse (10% leaf occ)    43.6 ms             8.1 ms          5.0 ms

The IGEMM backend is significantly faster for dense or near-dense grids
(high leaf-node occupancy), where its one-leaf-per-CTA approach keeps the GPU
fully occupied. At low occupancy the per-CTA work becomes sparse and the
GatherScatterDefault backend -- which operates on compacted index pairs -- wins
decisively.

Files Changed

New files

  • src/fvdb/detail/ops/convolution/PredGatherIGemm.h -- public header
  • src/fvdb/detail/ops/convolution/PredGatherIGemm.cu -- CUTLASS IGEMM kernel,
    CuTe layouts, dispatch table, and entry point
  • src/tests/PredGatherIGemmTest.cu -- C++ gtests: correctness validation
    against GatherScatterDefault across all 6 kernel/stride combinations, plus
    speed comparison benchmarks
  • tests/unit/test_conv_pred_gather_igemm.py -- Python tests: forward-pass
    validation against dense PyTorch conv3d ground truth and cross-backend
    comparison with GatherScatterDefault

Modified files

  • src/fvdb/GridBatch.h / src/fvdb/GridBatch.cpp -- added static
    predGatherIGemmConv method
  • src/python/Bindings.cpp -- pybind11 binding for pred_gather_igemm_conv
  • fvdb/_fvdb_cpp.pyi -- type stub for the new binding
  • fvdb/convolution_plan.py -- _PredGatherIGemmBackend, autograd wrapper
    (_PredGatherIGemmConvFn), backend selection logic in _build_backend
  • src/CMakeLists.txt / src/tests/CMakeLists.txt -- added new source and test
    files to the build

Test Plan

  • ninja PredGatherIGemmTest && ./src/tests/PredGatherIGemmTest --
    runs the C++ gtest suite (correctness + benchmarks)
  • python -m pytest tests/unit/test_conv_pred_gather_igemm.py -v --
    runs the Python test suite (forward-only, TF32-tolerant comparisons against
    dense ground truth and GatherScatterDefault)
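The "TF32-tolerant" comparisons exist because TF32 keeps only 10 explicit mantissa bits versus float32's 23, so tensor-core results legitimately differ from a float32 reference in the low bits. The sketch below (illustrative, not the test suite's code) emulates TF32 rounding by truncating a float32 mantissa to 10 bits, which motivates using a relative tolerance on the order of 2^-10 rather than float32 epsilon.

```python
# Emulate TF32 precision loss: truncate a float32 mantissa to TF32's
# 10 explicit bits. Illustrative only; not part of the test suite.
import struct


def tf32_truncate(x: float) -> float:
    """Round x to float32, then drop the low 13 of its 23 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # keep sign, exponent, and top 10 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```

Values exactly representable in 10 mantissa bits (e.g. 1.0) pass through unchanged; others pick up a relative error bounded by roughly 2^-11.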

Signed-off-by: Christopher Horvath <chorvath@nvidia.com>
@blackencino blackencino requested a review from a team as a code owner March 5, 2026 07:40
@blackencino blackencino requested review from fwilliams and harrism March 5, 2026 07:40
@blackencino blackencino requested a review from sifakis March 5, 2026 07:49
@blackencino blackencino self-assigned this Mar 5, 2026
@blackencino blackencino changed the title Feature/conv alternative backends Added PredGatherIGemm sparse conv backend. Mar 5, 2026

@sifakis sifakis left a comment


The integration of the core IGEMM code looks very clean. I left a few comments on possible optimizations (which I might give a try myself, too).

…onv for ChanOuts that are divisible by 128

@blackencino
Contributor Author

I addressed all of the comments from @sifakis, and with the TK=128 specialization it is MUCH faster, almost 2x in most cases.

@blackencino
Contributor Author

TK421, why aren't you at your post? TK421!?!


@sifakis sifakis left a comment


This looks great @blackencino!

@blackencino blackencino enabled auto-merge (squash) March 6, 2026 18:22
@fwilliams fwilliams changed the base branch from main to v0.4 March 6, 2026 18:26
@blackencino blackencino merged commit a0a74cd into openvdb:v0.4 Mar 6, 2026
33 checks passed
fwilliams pushed a commit that referenced this pull request Mar 6, 2026