Unify CUDA and HIP kernel sources via compat.cuh portability layer #1877
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
TimDettmers left a comment
PR Review: [Discussion] Unify CUDA and HIP kernel sources via compat.cuh portability layer
Classification: Refactoring / RFC (discussion-only, not intended to merge as-is)
Author: @Abdennacer-Badaoui (known contributor — authored the blocksize-32/64 kernels in #1854/#1856 that this unifies)
Risk level: Low (all files are additions in csrc/examples/, no existing code is modified)
Summary
This PR proposes a design for merging the duplicated CUDA and HIP kernel source files into a unified codebase using two new portability headers: compat.cuh (host-safe) and compat_device.cuh (device-only). The current codebase maintains near-identical copies of 5 pairs of files (~6500 LOC of duplication). The proposed approach would eliminate 5 files and ~3300 lines of duplication while introducing 2 new portability headers.
The 8 example files demonstrate the full approach. This is a well-structured RFC that shows rather than tells.
CI Status
- Lint: FAIL (expected; clang-format likely hasn't been run on the new files)
- build-wheels: FAIL (unrelated; dependency on lint)
- All CUDA/HIP/CPU build & test jobs: PASS (these don't compile csrc/examples/)
The lint failure is expected for an RFC and is not a concern at this stage.
Design Assessment
The two-header split (compat.cuh for host-safe code, compat_device.cuh for device-only CUB/MMA) is a clean design. The rationale is solid: .cpp files compiled by gcc/g++ cannot parse CUDA device headers, so the split is necessary.
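A minimal sketch of what the host-safe half of the split could look like (the names follow the PR; the exact contents are my assumption, not the PR's actual file). The point is that nothing here needs a device compiler, so a plain `.cpp` file built with g++ can include it:

```cpp
#include <cassert>

// Hypothetical sketch of a host-safe compat layer. Platform detection:
// hipcc defines __HIP_PLATFORM_AMD__; nvcc and plain host compilers do not.
#if defined(__HIP_PLATFORM_AMD__)
#define BNB_HIP 1
#else
#define BNB_HIP 0
#endif

#if BNB_HIP
#include <hip/hip_runtime.h>
typedef hipStream_t bnb_stream_t;
#else
// cudaStream_t is an opaque pointer; forward-declaring the underlying struct
// keeps this header parseable even without the CUDA toolkit installed.
typedef struct CUstream_st* bnb_stream_t;
#endif

inline bool bnb_is_hip() { return BNB_HIP != 0; }
```

Anything that drags in CUB, MMA intrinsics, or other device-only headers would live in the second header instead, which only `.cu`/`.hip` translation units include.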
Strengths:
- Namespace aliasing for CUB/hipCUB (`namespace bnb_cub = cub`/`hipcub`) eliminates ~90% of the mechanical `cub::` vs `hipcub::` differences with a single line. Elegant.
- The compile-time `BNB_WARP_SIZE` in `common_unified.cuh` correctly handles the GFX9 (CDNA) 64-wide warps vs the RDNA/CUDA 32-wide warps. The `#ifdef __GFX9__` guard is correct for current ROCm architectures.
- `kQuantizeBlockwiseSmall` successfully unifies `kQuantizeBlockwise32` (CUDA) and `kQuantizeBlockwise64` (HIP) by parameterizing on `BNB_WARP_SIZE`. The kernel logic is structurally identical to both originals: I verified that the codebook values, reduction ops, quantization packing, and store patterns match.
- `#if BNB_HIP` guards are used sparingly and only where genuinely needed:
  - `atomicMax` (CUDA CAS loop vs HIP native)
  - `Context` class (cuBLAS vs rocBLAS handle creation)
  - `gemmex`/`strided_gemmex` (different BLAS APIs)
  - `igemmlt` (hipBLAS requires explicit heuristic algo selection)
  - `blocksize==64` dispatch path in `ops_unified.cu` (only HIP with 64-wide warps needs the small-block kernel for blocksize=64)
- The CMakeLists change is minimal and correct: a single `GPU_FILES` list replaces the separate `CUDA_FILES`/`HIP_FILES`, with `set_source_files_properties(${GPU_FILES} PROPERTIES LANGUAGE HIP)` for HIP builds. The `<<<>>>` launch syntax works natively on HIP, so no `hipLaunchKernelGGL` wrappers are needed.
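The namespace-alias trick can be seen in miniature with stand-in namespaces (the real header aliases `cub`/`hipcub`, which need a device compiler; these stubs only demonstrate the mechanism):

```cpp
#include <cassert>

// Stand-ins for cub/hipcub: same interface, different namespace, just like
// the real libraries. (Stub code; the PR aliases the actual CUB/hipCUB
// namespaces.)
namespace cub_stub    { inline constexpr int kWarpSize = 32; }
namespace hipcub_stub { inline constexpr int kWarpSize = 64; }

#ifndef BNB_HIP
#define BNB_HIP 0 // assume a CUDA-style build for this sketch
#endif

// One line removes every cub:: vs hipcub:: difference from kernel code.
#if BNB_HIP
namespace bnb_cub = hipcub_stub;
#else
namespace bnb_cub = cub_stub;
#endif
```

Kernel code then writes, e.g., `bnb_cub::BlockReduce<float, 256>` and compiles unchanged on both platforms.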
Technical concerns (for discussion):
- `BNB_WARP_SIZE` and the blocksize=64 dispatch: in `ops_unified.cu` lines 50-61, the `blocksize==64` path has a `#if BNB_HIP` guard to dispatch to `kQuantizeBlockwiseSmall` for 4-bit types on HIP. However, `BNB_WARP_SIZE` is a device-side macro (`__GFX9__` is only defined in device code), while this dispatch decision is made in host code. How will the host side know whether to use the warp-64 path? The current approach uses `#if BNB_HIP` as a proxy, which is correct if the library is compiled separately for each target architecture, but could be wrong if a single HIP binary targets both CDNA (warp64) and RDNA (warp32) architectures simultaneously. This probably needs a runtime check, separate compilation for each arch, or a comment explaining the assumption.
- `kQuantizeBlockwiseSmall` name: the kernel is called "Small", but on HIP with warp=64 it handles blocksize=64 (not small at all). Consider `kQuantizeBlockwiseWarp` or similar to reflect that it processes warp-sized blocks. Minor naming nit.
- `compat.cuh` includes `rocblas/rocblas.h` and `hipblas/hipblas.h` unconditionally on HIP: these are heavyweight headers. If `compat.cuh` is meant to be "host-safe and lightweight", consider whether the BLAS includes belong here or in a separate BLAS compat header. Currently the `Context` class in `ops_unified.cuh` needs them, but other files including `compat.cuh` may not.
- `BNB_BLASLT_PTR_MODE_ALPHA_VEC` asymmetry: on CUDA this maps to `CUBLASLT_POINTER_MODE_ALPHA_DEVICE_VECTOR_BETA_ZERO`, on HIP to `HIPBLASLT_POINTER_MODE_ALPHA_DEVICE_VECTOR_BETA_HOST`. The `BETA_ZERO` vs `BETA_HOST` difference is notable: is this an intentional difference in how the two backends handle beta, or should it be `BETA_ZERO` on both? The discrepancy exists in the current code, so it's not introduced by this PR, but the unification is a good opportunity to document why.
- Missing `bnb_blasLtPrefCreate`/`bnb_blasLtPrefSetAttr`/`bnb_blasLtAlgoGetHeuristic` macros for CUDA: these are defined for HIP in `compat.cuh` but not for CUDA, because CUDA doesn't need the heuristic path. They are only used inside a `#if BNB_HIP` block in `ops_unified.cu`, so there's no build failure, but the compat header is incomplete if someone tried to use these macros on CUDA. Add a comment or `#ifdef` guard noting they are HIP-only.
- `CUDA_CHECK_RETURN` backward-compat macro: good that `compat.cuh` defines `#define CUDA_CHECK_RETURN(value) BNB_CHECK_RETURN(value)` for migration purposes. This should be documented as deprecated and removed after the full migration.
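On the host/device warp-size question, one possible direction (matching the fix adopted later in this thread) is to key the constant off `__AMDGCN_WAVEFRONT_SIZE`, which the HIP compiler defines in both the host and device passes, rather than `__GFX9__`, which exists only in the device pass. A sketch, where the dispatch predicate is my illustration, not the PR's actual code:

```cpp
#include <cassert>

// Host-visible warp-size selection. On a gfx9 (CDNA) HIP build the compiler
// defines __AMDGCN_WAVEFRONT_SIZE=64 in both compilation passes; on CUDA
// (and on a plain host compiler, as here) it is undefined and we fall back
// to 32.
#if defined(__AMDGCN_WAVEFRONT_SIZE)
#define BNB_WARP_SIZE __AMDGCN_WAVEFRONT_SIZE
#else
#define BNB_WARP_SIZE 32
#endif

// Hypothetical dispatch predicate: host code could decide directly whether a
// given blocksize needs the warp-sized kernel, instead of using BNB_HIP as a
// proxy (blocksize=32 on CUDA, blocksize=64 on warp-64 HIP).
inline constexpr bool needs_small_block_kernel(int blocksize) {
    return blocksize <= BNB_WARP_SIZE;
}
```

This still assumes one wavefront size per compiled binary; a fat binary covering both CDNA and RDNA would need a runtime query instead.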
Security Review
- No network access, command execution, or dynamic code execution introduced
- No new dependencies added
- No changes to `pyproject.toml`, CI workflows, or agent configuration files
- No invisible Unicode characters detected in any file
- Codebook values (FP4 and NF4 lookup tables) are byte-identical to the existing `kernels.cu`
- CMakeLists changes are limited to file list unification: no new `execute_process`, `FetchContent`, or custom commands
- Build flags unchanged
No security concerns.
Numerical Correctness
All quantization/dequantization kernel code is mechanically equivalent to the existing CUDA and HIP kernels. Specifically verified:
- `fp4_dequantization_lut` and `nf4_dequantization_lut` values are identical
- `dQuantizeFP4`, `dQuantizeNF4`, `dDequantizeFP4Tree`, `dDequantizeNF4` logic is identical
- `atomicMax` CAS loop is correctly guarded with `#if !BNB_HIP`
- `kQuantizeBlockwise` template uses `bnb_cub::` and `BNB_MAX_OP` as 1:1 replacements
- `kQuantizeBlockwiseSmall` logic matches both `kQuantizeBlockwise32` (CUDA) and `kQuantizeBlockwise64` (HIP)
- `igemmlt` preserves the HIP heuristic path and CUDA direct path
No numerical correctness concerns.
Downstream Impact
None. This PR adds files to csrc/examples/ — it does not modify any compiled source, public API, or serialization format. No downstream impact.
Cross-PR Conflicts
PR #1858 (k-bit blockwise quantization kernels) adds new CUDA kernels. If this RFC proceeds to full migration, the new kernels from #1858 would need to be written using the compat.cuh abstractions rather than raw CUDA APIs. Worth noting for sequencing.
Verdict: APPROVE (as RFC)
This is a well-designed RFC. The portability layer approach is sound, the `#if BNB_HIP` guards are minimal and limited to genuinely divergent code, and the unified kernel code is a faithful merge of the existing CUDA and HIP sources. The concerns listed above are discussion points for the design, not blockers.
For the full migration, I'd recommend:
- Resolve the warp-size host/device detection question (concern #1 above)
- Add compilation tests that verify the unified files build correctly for both CUDA and HIP
- Run the full test suite on both CUDA and ROCm hardware to verify numerical equivalence
- Sequence this after or coordinate with #1858 to avoid rework
@Abdennacer-Badaoui Thanks! This is essentially what I was expecting we could do. I think this is a good way forward. Most of the review comments above make sense as well!
Move the unified portability-header-based files from csrc/examples/ into csrc/, replacing the duplicated CUDA and HIP kernel files.

- Add compat.cuh and compat_device.cuh (portability headers)
- Replace common.cuh, kernels.cu, ops.cu, ops.cuh, pythonInterface.cpp, CMakeLists.txt with unified versions
- Update kernels.cuh: rename kQuantizeBlockwise32 -> kQuantizeBlockwiseSmall
- Delete HIP-only files: common_hip.cuh, kernels.hip, kernels_hip.cuh, ops.hip, ops_hip.cuh
- Delete csrc/examples/ (files are now in their final locations)

Net: 10 source files -> 7, ~3300 fewer lines of duplicated code. Same .cu files compiled by both nvcc (CUDA) and hipcc (HIP).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add hip/hip_bfloat16.h include to compat.cuh (bnb_bfloat16 type alias requires hip_bfloat16 to be defined)
- Add __syncwarp() no-op macro for HIP (AMD warps are always in lockstep)
- Add hipblas version check (#if hipblasVersionMajor >= 3) for GemmEx calls (ROCm 6.1 ships hipblas v2 which uses HIPBLAS_R_* not HIPBLAS_COMPUTE_*)
- Fix include in ops.cuh: common.h -> common.cuh (BNB_WARP_SIZE visibility)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
common.h defines General8bit, FP4, NF4 enum values used in template instantiations. It was previously the only include; now include both common.h (for DataType_t) and common.cuh (for BNB_WARP_SIZE). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
On AMD CDNA GPUs (warp size 64), blocksize=64 would mean only 1 thread per warp in the quantize kernels, which is incompatible. Wrap these instantiations with #if BNB_WARP_SIZE == 32 so they only compile on NVIDIA. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous commit missed the float/NF4 and all bnb_bfloat16 blocksize=64 instantiations. These use BLOCK_LOAD_WARP_TRANSPOSE with 32 threads (64/2), which requires block_dim >= warp_size. On CDNA (warp=64), 32 threads is insufficient. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
BLOCK_LOAD_WARP_TRANSPOSE requires threads >= warp_size. On CDNA (warp=64), kQuantizeBlockwise with BLOCK_SIZE=64 has only 32 threads. Fall back to BLOCK_LOAD_DIRECT / BLOCK_STORE_DIRECT when threads < BNB_WARP_SIZE. This avoids rocprim compilation errors while keeping WARP_TRANSPOSE for larger block sizes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
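The fallback described in this commit can be sketched as a compile-time selection (stand-in enum for illustration; the real code picks between cub/hipcub `BlockLoadAlgorithm` values):

```cpp
#include <cassert>

// Stand-ins for the CUB/rocPRIM block-load algorithms (sketch only).
enum LoadAlgorithm { BLOCK_LOAD_DIRECT, BLOCK_LOAD_WARP_TRANSPOSE };

#ifndef BNB_WARP_SIZE
#define BNB_WARP_SIZE 32 // assumption for this host-side sketch
#endif

// WARP_TRANSPOSE needs at least one full warp of threads; with fewer threads
// (e.g. 32 threads for blocksize=64 on a warp-64 CDNA GPU) fall back to a
// direct load instead of triggering a rocprim compile error.
template <int NUM_THREADS>
constexpr LoadAlgorithm pick_load_algorithm() {
    return NUM_THREADS < BNB_WARP_SIZE ? BLOCK_LOAD_DIRECT
                                       : BLOCK_LOAD_WARP_TRANSPOSE;
}
```

Larger block sizes keep the faster transposed load; only the warp-deficient configurations take the direct path.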
__GFX9__ is only defined during the device compilation pass, not during host compilation. This caused BNB_WARP_SIZE to be 32 on the host pass even for gfx942 (CDNA, warp=64), making the conditional WARP_TRANSPOSE vs DIRECT selection wrong. Use __AMDGCN_WAVEFRONT_SIZE instead, which the HIP compiler defines correctly in both host and device passes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Now that kQuantizeBlockwise falls back to BLOCK_LOAD_DIRECT when threads < warp_size, the blocksize=64 instantiations compile correctly on both CUDA and HIP. The guards were causing linker errors because ops.cu still references these symbols for the General8bit dispatch path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
MI300X Runtime Validation of Unified CUDA/HIP Kernels
I validated this PR's unified kernel approach on actual AMD MI300X hardware. The unified files were activated from
Test Environment
Test Results
Full test suite passes with zero failures.
Build Fixes Required
The unified files from
Branch
All fixes are on
Summary
The unified kernel approach works well. The compat headers successfully abstract CUDA/HIP differences, and the conditional compilation produces correct code for MI300X. The three issues found above are all fixable within the compat header pattern; no fundamental problems with the unification design.
I have merged @TimDettmers’ changes into my branch. The PR is ready to be merged if you’re okay with the current state of the unification. If you have any suggestions or improvements, please let me know.
# Conflicts:
#	csrc/kernels.cu
#	csrc/kernels.hip
#	csrc/kernels_hip.cuh
#	csrc/ops.cu
#	csrc/ops.cuh
#	csrc/ops.hip
#	csrc/ops_hip.cuh
#	csrc/pythonInterface.cpp
`#if BNB_HIP`
`#ifndef NO_HIPBLASLT`

I think we'll get rid of NO_HIPBLASLT; it was only defined for old ROCm versions. But we can leave it in for now.
Merged 3934632 into bitsandbytes-foundation:main
Summary
Merge the duplicated CUDA and HIP kernel sources into a single codebase. Instead of maintaining near-identical file pairs that inevitably drift apart, all GPU code now lives in shared files that compile for both platforms.
Problem
We maintained near-identical copies of every GPU kernel:

- kernels.cu / kernels.hip
- kernels.cuh / kernels_hip.cuh
- ops.cu / ops.hip
- ops.cuh / ops_hip.cuh
- common.cuh / common_hip.cuh

The HIP files were originally auto-generated by hipify and manually patched. Every bug fix or new feature had to be applied to both copies.
Approach
Two portability headers abstract all CUDA/HIP differences:
- `compat.cuh`: host-safe types and macros (safe to include from `.cpp` files)
- `compat_device.cuh`: device-only layer with CUB/hipCUB, reduction ops, MMA (include from `.cu` files only)

These provide macros, type aliases, and namespace aliases:

- `bnb_cub::` → `cub::` on CUDA, `hipcub::` on HIP
- `bnb_bfloat16` → `__nv_bfloat16` on CUDA, `hip_bfloat16` on HIP
- `bnb_stream_t` → `cudaStream_t` / `hipStream_t`
- `BNB_MAX_OP` → `cub::Max()` / `hipcub::Max()`
- `BNB_CHECK_RETURN()` → error checking
- `bnb_blasLt*`, `bnb_sparse*` → cuBLAS/hipBLAS and cuSPARSE/hipSPARSE

The `<<<grid, block>>>` launch syntax works natively on HIP, so no `hipLaunchKernelGGL` wrappers are needed. For HIP builds, CMake sets `LANGUAGE HIP` on the `.cu` files. `#if BNB_HIP` guards are only needed for genuinely divergent code (~10% of changes):

- `atomicMax` (CUDA needs CAS loop, HIP has native)
- `Context` class (cuBLAS vs rocBLAS handle creation)
- `igemmlt` (hipBLAS requires explicit heuristic algo selection)
- Warp-size differences (`BNB_WARP_SIZE` compile-time constants)
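The `atomicMax` divergence is the classic case: CUDA has no native float `atomicMax`, so the kernel emulates it with a compare-and-swap loop. A host-side analogue of that CAS pattern using `std::atomic` (illustration only; the device kernel uses `atomicCAS` and `__float_as_uint`-style bit casts):

```cpp
#include <atomic>
#include <cassert>
#include <cstring>

// Returns the value observed before the update, mirroring atomicMax
// semantics. Float bits are stored in an unsigned int, as the device CAS
// loop does.
inline float atomic_max_float(std::atomic<unsigned int>* addr, float value) {
    unsigned int observed = addr->load();
    for (;;) {
        float current;
        std::memcpy(&current, &observed, sizeof current);
        if (current >= value) return current; // already at least `value`
        unsigned int desired;
        std::memcpy(&desired, &value, sizeof desired);
        // On failure, compare_exchange_weak reloads `observed`; retry.
        if (addr->compare_exchange_weak(observed, desired)) return current;
    }
}
```

On HIP the whole loop collapses to the platform's native float `atomicMax`, which is exactly why this spot keeps an `#if BNB_HIP` guard.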
The two-header split is necessary because `.cpp` files (like `pythonInterface.cpp`) are compiled by the host compiler (gcc/g++), which cannot parse CUB/device headers.
Changes
- Deleted: `common_hip.cuh`, `kernels.hip`, `kernels_hip.cuh`, `ops.hip`, `ops_hip.cuh`
- Replaced with unified versions: `common.cuh`, `kernels.cu`, `kernels.cuh`, `ops.cu`, `ops.cuh`
- Added: `compat.cuh`, `compat_device.cuh`
- Updated: `CMakeLists.txt`, `pythonInterface.cpp`
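The CMake side of the unification reduces to a sketch like the following (variable names and paths are assumptions based on the description above, not the PR's exact file):

```cmake
# One shared list replaces the separate CUDA_FILES / HIP_FILES lists.
set(GPU_FILES csrc/ops.cu csrc/kernels.cu)

if(COMPUTE_BACKEND STREQUAL "hip")
  # hipcc compiles the very same .cu sources once they are marked as HIP.
  set_source_files_properties(${GPU_FILES} PROPERTIES LANGUAGE HIP)
endif()
```

No `hipify` step and no wrapper macros: the language property alone routes the shared sources through the HIP toolchain.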