Cutensor bindings by SpaceyLake · Pull Request #38 · TAPPorg/reference-implementation

SpaceyLake · 2026-02-09T16:10:37Z

Bindings to cuTENSOR. This adds a handle parameter to create_tensor_info. Setters for tensor_info are not implemented because of complications. The PR also includes versions of the test and the demo that load implementations dynamically, as well as a cuTENSOR-specific demo.
I also removed some deprecated code on this branch.
The code that runs on CUDA is not tested automatically because standard GitHub runners only have CPUs.
An attribute controls whether on-device memory is used.

evaleev

prepared by claude, edited by me

PR #38: Cutensor Bindings — Review

+8,974 / -910 across 41 files | CI: All checks pass

Summary

This PR adds cuTENSOR bindings for the TAPP API, refactors the CMake build system (pushing test/example targets into subdirectories), adds a TAPP_handle parameter to TAPP_create_tensor_info (API-breaking change), renames TAPP_REFERENCE_ENABLE_TBLIS to TAPP_REFERENCE_USE_TBLIS, adds dynamic-loading test infrastructure, and removes some deprecated files.

High-level concerns

API-breaking change to TAPP_create_tensor_info — Adding TAPP_handle as a new parameter changes the public API. The reference implementation (reference_implementation/src/tensor.c) accepts the parameter but ignores it. This is the right design (the handle is needed by cuTENSOR but not by the reference impl), but consider whether this needs a version bump or changelog entry.
Negative strides and negative_str test disabled in demo.c — The negative stride test is commented out in demo.c with a cuTENSOR-specific comment, but demo.c links against tapp::reference, not tapp::cutensor. Disabling it here penalizes the reference implementation's test coverage for a cuTENSOR limitation. Consider keeping it enabled for the reference demo and only disabling it in cuTENSOR-specific tests.
Massive code duplication: test_dynamic.cpp (4,079 lines) — This is essentially a copy-paste of test.cpp with all calls going through a struct imp function-pointer table. Same for demo_dynamic.c vs demo.c. This creates a significant maintenance burden — any future test change must be made in both places. Consider using macros or templates to share the test logic.
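One possible shape for sharing the test logic is sketched below. It assumes the statically linked API and the dlopen-ed function-pointer table can both be wrapped behind the same member-call syntax. The names `StaticImpl`, `DynamicImpl`, `scale` and `test_scale` are hypothetical stand-ins, not part of TAPP.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the function-pointer table ("struct imp") used by
// test_dynamic.cpp; the real table has many more entries.
struct DynamicImpl {
    int (*scale)(const float* in, float* out, int n, float alpha);
};

// Hypothetical stand-in for the statically linked API used by test.cpp.
struct StaticImpl {
    static int scale_static(const float* in, float* out, int n, float alpha) {
        for (int i = 0; i < n; ++i) out[i] = alpha * in[i];
        return 0;
    }
    int scale(const float* in, float* out, int n, float alpha) const {
        return scale_static(in, out, n, alpha);
    }
};

// The test body is written once and works with any type that exposes scale().
template <typename Impl>
bool test_scale(const Impl& impl) {
    const std::vector<float> in{1.0f, 2.0f, 3.0f};
    std::vector<float> out(in.size());
    if (impl.scale(in.data(), out.data(), static_cast<int>(in.size()), 2.0f) != 0)
        return false;
    return out[0] == 2.0f && out[1] == 4.0f && out[2] == 6.0f;
}
```

With this layout, test.cpp would instantiate `test_scale` with the static wrapper and test_dynamic.cpp with the loaded table, so a change to the test body is made in one place.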

Specific issues

Bugs / correctness

product.cpp:952 — Wrong handle cast:
```
plan_struct->handle = ((cutensorHandle_t*) handle);
struct handle* handle_struct = (struct handle*) plan_struct->handle;
```
handle is a TAPP_handle (i.e., intptr_t) that actually points to a struct handle. First it's cast to cutensorHandle_t* and stored, then the stored cutensorHandle_t* is cast to struct handle*. This only works by accident because the cutensorHandle_t* libhandle is the first member of struct handle. This is fragile and incorrect — plan_struct->handle should be typed as struct handle* or at minimum the first cast should be (struct handle*).
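A minimal sketch of the typed alternative follows. It uses stub types so that it compiles without cuTENSOR; `cutensorHandle_stub`, `owner`, `as_handle_struct` and `attach_handle` are hypothetical names, and the member list of `struct handle` is assumed from the description above.

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins so the sketch compiles without cuTENSOR; the real types come from
// cutensor.h and the bindings' own headers.
struct cutensorHandle_stub {};
using TAPP_handle = std::intptr_t;

struct handle {
    cutensorHandle_stub* libhandle;  // library handle held by the TAPP handle
    std::intptr_t* attributes;       // attribute store (assumed layout)
};

struct product_plan {
    handle* owner;  // typed as the struct it really points to
};

// Recover the struct from the opaque integer handle with a single cast.
inline handle* as_handle_struct(TAPP_handle h) {
    return reinterpret_cast<handle*>(h);
}

// Store the typed pointer in the plan; no intermediate cutensorHandle_t* cast.
inline void attach_handle(product_plan& plan, TAPP_handle h) {
    plan.owner = as_handle_struct(h);
}
```

Code that needs the library handle would then read `plan.owner->libhandle` explicitly rather than depending on member order.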
attributes.cpp:575 — memcpy to/from intptr_t as pointer:
```
memcpy((void*)handle_struct->attributes[0], value, sizeof(bool));
```
attributes[0] is an intptr_t holding a bool*. The cast (void*)handle_struct->attributes[0] is correct, but the design is fragile — the intptr_t* array is a poor man's type-erased attribute store. Consider at minimum documenting the ownership model.
error.cpp:754 — Extracting TAPP field then switching on error instead of tappVal:
```
uint64_t tappVal = code & TAPP_FIELD_MASK;
if (tappVal != 0) {
    switch (error)  // <-- should be switch(tappVal)
```
If both TAPP and cuTENSOR errors are packed, error will include the cuTENSOR bits and never match cases 1-15.
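The intended control flow can be sketched as follows. The bit layout, the mask value, the shift and the messages are assumptions made for illustration; the real definitions live in error.cpp.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical layout: the low 8 bits hold the TAPP code and the next 8 bits
// hold the cuTENSOR code. The real masks may differ.
constexpr std::uint64_t TAPP_FIELD_MASK = 0xFF;
constexpr int CUTENSOR_SHIFT = 8;

inline std::string tapp_message(int code) {
    const std::uint64_t tappVal = static_cast<std::uint64_t>(code) & TAPP_FIELD_MASK;
    if (tappVal == 0) return "no TAPP error";
    switch (tappVal) {  // switch on the extracted field, not on the packed code
        case 1: return "placeholder message for TAPP code 1";
        case 2: return "placeholder message for TAPP code 2";
        default: return "unknown TAPP code";
    }
}
```

Because the switch sees only the masked field, a packed value that also carries a cuTENSOR code still selects the right TAPP case.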
cutensor_demo.cpp:2678 — Wrong copy size in conjugate() test:
```
cudaMemcpy((void*)D, (void*)D_d, 9 * sizeof(float), cudaMemcpyDeviceToHost);
```
D is std::complex<float>[9], so this should be 9 * sizeof(std::complex<float>). Only half the data is copied back.
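One way to make this class of mistake harder is to derive the byte count from the element type. The sketch below is host-only and uses std::memcpy as a stand-in for the device-to-host copy; `bytes_of` and `copy_back` are hypothetical helpers, not part of the PR.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <cstring>

// Byte count for `count` elements of T; tying the size to the element type
// avoids hard-coding sizeof(float) for a complex buffer.
template <typename T>
constexpr std::size_t bytes_of(std::size_t count) {
    return count * sizeof(T);
}

// Host-side stand-in for the device-to-host copy; with CUDA the same byte
// count would be passed to cudaMemcpy.
template <typename T>
void copy_back(T* dst, const T* src, std::size_t count) {
    std::memcpy(static_cast<void*>(dst), static_cast<const void*>(src),
                bytes_of<T>(count));
}
```

For nine `std::complex<float>` elements, `bytes_of` yields twice the byte count of nine `float` elements, which is the half that the current call leaves uncopied.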
error.cpp:853 — CUDA error packing clears TAPP+cuTENSOR fields:
```
uint64_t cleared_val = val & (~LOW_FIELDS_MASK);
return static_cast<int>(cleared_val | new_cuda_val);
```
This discards any previously packed TAPP/cuTENSOR errors. The other pack_error overloads preserve other fields, but this one doesn't. Inconsistent behavior.
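If preserving the other fields is the intended behavior, a single field-wise setter could serve all three overloads. The sketch below assumes a hypothetical layout of three 8-bit fields; `FIELD_BITS`, `Field`, `set_field` and `get_field` are illustrative names, not taken from error.cpp.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout, 8 bits per field; the real masks are defined in error.cpp.
constexpr std::uint64_t FIELD_BITS = 8;
constexpr std::uint64_t FIELD_MASK = 0xFF;
enum Field : std::uint64_t { TAPP_FIELD = 0, CUTENSOR_FIELD = 1, CUDA_FIELD = 2 };

// Replace one field and leave the other fields untouched.
constexpr std::uint64_t set_field(std::uint64_t packed, Field f, std::uint64_t value) {
    return (packed & ~(FIELD_MASK << (FIELD_BITS * f))) |
           ((value & FIELD_MASK) << (FIELD_BITS * f));
}

// Extract one field from a packed value.
constexpr std::uint64_t get_field(std::uint64_t packed, Field f) {
    return (packed >> (FIELD_BITS * f)) & FIELD_MASK;
}
```

Setting the CUDA field through `set_field` keeps any TAPP or cuTENSOR code that was packed earlier, which would make the three overloads behave consistently.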

Memory safety

execute_product in product.cpp — Early returns leak GPU memory. Every if (cerr != cudaSuccess) return pack_error(0, cerr) between cudaMallocAsync calls will leak all previously allocated device buffers (A_d, B_d, C_d, D_d, E_d, contraction_work). Consider using RAII or a goto-cleanup pattern.
create_tensor_product in product.cpp — Early returns leak plan_struct and partial state. If any cuTENSOR call fails after new product_plan, the plan_struct and its dynamically allocated members are leaked.
execute_product — perm_scalar_ptr uses malloc but is never freed on error path (line ~1216 returns before free(perm_scalar_ptr) if cutensorPermute fails).
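The RAII option could look roughly like the sketch below. It is host-only: std::malloc and std::free stand in for cudaMallocAsync and cudaFreeAsync, and `Buffer`, `BufferDeleter`, `allocate`, `run` and the `live_buffers` counter are hypothetical names used only to show that early returns no longer leak.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>

// Counts outstanding allocations so the demo can check for leaks.
static int live_buffers = 0;

// Host-side stand-in: with CUDA the deleter would call cudaFreeAsync on the
// executor's stream instead of std::free.
struct BufferDeleter {
    void operator()(void* p) const {
        std::free(p);
        --live_buffers;
    }
};
using Buffer = std::unique_ptr<void, BufferDeleter>;

// Stand-in for cudaMallocAsync; returns an empty Buffer on failure.
inline Buffer allocate(std::size_t bytes) {
    void* p = std::malloc(bytes);
    if (p != nullptr) ++live_buffers;
    return Buffer(p);
}

// Returns nonzero on failure; buffers allocated before the failure are
// released automatically when the function returns early.
inline int run(bool fail_midway) {
    Buffer a = allocate(64);
    if (!a) return 1;
    Buffer b = allocate(64);
    if (!b) return 1;
    if (fail_midway) return 2;  // early return: a and b are still freed
    Buffer c = allocate(64);
    if (!c) return 1;
    return 0;
}
```

The same wrapper would cover perm_scalar_ptr, and a `std::unique_ptr` to the plan would cover the early returns in create_tensor_product.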

Style / quality

Missing newlines at end of file in essentially all new headers and source files under cutensor_bindings/. Most tools and compilers warn about this.
Unreachable break statements after return in switch cases throughout datatype.cpp and product.cpp (translate_operator, translate_datatype, etc.). Harmless but noisy.
VLA usage (int64_t sorted_strides_D[TAPP_get_nmodes(D)] in product.cpp, int64_t section_coordinates_D[...] in execute_product). VLAs are not standard C++ and are a compiler extension. Consider using std::vector or new[].
Magic number 15 for "invalid key" in attributes.cpp. This should use a named constant or the error enum.
cmake_minimum_required(VERSION 3.17) inside CMakeLists.txt at line 198 — cmake_minimum_required should only be called once at the top of the project. This is a policy change mid-file. Use if(CMAKE_VERSION VERSION_LESS 3.17) / message(FATAL_ERROR ...) instead, or bump the top-level requirement.
cutensor_bindings/CMakeLists.txt:338-341 — target_link_libraries(cutensor::cutensor INTERFACE CUDA::cudart) modifies an IMPORTED target's link interface. This is a surprising side effect — it means anyone finding cuTENSOR through this build gets CUDA::cudart added transitively, even if they didn't want it. Consider linking CUDA::cudart to tapp-cutensor directly instead (which is already done on line 370).
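For the VLA item, the std::vector replacement is mechanical. The sketch below assumes, from the name only, that sorted_strides_D holds a sorted copy of the strides; `sorted_strides` is a hypothetical helper and `nmodes` stands in for the value returned by TAPP_get_nmodes(D).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Returns the strides sorted in ascending order. The vector replaces a
// declaration of the form `int64_t sorted[nmodes]`, which is not standard C++.
inline std::vector<std::int64_t> sorted_strides(const std::int64_t* strides, int nmodes) {
    std::vector<std::int64_t> sorted(strides, strides + nmodes);
    std::sort(sorted.begin(), sorted.end());
    return sorted;
}
```

Where a raw pointer is still needed for a C API call, `sorted.data()` provides it, and a zero-mode tensor yields an empty vector rather than a zero-length array.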

CMake

examples/CMakeLists.txt:1565 — tapp-reference-exercise_tucker_answers links against tapp-reference (old target name) instead of tapp::reference. Inconsistent with the rest of the migration.
test/CMakeLists.txt — The dynamic test/demo targets are only built when TAPP_CUTENSOR is enabled, but they dlopen shared libraries at runtime and don't actually depend on cuTENSOR at compile time. Could they be useful without cuTENSOR too (e.g., testing two reference implementations)?

Test infrastructure

test_dynamic.h — pathA and pathB are hardcoded as "./libtapp-reference.so" and "./libtapp-cutensor.so". This won't work on macOS (.dylib) or if the build output is in a different directory. These should be configurable, e.g., via CMake configure_file or command-line arguments.
test_dynamic.cpp line 7257 — Syntax error in commented-out code: str(test_mixed_strides(impA, impB) has mismatched parens.
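For the hardcoded library paths, a command-line override with a platform-aware default is one option. The helpers below are a sketch under assumptions: `shared_lib_suffix`, `default_lib_path` and `lib_path_from_args` are hypothetical names, and only the macOS and generic Unix suffixes are handled.

```cpp
#include <cassert>
#include <string>

// Shared-library suffix for the platform the test is compiled on.
inline std::string shared_lib_suffix() {
#if defined(__APPLE__)
    return ".dylib";
#else
    return ".so";
#endif
}

// Builds a default path such as "./libtapp-reference.so" on Linux.
inline std::string default_lib_path(const std::string& dir, const std::string& name) {
    return dir + "/lib" + name + shared_lib_suffix();
}

// Picks argv[index] when it is given, otherwise the platform default in the
// current directory.
inline std::string lib_path_from_args(int argc, const char* const* argv, int index,
                                      const std::string& fallback_name) {
    if (index < argc && argv[index] != nullptr) return argv[index];
    return default_lib_path(".", fallback_name);
}
```

A test binary could then call `lib_path_from_args(argc, argv, 1, "tapp-reference")` and `lib_path_from_args(argc, argv, 2, "tapp-cutensor")` before passing the results to dlopen.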

Minor / positive notes

The CMake refactoring (pushing test/example targets into subdirectories) is a good cleanup
TAPP_REFERENCE_ENABLE_TBLIS -> TAPP_REFERENCE_USE_TBLIS rename is more descriptive
The printf("%s", message_buff) fix (from printf(message_buff)) is a correct format-string vulnerability fix
reduce_isolated_indices rename from contract_unique_idx is clearer
The conditional cleanup fix in run_tblis_mult (checking tblis_A_reduced != &tblis_A before freeing) fixes a real bug
The rand() change from -max() to min() avoids UB with signed overflow

SpaceyLake and others added 30 commits February 2, 2026 16:17

First stage of cutensor wrapper, only works with basic strides

b3da13a

Added the use of handle

362962c

Updated bindings allowing for non-contigous output tensor.

f2ed80f

Modified to work with current CuTensor bindings

933fba4

Added functionality for elemental operation on D

a2d46d3

Fixed function name

00e90e5

Fixed precision type

439d5cf

Small sectioning optimization

e8f86f0

Fixed scalar for permute D

412f1fe

Fixed sectioning

f584e7d

Created a demo version that loads libraries dynamically

2b2ecec

Created a test version that loads libraries dynamically

29230cb

Simple exapmle of using CuTensor

aa69f9a

Made cuda stream a part of TAPP_executor

f407841

Algorithm correction

4ca108b

Added cutensor handle to TAPP_handle

a917783

Corrected copying of memory

d80d06f

cutensor error handling

f8e70fb

can compile with cmake

87cdea5

Fixed typo

3353f35

Added the handle to create tensor info

31b44ba

Added handle when creating tensor info in old files

0d67763

Uncommented code

7dbaf36

Made test use tblis instead of cutensor

81e8234

Added the use of attributes to decide if input is on host or device

c6d6737

Added demo for cutensor with on device input

9f361ad

Dynamic demo running on cutensor with attribute to telling use of host memory

2a466f3

Updated error handling

7f061fa

Updated function calls with create executor and handle as part of the api

d701639

Added define statement

f6838a0

SpaceyLake and others added 29 commits February 24, 2026 14:48

Restructure, with own CMake for the bindings

6c2be1d

Removed depricated code

87436c9

Removed more depricated code

cef44f6

Update exercises

973c1b0

Changed comments

b1996aa

Seeing to it that the examples have create and destroy handles

5e6f88c

make permutation path in cutensor optional, fix cmake, fix setting use_device_memory and demo-dynamic, test-dynamic

8f31742

skip syncing stream, unless offloading

c1e6db3

handle memory via Async allocation using stream (executor)

69b1158

fix type TAPP_attr_get

a351f28

Fixed a bug where generation of test with subtensor with lower number of modes could create differences in C and D

8dc4da8

Workaround, only doing reductions when necessary, avoiding some cases that doesn't work for TBLIS right now

ca12525

Put alpha and beta to more appropriate values

b64966a

[cutensor] slim down cmake harness + no need for CUDA

edf664a

[cutensor] cleanup CMake yet more, missing/misnamed headers

2ef1368

[cmake] push down tests/examples CMake code into the respective subdirs

922c7b2

[cutensor] tapp-reference-cutensor -> tapp-cutensor

8d589d5

Fixed alpha, beta range for dynamic test

589be46

Moved includes to header files

c60462a

Added missed semicolon

5a520d9

include cutensor.h instead of cutensor/types.h to inject cuda_runtime.h

c8a2d36

Corrected paths for the dynamically loaded libs

4917a73

Removed accidental character

03a03fd

Removed cuda from languages

923e2b1

Merge branch 'cutensor_bindings' of github.com:TAPPorg/reference-implementation into cutensor_bindings

447e382

Removed old, unused file

b2ee699

Changed random seed because seed 0 generates cases that doesn't agree with TBLIS

e2f1262

Fixed directories when testing

5eded62

Further directory fix when for tests

53089b9

evaleev reviewed Feb 24, 2026

SpaceyLake wants to merge 195 commits into main from cutensor_bindings

4 participants
