Skip to content

Comments

Cutensor bindings#38

Open
SpaceyLake wants to merge 195 commits intomainfrom
cutensor_bindings
Open

Cutensor bindings#38
SpaceyLake wants to merge 195 commits intomainfrom
cutensor_bindings

Conversation

@SpaceyLake
Copy link
Collaborator

Bindings to cutensor. This adds a handle to create_tensor_info. Setters for the tensor_info are not implemented because of complications. This code also includes a version of the test that loads implementations dynamically and a version of the demo that does the same. It also includes a cutensor-specific demo.
I also removed some deprecated code on this branch.
The code that is run on CUDA doesn't get automatically tested because standard GitHub runners only use CPUs.
The code uses an attribute to allow the use of on-device memory or not.

SpaceyLake and others added 29 commits February 24, 2026 14:48
…e_device_memory and demo-dynamic, test-dynamic
… of modes could create differences in C and D
Copy link
Contributor

@evaleev evaleev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prepared by claude, edited by me

PR #38: Cutensor Bindings — Review

+8,974 / -910 across 41 files | CI: All checks pass

Summary

This PR adds cuTENSOR bindings for the TAPP API, refactors the CMake build system (pushing test/example targets into subdirectories), adds a TAPP_handle parameter to TAPP_create_tensor_info (API-breaking change), renames TAPP_REFERENCE_ENABLE_TBLIS to TAPP_REFERENCE_USE_TBLIS, adds dynamic-loading test infrastructure, and removes some deprecated files.


High-level concerns

  1. API-breaking change to TAPP_create_tensor_info — Adding TAPP_handle as a new parameter changes the public API. The reference implementation (reference_implementation/src/tensor.c) accepts the parameter but ignores it. This is the right design (the handle is needed by cuTENSOR but not by the reference impl), but consider whether this needs a version bump or changelog entry.

  2. Negative strides and negative_str test disabled in demo.c — The negative stride test is commented out in demo.c with a cuTENSOR-specific comment, but demo.c links against tapp::reference, not tapp::cutensor. Disabling it here penalizes the reference implementation's test coverage for a cuTENSOR limitation. Consider keeping it enabled for the reference demo and only disabling it in cuTENSOR-specific tests.

  3. Massive code duplication: test_dynamic.cpp (4,079 lines) — This is essentially a copy-paste of test.cpp with all calls going through a struct imp function-pointer table. Same for demo_dynamic.c vs demo.c. This creates a significant maintenance burden — any future test change must be made in both places. Consider using macros or templates to share the test logic.


Specific issues

Bugs / correctness

  1. product.cpp:952 — Wrong handle cast:

    plan_struct->handle = ((cutensorHandle_t*) handle);
    struct handle* handle_struct = (struct handle*) plan_struct->handle;

    handle is a TAPP_handle (i.e., intptr_t) that actually points to a struct handle. First it's cast to cutensorHandle_t* and stored, then the stored cutensorHandle_t* is cast to struct handle*. This only works by accident because the cutensorHandle_t* libhandle is the first member of struct handle. This is fragile and incorrect — plan_struct->handle should be typed as struct handle* or at minimum the first cast should be (struct handle*).

  2. attributes.cpp:575memcpy to/from intptr_t as pointer:

    memcpy((void*)handle_struct->attributes[0], value, sizeof(bool));

    attributes[0] is an intptr_t holding a bool*. The cast (void*)handle_struct->attributes[0] is correct, but the design is fragile — the intptr_t* array is a poor man's type-erased attribute store. Consider at minimum documenting the ownership model.

  3. error.cpp:754 — Extracting TAPP field then switching on error instead of tappVal:

    uint64_t tappVal = code & TAPP_FIELD_MASK;
    if (tappVal != 0) {
        switch (error)  // <-- should be switch(tappVal)

    If both TAPP and cuTENSOR errors are packed, error will include the cuTENSOR bits and never match cases 1-15.

  4. cutensor_demo.cpp:2678 — Wrong copy size in conjugate() test:

    cudaMemcpy((void*)D, (void*)D_d, 9 * sizeof(float), cudaMemcpyDeviceToHost);

    D is std::complex<float>[9], so this should be 9 * sizeof(std::complex<float>). Only half the data is copied back.

  5. error.cpp:853 — CUDA error packing clears TAPP+cuTENSOR fields:

    uint64_t cleared_val = val & (~LOW_FIELDS_MASK);
    return static_cast<int>(cleared_val | new_cuda_val);

    This discards any previously packed TAPP/cuTENSOR errors. The other pack_error overloads preserve other fields, but this one doesn't. Inconsistent behavior.

Memory safety

  1. execute_product in product.cpp — Early returns leak GPU memory. Every if (cerr != cudaSuccess) return pack_error(0, cerr) between cudaMallocAsync calls will leak all previously allocated device buffers (A_d, B_d, C_d, D_d, E_d, contraction_work). Consider using RAII or a goto-cleanup pattern.

  2. create_tensor_product in product.cpp — Early returns leak plan_struct and partial state. If any cuTENSOR call fails after new product_plan, the plan_struct and its dynamically allocated members are leaked.

  3. execute_productperm_scalar_ptr uses malloc but is never freed on error path (line ~1216 returns before free(perm_scalar_ptr) if cutensorPermute fails).

Style / quality

  1. Missing newlines at end of file in essentially all new headers and source files under cutensor_bindings/. Most tools and compilers warn about this.

  2. Unreachable break statements after return in switch cases throughout datatype.cpp and product.cpp (translate_operator, translate_datatype, etc.). Harmless but noisy.

  3. VLA usage (int64_t sorted_strides_D[TAPP_get_nmodes(D)] in product.cpp, int64_t section_coordinates_D[...] in execute_product). VLAs are not standard C++ and are a compiler extension. Consider using std::vector or new[].

  4. Magic number 15 for "invalid key" in attributes.cpp. This should use a named constant or the error enum.

  5. cmake_minimum_required(VERSION 3.17) inside CMakeLists.txt at line 198 — cmake_minimum_required should only be called once at the top of the project. This is a policy change mid-file. Use if(CMAKE_VERSION VERSION_LESS 3.17) / message(FATAL_ERROR ...) instead, or bump the top-level requirement.

  6. cutensor_bindings/CMakeLists.txt:338-341target_link_libraries(cutensor::cutensor INTERFACE CUDA::cudart) modifies an IMPORTED target's link interface. This is a surprising side effect — it means anyone finding cuTENSOR through this build gets CUDA::cudart added transitively, even if they didn't want it. Consider linking CUDA::cudart to tapp-cutensor directly instead (which is already done on line 370).

CMake

  1. examples/CMakeLists.txt:1565tapp-reference-exercise_tucker_answers links against tapp-reference (old target name) instead of tapp::reference. Inconsistent with the rest of the migration.

  2. test/CMakeLists.txt — The dynamic test/demo targets are only built when TAPP_CUTENSOR is enabled, but they dlopen shared libraries at runtime and don't actually depend on cuTENSOR at compile time. Could they be useful without cuTENSOR too (e.g., testing two reference implementations)?

Test infrastructure

  1. test_dynamic.hpathA and pathB are hardcoded as "./libtapp-reference.so" and "./libtapp-cutensor.so". This won't work on macOS (.dylib) or if the build output is in a different directory. These should be configurable, e.g., via CMake configure_file or command-line arguments.

  2. test_dynamic.cpp line 7257 — Syntax error in commented-out code: str(test_mixed_strides(impA, impB) has mismatched parens.


Minor / positive notes

  • The CMake refactoring (pushing test/example targets into subdirectories) is a good cleanup
  • TAPP_REFERENCE_ENABLE_TBLIS -> TAPP_REFERENCE_USE_TBLIS rename is more descriptive
  • The printf("%s", message_buff) fix (from printf(message_buff)) is a correct format-string vulnerability fix
  • reduce_isolated_indices rename from contract_unique_idx is clearer
  • The conditional cleanup fix in run_tblis_mult (checking tblis_A_reduced != &tblis_A before freeing) fixes a real bug
  • The rand() change from -max() to min() avoids UB with signed overflow

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants