Conversation
…e_device_memory and demo-dynamic, test-dynamic
… of modes could create differences in C and D
… that doesn't work for TBLIS right now
…ementation into cutensor_bindings
There was a problem hiding this comment.
prepared by claude, edited by me
PR #38: Cutensor Bindings — Review
+8,974 / -910 across 41 files | CI: All checks pass
Summary
This PR adds cuTENSOR bindings for the TAPP API, refactors the CMake build system (pushing test/example targets into subdirectories), adds a TAPP_handle parameter to TAPP_create_tensor_info (API-breaking change), renames TAPP_REFERENCE_ENABLE_TBLIS to TAPP_REFERENCE_USE_TBLIS, adds dynamic-loading test infrastructure, and removes some deprecated files.
High-level concerns
-
API-breaking change to
TAPP_create_tensor_info— AddingTAPP_handleas a new parameter changes the public API. The reference implementation (reference_implementation/src/tensor.c) accepts the parameter but ignores it. This is the right design (the handle is needed by cuTENSOR but not by the reference impl), but consider whether this needs a version bump or changelog entry. -
Negative strides and
negative_strtest disabled indemo.c— The negative stride test is commented out indemo.cwith a cuTENSOR-specific comment, butdemo.clinks againsttapp::reference, nottapp::cutensor. Disabling it here penalizes the reference implementation's test coverage for a cuTENSOR limitation. Consider keeping it enabled for the reference demo and only disabling it in cuTENSOR-specific tests. -
Massive code duplication:
test_dynamic.cpp(4,079 lines) — This is essentially a copy-paste oftest.cppwith all calls going through astruct impfunction-pointer table. Same fordemo_dynamic.cvsdemo.c. This creates a significant maintenance burden — any future test change must be made in both places. Consider using macros or templates to share the test logic.
Specific issues
Bugs / correctness
-
product.cpp:952— Wrong handle cast:plan_struct->handle = ((cutensorHandle_t*) handle); struct handle* handle_struct = (struct handle*) plan_struct->handle;
handleis aTAPP_handle(i.e.,intptr_t) that actually points to astruct handle. First it's cast tocutensorHandle_t*and stored, then the storedcutensorHandle_t*is cast tostruct handle*. This only works by accident because thecutensorHandle_t* libhandleis the first member ofstruct handle. This is fragile and incorrect —plan_struct->handleshould be typed asstruct handle*or at minimum the first cast should be(struct handle*). -
attributes.cpp:575—memcpyto/fromintptr_tas pointer:memcpy((void*)handle_struct->attributes[0], value, sizeof(bool));
attributes[0]is anintptr_tholding abool*. The cast(void*)handle_struct->attributes[0]is correct, but the design is fragile — theintptr_t*array is a poor man's type-erased attribute store. Consider at minimum documenting the ownership model. -
error.cpp:754— Extracting TAPP field then switching onerrorinstead oftappVal:uint64_t tappVal = code & TAPP_FIELD_MASK; if (tappVal != 0) { switch (error) // <-- should be switch(tappVal)
If both TAPP and cuTENSOR errors are packed,
errorwill include the cuTENSOR bits and never match cases 1-15. -
cutensor_demo.cpp:2678— Wrong copy size inconjugate()test:cudaMemcpy((void*)D, (void*)D_d, 9 * sizeof(float), cudaMemcpyDeviceToHost);
Disstd::complex<float>[9], so this should be9 * sizeof(std::complex<float>). Only half the data is copied back. -
error.cpp:853— CUDA error packing clears TAPP+cuTENSOR fields:uint64_t cleared_val = val & (~LOW_FIELDS_MASK); return static_cast<int>(cleared_val | new_cuda_val);
This discards any previously packed TAPP/cuTENSOR errors. The other
pack_erroroverloads preserve other fields, but this one doesn't. Inconsistent behavior.
Memory safety
-
execute_productinproduct.cpp— Early returns leak GPU memory. Everyif (cerr != cudaSuccess) return pack_error(0, cerr)betweencudaMallocAsynccalls will leak all previously allocated device buffers (A_d,B_d,C_d,D_d,E_d,contraction_work). Consider using RAII or a goto-cleanup pattern. -
create_tensor_productinproduct.cpp— Early returns leakplan_structand partial state. If any cuTENSOR call fails afternew product_plan, theplan_structand its dynamically allocated members are leaked. -
execute_product—perm_scalar_ptrusesmallocbut is never freed on error path (line ~1216 returns beforefree(perm_scalar_ptr)ifcutensorPermutefails).
Style / quality
-
Missing newlines at end of file in essentially all new headers and source files under
cutensor_bindings/. Most tools and compilers warn about this. -
Unreachable
breakstatements afterreturnin switch cases throughoutdatatype.cppandproduct.cpp(translate_operator,translate_datatype, etc.). Harmless but noisy. -
VLA usage (
int64_t sorted_strides_D[TAPP_get_nmodes(D)]inproduct.cpp,int64_t section_coordinates_D[...]inexecute_product). VLAs are not standard C++ and are a compiler extension. Consider usingstd::vectorornew[]. -
Magic number
15for "invalid key" inattributes.cpp. This should use a named constant or the error enum. -
cmake_minimum_required(VERSION 3.17)insideCMakeLists.txtat line 198 —cmake_minimum_requiredshould only be called once at the top of the project. This is a policy change mid-file. Useif(CMAKE_VERSION VERSION_LESS 3.17)/message(FATAL_ERROR ...)instead, or bump the top-level requirement. -
cutensor_bindings/CMakeLists.txt:338-341—target_link_libraries(cutensor::cutensor INTERFACE CUDA::cudart)modifies an IMPORTED target's link interface. This is a surprising side effect — it means anyone finding cuTENSOR through this build gets CUDA::cudart added transitively, even if they didn't want it. Consider linking CUDA::cudart totapp-cutensordirectly instead (which is already done on line 370).
CMake
-
examples/CMakeLists.txt:1565—tapp-reference-exercise_tucker_answerslinks againsttapp-reference(old target name) instead oftapp::reference. Inconsistent with the rest of the migration. -
test/CMakeLists.txt— The dynamic test/demo targets are only built whenTAPP_CUTENSORis enabled, but theydlopenshared libraries at runtime and don't actually depend on cuTENSOR at compile time. Could they be useful without cuTENSOR too (e.g., testing two reference implementations)?
Test infrastructure
-
test_dynamic.h—pathAandpathBare hardcoded as"./libtapp-reference.so"and"./libtapp-cutensor.so". This won't work on macOS (.dylib) or if the build output is in a different directory. These should be configurable, e.g., via CMakeconfigure_fileor command-line arguments. -
test_dynamic.cppline 7257 — Syntax error in commented-out code:str(test_mixed_strides(impA, impB)has mismatched parens.
Minor / positive notes
- The CMake refactoring (pushing test/example targets into subdirectories) is a good cleanup
TAPP_REFERENCE_ENABLE_TBLIS->TAPP_REFERENCE_USE_TBLISrename is more descriptive- The
printf("%s", message_buff)fix (fromprintf(message_buff)) is a correct format-string vulnerability fix reduce_isolated_indicesrename fromcontract_unique_idxis clearer- The conditional cleanup fix in
run_tblis_mult(checkingtblis_A_reduced != &tblis_Abefore freeing) fixes a real bug - The
rand()change from-max()tomin()avoids UB with signed overflow
Bindings to cutensor. This adds a handle to create_tensor_info. Setters for the tensor_info are not implemented because of complications. This code also includes a version of the test that loads implementations dynamically and a version of the demo that does the same. It also includes a cutensor-specific demo.
I also removed some deprecated code on this branch.
The code that is run on CUDA doesn't get automatically tested because standard GitHub runners only use CPUs.
The code uses an attribute to allow the use of on-device memory or not.