HPCG_MEMMGMT broken by dual-target CUDA/HIP refactoring #1

@dmitry-mikushin

Problem Description

The custom memory management system (HPCG_MEMMGMT) was broken by commit abb79b3, which added dual-target CUDA and HIP support. With HPCG_MEMMGMT enabled, the program crashes with "invalid device pointer" errors.

Root Cause

The refactoring replaced all deviceMalloc() calls (custom memory allocator) with gpuMalloc() calls (direct HIP/CUDA API). This creates a fundamental mismatch:

  1. Memory is allocated via gpuMalloc() (which calls hipMalloc/cudaMalloc directly)
  2. The custom memory manager doesn't track these allocations
  3. Later code tries to use deviceRealloc() and deviceDefrag() on this untracked memory
  4. The allocator fails because it doesn't recognize these pointers → "invalid device pointer" error
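
The four steps above can be reproduced in miniature on the host. The following is a minimal sketch, not the real gpuAllocator_t: TrackingAllocator, allocTracked, resize, and release are hypothetical names, and std::malloc stands in for both the tracked and the direct allocation paths. It only illustrates why a pointer the allocator never recorded is rejected later.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <unordered_map>

// Hypothetical stand-in for a tracking allocator (not the real gpuAllocator_t).
class TrackingAllocator {
public:
    // Tracked path: allocate and record the pointer, as a deviceMalloc-style
    // call would.
    void* allocTracked(std::size_t bytes) {
        void* p = std::malloc(bytes);
        if (p != nullptr) sizes_[p] = bytes;
        return p;
    }

    // Resize a tracked block. Returns false for a pointer this allocator never
    // handed out, which is the analogue of "invalid device pointer".
    bool resize(void** p, std::size_t newBytes) {
        assert(p != nullptr);
        auto it = sizes_.find(*p);
        if (it == sizes_.end()) return false;  // untracked: refuse to touch it
        void* q = std::realloc(*p, newBytes);
        if (q == nullptr) return false;
        sizes_.erase(it);
        sizes_[q] = newBytes;
        *p = q;
        return true;
    }

    // Free a tracked block; untracked pointers are ignored.
    void release(void* p) {
        if (sizes_.erase(p) != 0) std::free(p);
    }

private:
    std::unordered_map<const void*, std::size_t> sizes_;
};
```

A block obtained from allocTracked can be resized, while a block obtained directly from std::malloc (the analogue of a raw gpuMalloc call) makes resize return false.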

Examples of Problematic Changes

Before (working):

// src/GenerateProblem.cpp
HIP_CHECK(deviceMalloc((void**)&A.d_mtxIndG, ...));
HIP_CHECK(deviceMalloc((void**)&A.d_matrixValues, ...));

After (broken):

// src/GenerateProblem.cu
GPU_CHECK(gpuMalloc((void**)&A.d_mtxIndG, ...));
GPU_CHECK(gpuMalloc((void**)&A.d_matrixValues, ...));

Affected Files

  • src/GenerateProblem.cu - main matrix allocation
  • src/MultiColoring.cu - permutation array allocation
  • src/SetupHalo.cu - halo exchange buffers
  • Any other files that previously used deviceMalloc

Why Custom Memory Management Exists

The custom allocator (gpuAllocator_t in src/Memory.cu) provides several optimizations:

  1. Pre-allocation: Allocates one large buffer upfront (minimum 128MB) instead of many small allocations
  2. Reduced fragmentation: Manages memory within the large buffer with 2MB alignment
  3. Defragmentation: deviceDefrag() compacts data to improve memory locality and cache performance
  4. Memory tracking: Precise control and monitoring of GPU memory usage
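
The size arithmetic behind points 1 and 2 can be sketched as follows. This is an illustration only: the 128 MB minimum and the 2 MB alignment come from the list above, but the names (kAlign, kMinPool, alignUp, PoolOffsets) are hypothetical, and the actual placement policy in src/Memory.cu is not shown in this issue and may differ from a simple bump allocator.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

constexpr std::size_t kAlign   = std::size_t{2}   << 20;  // 2 MB alignment
constexpr std::size_t kMinPool = std::size_t{128} << 20;  // 128 MB minimum pool

// Round n up to the next multiple of a.
constexpr std::size_t alignUp(std::size_t n, std::size_t a = kAlign) {
    return (n + a - 1) / a * a;
}

// Hypothetical bump sub-allocator that hands out offsets into one large
// pre-allocated buffer instead of calling the GPU runtime per allocation.
struct PoolOffsets {
    std::size_t capacity;   // total pool size in bytes
    std::size_t used = 0;   // bytes handed out so far (always aligned)

    explicit PoolOffsets(std::size_t requested)
        : capacity(std::max(kMinPool, alignUp(requested))) {}

    // Reserve an aligned range; returns false when the pool is exhausted.
    bool reserve(std::size_t bytes, std::size_t* offset) {
        assert(offset != nullptr);
        const std::size_t need = alignUp(bytes);
        if (need > capacity - used) return false;
        *offset = used;
        used += need;
        return true;
    }
};
```

Under these assumptions, two 100-byte requests land at offsets 0 and 2 MB, and a pool requested with any size below 128 MB is still created at 128 MB.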

Workaround

HPCG_MEMMGMT is currently disabled in CMake (commit b81cf27):

cmake . -DOPT_MEMMGMT=OFF

This allows the program to run but loses the performance benefits of the custom allocator.

Proposed Solutions

Option 1: Conditional Allocation Macro (Recommended)

Create a macro that switches between gpuMalloc and deviceMalloc based on HPCG_MEMMGMT:

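// In gpu_compat.hpp or utils.hpp
// Route allocations through the custom allocator only when it is enabled.
// Macro arguments are parenthesized so that expressions expand safely.
#ifdef HPCG_MEMMGMT
#define GPU_MANAGED_MALLOC(ptr, size) deviceMalloc((ptr), (size))
#define GPU_MANAGED_FREE(ptr) deviceFree((ptr))
#else
#define GPU_MANAGED_MALLOC(ptr, size) gpuMalloc((ptr), (size))
#define GPU_MANAGED_FREE(ptr) gpuFree((ptr))
#endif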

Then replace gpuMalloc with GPU_MANAGED_MALLOC in all files that need memory management support.
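
A call site using these macros might look like the sketch below. It is host-only so that it builds without a GPU toolchain: the four allocation functions are stand-ins backed by std::malloc, their int return type is an assumption (the real deviceMalloc and gpuMalloc may return a HIP/CUDA error type), and allocateValues and g_lastPath are hypothetical names used only to make the selected path observable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Records which path the last allocation took (illustration only).
enum class AllocPath { None, Managed, Direct };
static AllocPath g_lastPath = AllocPath::None;

// Host-only stand-ins; signatures and return types are assumptions.
inline int deviceMalloc(void** ptr, std::size_t size) {
    g_lastPath = AllocPath::Managed;
    *ptr = std::malloc(size);
    return *ptr != nullptr ? 0 : 1;
}
inline int deviceFree(void* ptr) { std::free(ptr); return 0; }
inline int gpuMalloc(void** ptr, std::size_t size) {
    g_lastPath = AllocPath::Direct;
    *ptr = std::malloc(size);
    return *ptr != nullptr ? 0 : 1;
}
inline int gpuFree(void* ptr) { std::free(ptr); return 0; }

#define HPCG_MEMMGMT 1  // normally supplied by the build system

#ifdef HPCG_MEMMGMT
#define GPU_MANAGED_MALLOC(ptr, size) deviceMalloc((ptr), (size))
#define GPU_MANAGED_FREE(ptr) deviceFree((ptr))
#else
#define GPU_MANAGED_MALLOC(ptr, size) gpuMalloc((ptr), (size))
#define GPU_MANAGED_FREE(ptr) gpuFree((ptr))
#endif

// Hypothetical call site: allocate an array that later code may hand to
// deviceRealloc/deviceDefrag, so it must go through the managed path.
inline double* allocateValues(std::size_t count) {
    assert(count > 0);
    double* values = nullptr;
    if (GPU_MANAGED_MALLOC((void**)&values, count * sizeof(double)) != 0) {
        return nullptr;
    }
    return values;
}
```

With HPCG_MEMMGMT defined, allocateValues goes through deviceMalloc; removing the define switches the same call site to gpuMalloc without touching the caller.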

Option 2: Revert to deviceMalloc

Revert the gpuMalloc calls to deviceMalloc in the affected files:

  • src/GenerateProblem.cu
  • src/MultiColoring.cu
  • src/SetupHalo.cu
  • Others as identified

This is simpler but less flexible if direct GPU API calls are needed elsewhere.

Option 3: Fix deviceRealloc/deviceDefrag Usage

Remove or conditionally compile out the deviceRealloc and deviceDefrag calls in src/SparseMatrix.cu and src/OptimizeProblem.cu when memory wasn't allocated through the custom allocator. This is the current workaround but loses optimization benefits.

Option 4: Comprehensive Refactoring

Fully integrate the custom allocator with the dual-target support:

  • Ensure ALL GPU allocations use deviceMalloc when HPCG_MEMMGMT is enabled
  • Update the allocator to handle both CUDA and HIP backends
  • Add proper error handling for mixed allocation scenarios
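
One possible shape for the mixed-allocation check in the last bullet is an address-range test against the pre-allocated pool, run before any realloc or defrag operation. This is a sketch under assumptions: PoolRange, ReallocStatus, and checkBeforeRealloc are hypothetical names, and it presumes the allocator serves everything from a single contiguous buffer, which this issue does not confirm.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical description of the allocator's single contiguous pool.
struct PoolRange {
    const void* base;   // start of the pre-allocated buffer
    std::size_t bytes;  // size of the buffer

    // True when p points inside [base, base + bytes).
    bool owns(const void* p) const {
        const auto b = reinterpret_cast<std::uintptr_t>(base);
        const auto a = reinterpret_cast<std::uintptr_t>(p);
        return a >= b && a - b < bytes;
    }
};

enum class ReallocStatus { Ok, ForeignPointer };

// Report a foreign pointer explicitly instead of failing later with a
// generic "invalid device pointer" error.
inline ReallocStatus checkBeforeRealloc(const PoolRange& pool, const void* p) {
    assert(pool.base != nullptr);
    return pool.owns(p) ? ReallocStatus::Ok : ReallocStatus::ForeignPointer;
}
```

A caller can then decide whether to fall back to a plain copy or to abort with a message naming the offending allocation site.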

Testing

To verify the fix works:

cmake . -DOPT_MEMMGMT=ON
make clean && make -j
./bin/rochpcg --nx=16 --ny=16

The benchmark should run without "invalid device pointer" errors.

Performance Impact

The custom memory manager is not just a convenience - it's a performance optimization. Disabling it may impact:

  • Memory allocation overhead (many small mallocs vs one large)
  • Memory fragmentation
  • Cache locality (defragmentation improves this)
  • Overall benchmark performance

References

  • Broken by: abb79b3
  • Workaround: b81cf27 (disabled HPCG_MEMMGMT)
  • Custom allocator implementation: src/Memory.cu, src/Memory.hpp
