Problem Description
The custom memory management system (HPCG_MEMMGMT) was broken by commit abb79b3 which added dual-target CUDA and HIP support. When HPCG_MEMMGMT is enabled, the program crashes with "invalid device pointer" errors.
Root Cause
The refactoring replaced all deviceMalloc() calls (custom memory allocator) with gpuMalloc() calls (direct HIP/CUDA API). This creates a fundamental mismatch:
- Memory is allocated via gpuMalloc() (which calls hipMalloc/cudaMalloc directly)
- The custom memory manager doesn't track these allocations
- Later code tries to use deviceRealloc() and deviceDefrag() on this untracked memory
- The allocator fails because it doesn't recognize these pointers → "invalid device pointer" error
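The failure mode can be reproduced in miniature without a GPU. The sketch below is a host-only model, not the code in src/Memory.cu: the class TrackingAllocator, its method names, and the use of std::malloc in place of hipMalloc/cudaMalloc are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <unordered_map>

// Hypothetical stand-in for the custom allocator: it only knows about
// pointers that it handed out itself.
class TrackingAllocator {
public:
    ~TrackingAllocator() {
        for (auto& entry : sizes_) std::free(entry.first);
    }

    // Analogue of deviceMalloc(): allocate and record the block.
    void* allocate(std::size_t bytes) {
        void* p = std::malloc(bytes);
        if (p != nullptr) sizes_[p] = bytes;
        return p;
    }

    // Analogue of deviceRealloc(): refuses pointers it never recorded,
    // which corresponds to the "invalid device pointer" case.
    bool reallocate(void** p, std::size_t bytes) {
        assert(p != nullptr);
        auto it = sizes_.find(*p);
        if (it == sizes_.end()) return false;  // untracked pointer

        void* fresh = std::malloc(bytes);
        if (fresh == nullptr) return false;    // old block stays valid and tracked

        void* old = *p;
        const std::size_t oldBytes = it->second;
        sizes_.erase(it);
        std::memcpy(fresh, old, std::min(oldBytes, bytes));
        std::free(old);

        sizes_[fresh] = bytes;
        *p = fresh;
        return true;
    }

    bool tracks(void* p) const { return sizes_.count(p) != 0; }

private:
    std::unordered_map<void*, std::size_t> sizes_;
};
```

A pointer obtained from allocate() can be resized, while a pointer obtained from a plain std::malloc (the analogue of gpuMalloc here) is rejected by reallocate(). That is the same mismatch the list above describes.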
Examples of Problematic Changes
Before (working):
// src/GenerateProblem.cpp
HIP_CHECK(deviceMalloc((void**)&A.d_mtxIndG, ...));
HIP_CHECK(deviceMalloc((void**)&A.d_matrixValues, ...));
After (broken):
// src/GenerateProblem.cu
GPU_CHECK(gpuMalloc((void**)&A.d_mtxIndG, ...));
GPU_CHECK(gpuMalloc((void**)&A.d_matrixValues, ...));
Affected Files
- src/GenerateProblem.cu - main matrix allocation
- src/MultiColoring.cu - permutation array allocation
- src/SetupHalo.cu - halo exchange buffers
- Any other files that previously used deviceMalloc
Why Custom Memory Management Exists
The custom allocator (gpuAllocator_t in src/Memory.cu) provides several optimizations:
- Pre-allocation: Allocates one large buffer upfront (minimum 128MB) instead of many small allocations
- Reduced fragmentation: Manages memory within the large buffer with 2MB alignment
- Defragmentation: deviceDefrag() compacts data to improve memory locality and cache performance
- Memory tracking: Precise control and monitoring of GPU memory usage
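To make the pre-allocation and alignment points concrete, here is a minimal host-only sketch of carving aligned blocks out of one large buffer. Only the 128 MB minimum and the 2 MB alignment come from the description above. The bump-style strategy, the PoolSketch type, and alignUp are assumptions for illustration and may not match how gpuAllocator_t works internally.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kAlignment   = 2u * 1024u * 1024u;    // 2 MB block alignment
constexpr std::size_t kMinPoolSize = 128u * 1024u * 1024u;  // 128 MB minimum pool

// Round n up to the next multiple of a power-of-two alignment.
constexpr std::size_t alignUp(std::size_t n, std::size_t alignment) {
    return (n + alignment - 1) & ~(alignment - 1);
}

// Hypothetical pool that hands out offsets into one pre-allocated buffer.
struct PoolSketch {
    std::size_t capacity;
    std::size_t used = 0;

    explicit PoolSketch(std::size_t requested)
        : capacity(alignUp(requested < kMinPoolSize ? kMinPoolSize : requested,
                           kAlignment)) {}

    // Reserves an aligned block; writes its offset and returns true on success.
    bool allocate(std::size_t bytes, std::size_t* offset) {
        assert(offset != nullptr);
        const std::size_t size = alignUp(bytes, kAlignment);
        if (size > capacity - used) return false;  // pool exhausted
        *offset = used;
        used += size;
        return true;
    }
};
```

With this layout every sub-allocation is an offset bump inside an existing buffer, so no further hipMalloc/cudaMalloc call is needed after the pool itself is created.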
Workaround
Currently disabled HPCG_MEMMGMT in CMake (commit b81cf27):
cmake . -DOPT_MEMMGMT=OFF
This allows the program to run but loses the performance benefits of the custom allocator.
Proposed Solutions
Option 1: Conditional Allocation Macro (Recommended)
Create a macro that switches between gpuMalloc and deviceMalloc based on HPCG_MEMMGMT:
// In gpu_compat.hpp or utils.hpp
#ifdef HPCG_MEMMGMT
#define GPU_MANAGED_MALLOC(ptr, size) deviceMalloc(ptr, size)
#define GPU_MANAGED_FREE(ptr) deviceFree(ptr)
#else
#define GPU_MANAGED_MALLOC(ptr, size) gpuMalloc(ptr, size)
#define GPU_MANAGED_FREE(ptr) gpuFree(ptr)
#endif
Then replace gpuMalloc with GPU_MANAGED_MALLOC in all files that need memory management support.
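A call site using the Option 1 macros might look like the sketch below. It is host-only so it can be compiled without a GPU toolchain: the four allocation functions are stubs backed by std::malloc, the int return convention (0 for success) is assumed, and allocateIndexArray and the two counters are hypothetical names that exist only to show which path was taken.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Counters that record which allocation path a call site went through.
int g_managedCalls = 0;
int g_directCalls = 0;

// Host-only stubs standing in for the real functions (assumed signatures).
inline int deviceMalloc(void** p, std::size_t n) {
    ++g_managedCalls;
    *p = std::malloc(n);
    return *p != nullptr ? 0 : 1;
}
inline int deviceFree(void* p) { std::free(p); return 0; }
inline int gpuMalloc(void** p, std::size_t n) {
    ++g_directCalls;
    *p = std::malloc(n);
    return *p != nullptr ? 0 : 1;
}
inline int gpuFree(void* p) { std::free(p); return 0; }

// Normally set by the build system; defined here to select the managed path.
#define HPCG_MEMMGMT

#ifdef HPCG_MEMMGMT
#define GPU_MANAGED_MALLOC(ptr, size) deviceMalloc(ptr, size)
#define GPU_MANAGED_FREE(ptr) deviceFree(ptr)
#else
#define GPU_MANAGED_MALLOC(ptr, size) gpuMalloc(ptr, size)
#define GPU_MANAGED_FREE(ptr) gpuFree(ptr)
#endif

// Hypothetical call site: the same source line works in both configurations.
inline int allocateIndexArray(int** out, std::size_t count) {
    assert(out != nullptr);
    return GPU_MANAGED_MALLOC(reinterpret_cast<void**>(out), count * sizeof(int));
}
```

Because allocation and release go through the same pair of macros, a buffer is always freed by the allocator that created it, whichever configuration is built.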
Option 2: Revert to deviceMalloc
Replace gpuMalloc back to deviceMalloc in the affected files:
- src/GenerateProblem.cu
- src/MultiColoring.cu
- src/SetupHalo.cu
- Others as identified
This is simpler but less flexible if direct GPU API calls are needed elsewhere.
Option 3: Fix deviceRealloc/deviceDefrag Usage
Remove or conditionally compile out the deviceRealloc and deviceDefrag calls in src/SparseMatrix.cu and src/OptimizeProblem.cu when memory wasn't allocated through the custom allocator. This is the current workaround but loses optimization benefits.
Option 4: Comprehensive Refactoring
Fully integrate the custom allocator with the dual-target support:
- Ensure ALL GPU allocations use deviceMalloc when HPCG_MEMMGMT is enabled
- Update the allocator to handle both CUDA and HIP backends
- Add proper error handling for mixed allocation scenarios
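One possible shape for the mixed-allocation check is an address-range test against the pre-allocated pool, so a foreign pointer yields an explicit status rather than a crash. This is a sketch under assumptions: PoolRange, ReallocStatus, and checkedRealloc are hypothetical names, and the real allocator may track ownership differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class ReallocStatus { Ok, ForeignPointer };

// Hypothetical description of the pool's address range.
struct PoolRange {
    std::uintptr_t base;
    std::size_t capacity;

    // True if p points inside [base, base + capacity).
    bool owns(const void* p) const {
        const auto addr = reinterpret_cast<std::uintptr_t>(p);
        return addr >= base && addr - base < capacity;
    }
};

// Guard to run before any resize or defragmentation step.
inline ReallocStatus checkedRealloc(const PoolRange& pool, const void* p) {
    assert(pool.capacity > 0);
    if (!pool.owns(p)) return ReallocStatus::ForeignPointer;
    // The actual resize inside the pool would happen here.
    return ReallocStatus::Ok;
}
```

A caller can then decide whether a ForeignPointer result is a hard error or a cue to fall back to the direct GPU API.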
Testing
To verify the fix works:
cmake . -DOPT_MEMMGMT=ON
make clean && make -j
./bin/rochpcg --nx=16 --ny=16
Should run without "invalid device pointer" errors.
Performance Impact
The custom memory manager is not just a convenience - it's a performance optimization. Disabling it may impact:
- Memory allocation overhead (many small mallocs vs one large)
- Memory fragmentation
- Cache locality (defragmentation improves this)
- Overall benchmark performance
References
- Broken by: abb79b3
- Workaround: b81cf27 (disabled HPCG_MEMMGMT)
- Custom allocator implementation: src/Memory.cu, src/Memory.hpp