HPCG_MEMMGMT broken by dual-target CUDA/HIP refactoring

## Problem Description

The custom memory management system (`HPCG_MEMMGMT`) was broken by commit abb79b3c1531e7a1b293c294d5595454fa5e2573, which added dual-target CUDA and HIP support. When `HPCG_MEMMGMT` is enabled, the program crashes with "invalid device pointer" errors.

## Root Cause

The refactoring replaced all `deviceMalloc()` calls (custom memory allocator) with `gpuMalloc()` calls (direct HIP/CUDA API). This creates a fundamental mismatch:

1. Memory is allocated via `gpuMalloc()` (which calls `hipMalloc`/`cudaMalloc` directly)
2. The custom memory manager doesn't track these allocations
3. Later code tries to use `deviceRealloc()` and `deviceDefrag()` on this untracked memory
4. The allocator fails because it doesn't recognize these pointers → "invalid device pointer" error
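The failure mode in the steps above can be reproduced with a host-only toy. This is a minimal sketch, not the code in `src/Memory.cu`: `ToyTrackedAllocator` and its methods are hypothetical names, and `std::malloc`/`std::realloc` stand in for the GPU API so that the example runs without a device.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <unordered_map>

// Hypothetical stand-in for the custom allocator: it only recognizes
// pointers that it handed out itself.
class ToyTrackedAllocator {
public:
    // Tracked allocation (plays the role of deviceMalloc).
    void* allocate(std::size_t bytes) {
        assert(bytes > 0);
        void* p = std::malloc(bytes);
        if (p != nullptr) sizes_[p] = bytes;
        return p;
    }

    // Plays the role of deviceRealloc. Returns false for a pointer the
    // allocator never saw, which is the "invalid device pointer" case.
    bool resize(void*& p, std::size_t bytes) {
        assert(bytes > 0);
        auto it = sizes_.find(p);
        if (it == sizes_.end()) return false;  // untracked pointer
        void* q = std::realloc(p, bytes);
        if (q == nullptr) return false;        // out of memory, p still valid
        sizes_.erase(it);
        sizes_[q] = bytes;
        p = q;
        return true;
    }

    // Frees only memory that this allocator owns.
    void release(void* p) {
        if (sizes_.erase(p) > 0) std::free(p);
    }

private:
    std::unordered_map<void*, std::size_t> sizes_;
};
```

A pointer obtained from `allocate()` can be resized, while a pointer obtained directly from `std::malloc` (the analogue of `gpuMalloc`) is rejected by `resize()`.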

### Examples of Problematic Changes

**Before (working):**
```cpp
// src/GenerateProblem.cpp
HIP_CHECK(deviceMalloc((void**)&A.d_mtxIndG, ...));
HIP_CHECK(deviceMalloc((void**)&A.d_matrixValues, ...));
```

**After (broken):**
```cpp
// src/GenerateProblem.cu
GPU_CHECK(gpuMalloc((void**)&A.d_mtxIndG, ...));
GPU_CHECK(gpuMalloc((void**)&A.d_matrixValues, ...));
```

### Affected Files
- `src/GenerateProblem.cu` - main matrix allocation
- `src/MultiColoring.cu` - permutation array allocation
- `src/SetupHalo.cu` - halo exchange buffers
- Any other files that previously used `deviceMalloc`

## Why Custom Memory Management Exists

The custom allocator (`gpuAllocator_t` in `src/Memory.cu`) provides several optimizations:

1. **Pre-allocation**: Allocates one large buffer upfront (minimum 128MB) instead of many small allocations
2. **Reduced fragmentation**: Manages memory within the large buffer with 2MB alignment
3. **Defragmentation**: `deviceDefrag()` compacts data to improve memory locality and cache performance
4. **Memory tracking**: Precise control and monitoring of GPU memory usage
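The pre-allocation and alignment points can be illustrated with offset arithmetic alone. The sketch below assumes that the 128 MB minimum applies to the pool size and that each sub-allocation is rounded up to a 2 MB boundary; `ToyPool`, `alignUp`, and the constants are hypothetical names, and the real `gpuAllocator_t` may behave differently.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kAlignment = std::size_t{2} << 20;    // 2 MB
constexpr std::size_t kMinPool   = std::size_t{128} << 20;  // 128 MB
constexpr std::size_t kNoSpace   = static_cast<std::size_t>(-1);

// Round n up to the next multiple of a.
constexpr std::size_t alignUp(std::size_t n, std::size_t a) {
    return (n + a - 1) / a * a;
}

// Hypothetical bump-style pool that hands out aligned offsets into one
// large buffer. Only the bookkeeping is modeled; no memory is allocated.
struct ToyPool {
    std::size_t capacity;
    std::size_t used = 0;

    explicit ToyPool(std::size_t requested)
        : capacity(alignUp(requested < kMinPool ? kMinPool : requested,
                           kAlignment)) {}

    // Returns the offset of the reserved span, or kNoSpace if it does not fit.
    std::size_t reserve(std::size_t bytes) {
        assert(bytes > 0);
        const std::size_t span = alignUp(bytes, kAlignment);
        if (span > capacity - used) return kNoSpace;
        const std::size_t offset = used;
        used += span;
        return offset;
    }
};
```

With this bookkeeping, many small requests cost one backing allocation, and every returned offset is a multiple of 2 MB.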

## Workaround

`HPCG_MEMMGMT` is currently disabled in CMake (commit b81cf27):
```bash
cmake . -DOPT_MEMMGMT=OFF
```

This allows the program to run but loses the performance benefits of the custom allocator.

## Proposed Solutions

### Option 1: Conditional Allocation Macro (Recommended)

Create a macro that switches between `gpuMalloc` and `deviceMalloc` based on `HPCG_MEMMGMT`:

```cpp
// In gpu_compat.hpp or utils.hpp
#ifdef HPCG_MEMMGMT
#define GPU_MANAGED_MALLOC(ptr, size) deviceMalloc(ptr, size)
#define GPU_MANAGED_FREE(ptr) deviceFree(ptr)
#else
#define GPU_MANAGED_MALLOC(ptr, size) gpuMalloc(ptr, size)
#define GPU_MANAGED_FREE(ptr) gpuFree(ptr)
#endif
```

Then replace `gpuMalloc` with `GPU_MANAGED_MALLOC` in all files that need memory management support.

### Option 2: Revert to deviceMalloc

Replace `gpuMalloc` back to `deviceMalloc` in the affected files:
- `src/GenerateProblem.cu`
- `src/MultiColoring.cu`
- `src/SetupHalo.cu`
- Others as identified

This is simpler but less flexible if direct GPU API calls are needed elsewhere.

### Option 3: Fix deviceRealloc/deviceDefrag Usage

Remove or conditionally compile out the `deviceRealloc` and `deviceDefrag` calls in `src/SparseMatrix.cu` and `src/OptimizeProblem.cu` when the memory was not allocated through the custom allocator. This is effectively what the current workaround does, and it gives up the allocator's optimization benefits.

### Option 4: Comprehensive Refactoring

Fully integrate the custom allocator with the dual-target support:
- Ensure ALL GPU allocations use `deviceMalloc` when `HPCG_MEMMGMT` is enabled
- Update the allocator to handle both CUDA and HIP backends
- Add proper error handling for mixed allocation scenarios
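One way to detect mixed allocations is to check whether a pointer falls inside the pre-allocated pool before acting on it. The following is a host-only sketch under that assumption; `PoolRange`, `Owner`, and `classify` are hypothetical names that do not exist in the code base.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Who is responsible for a given pointer.
enum class Owner { Pool, Direct };

// Hypothetical description of the allocator's single large buffer.
struct PoolRange {
    const void* base;
    std::size_t bytes;

    // True if p lies inside [base, base + bytes).
    bool contains(const void* p) const {
        const auto b = reinterpret_cast<std::uintptr_t>(base);
        const auto q = reinterpret_cast<std::uintptr_t>(p);
        return q >= b && q - b < bytes;
    }
};

// Decide which API should handle p instead of failing later with an
// opaque "invalid device pointer" error.
inline Owner classify(const PoolRange& pool, const void* p) {
    assert(p != nullptr);
    return pool.contains(p) ? Owner::Pool : Owner::Direct;
}
```

A caller could then send `Owner::Pool` pointers to `deviceRealloc`/`deviceFree`, and either send `Owner::Direct` pointers to `gpuFree` or report a clear error naming the offending allocation.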

## Testing

To verify the fix works:
```bash
cmake . -DOPT_MEMMGMT=ON
make clean && make -j
./bin/rochpcg --nx=16 --ny=16
```

The run should complete without "invalid device pointer" errors.

## Performance Impact

The custom memory manager is a performance optimization, not just a convenience. Disabling it may affect:
- Memory allocation overhead (many small mallocs vs one large)
- Memory fragmentation
- Cache locality (defragmentation improves this)
- Overall benchmark performance

## References

- Broken by: abb79b3c1531e7a1b293c294d5595454fa5e2573
- Workaround: b81cf27 (disabled HPCG_MEMMGMT)
- Custom allocator implementation: `src/Memory.cu`, `src/Memory.hpp`
