
[SYCL][UR][CUDA] Move CUDA device memory pools to the context #17411


Closed

Conversation


@ldorau ldorau commented Mar 12, 2025

Move CUDA device memory pools to the context.

This PR fixes a segfault caused by #17450

Ref: #17450


ldorau commented Mar 12, 2025

@igchor @bratpiorka @lukaszstolarczuk please review


// Create UMF CUDA memory pools for the device memory (UMF_MEMORY_TYPE_DEVICE)
// and the shared memory (UMF_MEMORY_TYPE_SHARED)
ur_result_t createDeviceMemoryPools() {
Member

Is there a reason you can't just store a ur_usm_pool_handle_t_ in the context? The ctor of ur_usm_pool_handle_t_ does pretty much exactly the same thing as this and the createHostMemoryPool function.

Contributor Author

ur_usm_pool_handle_t_::ur_usm_pool_handle_t_ stores user-created Disjoint Pools in the set of pools std::set<ur_usm_pool_handle_t> PoolHandles;. But I want to store the default Proxy Pools in these vectors:

  std::vector<umf_memory_pool_handle_t> MemoryDevicePools;
  std::vector<umf_memory_pool_handle_t> MemorySharedPools;
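For illustration, a minimal sketch (hypothetical helper and accessor names, not the actual PR code) of how the context could hand out one of these default per-device pools for a given device:

  // Hypothetical sketch: the default proxy pools live in the context, one per
  // device, indexed by the device's position in the context's device list.
  umf_memory_pool_handle_t getDefaultDevicePool(ur_context_handle_t Context,
                                                ur_device_handle_t Device) {
    for (size_t i = 0; i < Context->getDevices().size(); ++i)
      if (Context->getDevices()[i] == Device)
        return Context->MemoryDevicePools[i];
    return nullptr; // Device does not belong to this context.
  }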

Member

Ok, so to me it looks like ur_usm_pool_handle_t_ is implemented incorrectly. Look at how urUSM*Alloc is implemented (https://github.com/intel/llvm/blob/sycl/unified-runtime/source/adapters/cuda/usm.cpp#L71): we are using the DevicePool without even looking at the hDevice param. The ur_usm_pool_handle_t_ should consist of a vector of memory pools for device and shared allocations, just as you have implemented in the context right now.

I know fixing this will require a bit more work, but it would make the code much simpler if we could just store a ur_usm_pool_handle_t_ in the context (not to mention that right now custom memory pools don't seem to work correctly). I won't block this PR, but this should be fixed sooner rather than later.
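Roughly, the suggested direction could look like this minimal sketch (hypothetical helper names, not the actual urUSM*Alloc code), reusing the per-device lookup sketched above and picking the pool based on the hDevice argument:

  // Hypothetical sketch: allocate device USM from the pool that matches the
  // requested device instead of a single shared DevicePool member.
  void *allocFromDevicePool(ur_context_handle_t Context,
                            ur_device_handle_t Device, size_t Size) {
    umf_memory_pool_handle_t Pool = getDefaultDevicePool(Context, Device);
    if (!Pool)
      return nullptr;
    return umfPoolMalloc(Pool, Size); // UMF allocates via the pool's provider.
  }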

Contributor

I'll echo this.
I have a patch locally that creates a pool for each different set of allocation flags (write-combine, non-coherent, etc.), and that gets the provider from the pool where needed using umfPoolGetMemoryProvider.

I'm happy for this to go in as-is for now, though, as I can build on top of this work.
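For reference, a minimal sketch of retrieving the provider from an existing pool with umfPoolGetMemoryProvider (the flag-specific pools themselves are not part of this PR):

  // Sketch: obtain the memory provider that backs a UMF pool, e.g. when a
  // provider-level operation is needed and only the pool handle is available.
  umf_memory_provider_handle_t Provider = nullptr;
  umf_result_t Ret = umfPoolGetMemoryProvider(Pool, &Provider);
  if (Ret != UMF_RESULT_SUCCESS || Provider == nullptr) {
    // Handle the error, e.g. map it to a UR result code.
  }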

umf_memory_provider_handle_t memorySharedProvider = nullptr;
UR_CHECK_ERROR(umf::createDeviceMemoryProviders(
    Device, &memoryDeviceProvider, &memorySharedProvider));

auto UmfDeviceParamsHandle = getUmfParamsHandle(
    DisjointPoolConfigs.Configs[usm::DisjointPoolMemType::Device]);
DeviceMemPool = umf::poolMakeUniqueFromOpsProviderHandle(
Member

Hm, just looking at this code: aren't you overwriting the same member variable (DeviceMemPool) multiple times here?

I think DeviceMemPool should actually be an array, just as you made it in context.cpp (addressed by device id). The same is true for SharedMemPool.

Contributor Author
@ldorau ldorau Mar 13, 2025

This is not my code and I have not changed it; it was done this way before. These two pools are added to the set of pools by Context->addPool(this).

Member

Okay, yeah, I see they are added to the set of pools, but it looks like there is no copy performed here—we're just adding ur_usm_pool_handle_t (that is a pointer to ur_usm_pool_handle_t_) to the std::set, and in the next iterations, we overwrite the values, no? As I mentioned in the previous comment, I think this implementation is incorrect and should be fixed eventually.
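For illustration, a minimal sketch of the pattern in question (hypothetical loop and helper, not the actual constructor code):

  // Sketch of the concern: the same member is reassigned on every iteration,
  // while only the owning object's pointer (this) is inserted into the set,
  // so the previously created pool handles get overwritten.
  for (ur_device_handle_t Device : Context->getDevices()) {
    DeviceMemPool = makePoolForDevice(Device); // overwrites the previous pool
    Context->addPool(this);                    // std::set keeps one pointer
  }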

@ldorau ldorau force-pushed the Move_CUDA_device_memory_pools_to_the_context branch from be92a90 to ee804fa on March 13, 2025 08:21
@ldorau ldorau requested a review from igchor March 13, 2025 08:24
@ldorau ldorau marked this pull request as ready for review March 13, 2025 08:24
@ldorau ldorau requested review from a team as code owners March 13, 2025 08:24
@ldorau ldorau requested a review from ldrumm March 13, 2025 08:24
@ldorau ldorau marked this pull request as draft March 13, 2025 08:26
ldorau added 2 commits March 13, 2025 10:31
From now on disjoint_pool is part of libumf, instead of being
a separate library.

Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>
Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>
@ldorau ldorau force-pushed the Move_CUDA_device_memory_pools_to_the_context branch from ee804fa to 591d28b on March 13, 2025 09:33
@ldorau ldorau marked this pull request as ready for review March 13, 2025 09:34
Member
@igchor igchor left a comment

I'm approving the PR, but IMHO the current implementation (not touched by this PR) of user-defined pools is incorrect. Once it's fixed, the code could be simplified even further.


ldorau commented Mar 13, 2025

@kbenzie @npmiller @frasercrmck @omarahmed1111 please review

@npmiller
Contributor

Is there a reason to move it over? Given how CUDA works, I think the device is a better place for the device memory pools.


ldorau commented Mar 13, 2025

> Is there a reason to move it over? Given how CUDA works, I think the device is a better place for the device memory pools.

Yes, there is a reason to move it over. There is an issue in the SYCL/UR/CUDA adapter (which occurs in the sycl/test-e2e/Regression/static-buffer-dtor.cpp test): the destructor of the CUDA device is called before the destructor of a static buffer defined in a user app (this test). The destructor of the CUDA device destroys all CUDA memory pools, and later the destructor of the user static buffer tries to free CUDA memory from an already destroyed memory pool, which causes a segfault. This PR moves the destruction of the CUDA memory pools to the context, so it happens after the CUDA memory of the user static buffer has been freed.

I will submit an issue for this soon, once I have collected all the traces.


ldorau commented Mar 13, 2025

> Is there a reason to move it over? Given how CUDA works, I think the device is a better place for the device memory pools.

@npmiller This PR fixes a segfault caused by #17450


ldorau commented Mar 13, 2025

@npmiller If it were not for issue #17450, which causes a segfault in the sycl/test-e2e/Regression/static-buffer-dtor.cpp test, this change would not be needed.

@npmiller
Contributor

> Yes, there is a reason to move it over. There is an issue in the SYCL/UR/CUDA adapter (which occurs in the sycl/test-e2e/Regression/static-buffer-dtor.cpp test): the destructor of the CUDA device is called before the destructor of a static buffer defined in a user app (this test). The destructor of the CUDA device destroys all CUDA memory pools, and later the destructor of the user static buffer tries to free CUDA memory from an already destroyed memory pool, which causes a segfault. This PR moves the destruction of the CUDA memory pools to the context, so it happens after the CUDA memory of the user static buffer has been freed.

Oh I see, that's a good catch!

But I'm still a little confused: do you know why the context destructor doesn't also trigger this issue? I would expect the device destructor to always be called after the context destructor, since a context contains devices.


ldorau commented Mar 13, 2025

> Oh I see, that's a good catch!
>
> But I'm still a little confused: do you know why the context destructor doesn't also trigger this issue? I would expect the device destructor to always be called after the context destructor, since a context contains devices.

Yes, the device destructor should always be called after the context destructor, since a context contains devices, but right now it is NOT, and that is what the existing issue (#17450) is about: the device destructor is called too early.

Contributor
@ldrumm ldrumm left a comment

This actually seems to align fairly well with some changes I have locally. I'll echo the pools vs. providers discussion here, though: see the comment below.

@npmiller
Contributor

> Yes, the device destructor should always be called after the context destructor, since a context contains devices, but right now it is NOT, and that is what the existing issue (#17450) is about: the device destructor is called too early.

So moving the pools from the device to the context doesn't really fix anything; it's just that, since the UR device doesn't actually do much for the CUDA adapter, we get kind of lucky that the context can still clean up even after the devices were deleted. It's pretty hacky, but I suppose it's an okay workaround until we can properly fix the device destructor.


ldorau commented Mar 14, 2025

> So moving the pools from the device to the context doesn't really fix anything; it's just that, since the UR device doesn't actually do much for the CUDA adapter, we get kind of lucky that the context can still clean up even after the devices were deleted. It's pretty hacky, but I suppose it's an okay workaround until we can properly fix the device destructor.

@npmiller This PR is a workaround for the segfault.

There are a segfault and a memory leak in the current code when running the test:

  1. ur_device_handle_t_::~ur_device_handle_t_() destructor is called far too early (see issue #17450) and it ...
    a) destroys CUDA memory pools and
    b) destroys the CUDA device.

  2. sycl::~buffer destructor is called and it ...
    a) tries to free CUDA memory from the already destroyed CUDA memory pool (see 1a)
    b) freeing CUDA memory fails, because the CUDA device has already been destroyed (see 1b) - it is a memory leak
    c) calling umfPoolFree(pool, ptr) on the already destroyed CUDA memory pool causes a segfault.

  3. ur_context_handle_t_::~ur_context_handle_t_() destructor is called and it ...
    a) destroys the host memory pool

This pull request is a workaround for this segfault: it moves the CUDA device memory pools from ur_device_handle_t_ to ur_context_handle_t_, so only the memory leak caused by issue #17450 remains:

  1. ur_device_handle_t_::~ur_device_handle_t_() destructor is called far too early (see issue #17450) and it ...
    a) destroys the CUDA device.

  2. sycl::~buffer destructor is called and it ...
    a) tries to free CUDA memory
    b) freeing CUDA memory fails, because the CUDA device has already been destroyed - it is a memory leak
    c) calling umfPoolFree(pool, ptr) just fails.

  3. ur_context_handle_t_::~ur_context_handle_t_() destructor is called and it ...
    a) destroys CUDA memory pools and
    b) destroys the host memory pool

The ideal fix for issue #17450 should move the call of the ur_device_handle_t_::~ur_device_handle_t_() destructor to after the ur_context_handle_t_::~ur_context_handle_t_() destructor is called:

  1. sycl::~buffer destructor is called and it ...
    a) frees CUDA memory by calling umfPoolFree(pool, ptr) on the CUDA memory pool

  2. ur_context_handle_t_::~ur_context_handle_t_() destructor is called and it ...
    a) destroys the host memory pool

  3. ur_device_handle_t_::~ur_device_handle_t_() destructor is called at the end and it ...
    a) destroys CUDA memory pools and
    b) destroys the CUDA device.

So the most appropriate/correct location for the CUDA device memory pools is the current one (in ur_device_handle_t_), but because of issue #17450 it currently causes a segfault.
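For illustration, a minimal sketch of the teardown order this workaround relies on (hypothetical destructor body; the member names are the vectors quoted earlier in this thread):

  // Sketch: with the pools owned by the context, umfPoolDestroy() runs in the
  // context destructor, i.e. after the buffer destructor has already called
  // umfPoolFree() on a still-existing pool, so the segfault is gone and only
  // the leak caused by the premature device destruction remains.
  ur_context_handle_t_::~ur_context_handle_t_() {
    for (umf_memory_pool_handle_t Pool : MemoryDevicePools)
      umfPoolDestroy(Pool); // CUDA device memory pools
    for (umf_memory_pool_handle_t Pool : MemorySharedPools)
      umfPoolDestroy(Pool); // CUDA shared memory pools
  }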


ldorau commented Mar 14, 2025

@npmiller is it OK? Are you going to approve this PR?

@ldorau ldorau marked this pull request as draft March 14, 2025 11:33

ldorau commented Mar 14, 2025

Replaced with #17468 (the test has been disabled)
