
[SYCL][UR][CUDA] Move CUDA device memory pools to the context #17411


Closed

Conversation


@ldorau ldorau commented Mar 12, 2025

Move CUDA device memory pools to the context.

This PR fixes a segfault caused by #17450

Ref: #17450


ldorau commented Mar 12, 2025

@igchor @bratpiorka @lukaszstolarczuk please review


// Create UMF CUDA memory pools for the device memory (UMF_MEMORY_TYPE_DEVICE)
// and the shared memory (UMF_MEMORY_TYPE_SHARED)
ur_result_t createDeviceMemoryPools() {
Member

Is there a reason you can't just store a ur_usm_pool_handle_t_ in the context? The ctor of ur_usm_pool_handle_t_ does pretty much exactly the same thing as this and the createHostMemoryPool function.

Contributor Author

ur_usm_pool_handle_t_::ur_usm_pool_handle_t_ stores user-created Disjoint Pools in the set of pools std::set<ur_usm_pool_handle_t> PoolHandles;. But I want to store the default Proxy Pools in these vectors:

  std::vector<umf_memory_pool_handle_t> MemoryDevicePools;
  std::vector<umf_memory_pool_handle_t> MemorySharedPools;
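For illustration, a minimal sketch (hypothetical helper and accessor names, not the actual PR code) of how the context could hand out one of these default per-device pools for a given device:

  // Hypothetical sketch: the default proxy pools live in the context, one per
  // device, indexed by the device's position in the context's device list.
  umf_memory_pool_handle_t getDefaultDevicePool(ur_context_handle_t Context,
                                                ur_device_handle_t Device) {
    for (size_t i = 0; i < Context->getDevices().size(); ++i)
      if (Context->getDevices()[i] == Device)
        return Context->MemoryDevicePools[i];
    return nullptr; // Device does not belong to this context.
  }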

Member

Ok, so to me it looks like ur_usm_pool_handle_t_ is implemented incorrectly. Look at how urUSM*Alloc is implemented (https://github.com/intel/llvm/blob/sycl/unified-runtime/source/adapters/cuda/usm.cpp#L71): we are using the DevicePool without even looking at the hDevice param. The ur_usm_pool_handle_t_ should consist of a vector of memory pools for device and shared allocations, just as you have implemented in the context right now.

I know fixing this will require a bit more work, but it would make the code much simpler if we could just store a ur_usm_pool_handle_t_ in the context (not to mention that right now custom memory pools don't seem to work correctly). I won't block this PR, but this should be fixed sooner rather than later.
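Roughly, the suggested direction could look like this minimal sketch (hypothetical helper names, not the actual urUSM*Alloc code), reusing the per-device lookup sketched above and picking the pool based on the hDevice argument:

  // Hypothetical sketch: allocate device USM from the pool that matches the
  // requested device instead of a single shared DevicePool member.
  void *allocFromDevicePool(ur_context_handle_t Context,
                            ur_device_handle_t Device, size_t Size) {
    umf_memory_pool_handle_t Pool = getDefaultDevicePool(Context, Device);
    if (!Pool)
      return nullptr;
    return umfPoolMalloc(Pool, Size); // UMF allocates via the pool's provider.
  }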

Contributor

I'll echo this.
I have a patch locally that creates a pool for each different set of allocation flags (write-combine, non-coherent, etc.), and that gets the provider from the pool where needed using umfPoolGetMemoryProvider.

I'm happy for this to go in as-is for now, though, as I can build on top of this work.
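For reference, a minimal sketch of retrieving the provider from an existing pool with umfPoolGetMemoryProvider (the flag-specific pools themselves are not part of this PR):

  // Sketch: obtain the memory provider that backs a UMF pool, e.g. when a
  // provider-level operation is needed and only the pool handle is available.
  umf_memory_provider_handle_t Provider = nullptr;
  umf_result_t Ret = umfPoolGetMemoryProvider(Pool, &Provider);
  if (Ret != UMF_RESULT_SUCCESS || Provider == nullptr) {
    // Handle the error, e.g. map it to a UR result code.
  }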

umf_memory_provider_handle_t memorySharedProvider = nullptr;
UR_CHECK_ERROR(umf::createDeviceMemoryProviders(
    Device, &memoryDeviceProvider, &memorySharedProvider));

auto UmfDeviceParamsHandle = getUmfParamsHandle(
    DisjointPoolConfigs.Configs[usm::DisjointPoolMemType::Device]);
DeviceMemPool = umf::poolMakeUniqueFromOpsProviderHandle(
Member

Hm, just looking at this code: aren't you overwriting the same member variable (DeviceMemPool) multiple times here?

I think DeviceMemPool should actually be an array, just as you made it in context.cpp (addressed by device id). The same is true for SharedMemPool.

Contributor Author
@ldorau ldorau Mar 13, 2025

This is not my code and I have not changed it; it was done this way before. These two pools are added to the set of pools by Context->addPool(this).

Member

Okay, yeah, I see they are added to the set of pools, but it looks like there is no copy performed here—we're just adding ur_usm_pool_handle_t (that is a pointer to ur_usm_pool_handle_t_) to the std::set, and in the next iterations, we overwrite the values, no? As I mentioned in the previous comment, I think this implementation is incorrect and should be fixed eventually.
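For illustration, a minimal sketch of the pattern in question (hypothetical loop and helper, not the actual constructor code):

  // Sketch of the concern: the same member is reassigned on every iteration,
  // while only the owning object's pointer (this) is inserted into the set,
  // so the previously created pool handles get overwritten.
  for (ur_device_handle_t Device : Context->getDevices()) {
    DeviceMemPool = makePoolForDevice(Device); // overwrites the previous pool
    Context->addPool(this);                    // std::set keeps one pointer
  }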

@ldorau ldorau force-pushed the Move_CUDA_device_memory_pools_to_the_context branch from be92a90 to ee804fa on March 13, 2025 08:21
@ldorau ldorau requested a review from igchor March 13, 2025 08:24
@ldorau ldorau marked this pull request as ready for review March 13, 2025 08:24
@ldorau ldorau requested review from a team as code owners March 13, 2025 08:24
@ldorau ldorau requested a review from ldrumm March 13, 2025 08:24
@ldorau ldorau marked this pull request as draft March 13, 2025 08:26
ldorau added 2 commits March 13, 2025 10:31
From now on disjoint_pool is part of libumf, instead of being
a separate library.

Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>
Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>
@ldorau ldorau force-pushed the Move_CUDA_device_memory_pools_to_the_context branch from ee804fa to 591d28b on March 13, 2025 09:33
@ldorau ldorau marked this pull request as ready for review March 13, 2025 09:34
Member
@igchor igchor left a comment

I'm approving the PR, but IMHO the current implementation (not touched by this PR) of user-defined pools is incorrect. Once it's fixed, the code could be simplified even further.


ldorau commented Mar 13, 2025

@kbenzie @npmiller @frasercrmck @omarahmed1111 please review

@npmiller
Contributor

Is there a reason to move it over? Given how CUDA works, I think the device is a better place for the device memory pools.


ldorau commented Mar 13, 2025

> Is there a reason to move it over? Given how CUDA works, I think the device is a better place for the device memory pools.

Yes, there is a reason to move it over. There is an issue in the SYCL/UR/CUDA adapter (which occurs in the sycl/test-e2e/Regression/static-buffer-dtor.cpp test): the destructor of the CUDA device is called before the destructor of a static buffer defined in a user app (this test). The destructor of the CUDA device destroys all CUDA memory pools, and later the destructor of the user static buffer tries to free CUDA memory from an already destroyed memory pool, which causes a segfault. This PR moves the destruction of the CUDA memory pools to the context, so it happens after the CUDA memory of the user static buffer has been freed.

I will submit an issue for this soon, once I have collected all the traces.


ldorau commented Mar 13, 2025

> Is there a reason to move it over? Given how CUDA works, I think the device is a better place for the device memory pools.

@npmiller This PR fixes a segfault caused by #17450


ldorau commented Mar 13, 2025

@npmiller If it were not for issue #17450, which causes a segfault in the sycl/test-e2e/Regression/static-buffer-dtor.cpp test, this change would not be needed.

@npmiller
Contributor

> Yes, there is a reason to move it over. There is an issue in the SYCL/UR/CUDA adapter (which occurs in the sycl/test-e2e/Regression/static-buffer-dtor.cpp test): the destructor of the CUDA device is called before the destructor of a static buffer defined in a user app (this test). The destructor of the CUDA device destroys all CUDA memory pools, and later the destructor of the user static buffer tries to free CUDA memory from an already destroyed memory pool, which causes a segfault. This PR moves the destruction of the CUDA memory pools to the context, so it happens after the CUDA memory of the user static buffer has been freed.

Oh I see, that's a good catch!

But I'm still a little confused: do you know why the context destructor doesn't also trigger this issue? I would expect the device destructor to always be called after the context destructor, since a context contains devices.


ldorau commented Mar 13, 2025

> Oh I see, that's a good catch!
>
> But I'm still a little confused: do you know why the context destructor doesn't also trigger this issue? I would expect the device destructor to always be called after the context destructor, since a context contains devices.

Yes, the device destructor should always be called after the context destructor, since a context contains devices, but right now it is NOT, and that is what the existing issue (#17450) is about: the device destructor is called too early.

Contributor
@ldrumm ldrumm left a comment

This actually seems to align fairly well with some changes I have locally. I'll echo the pools vs. providers discussion here, though: see the comment below.

@npmiller
Contributor

> Yes, the device destructor should always be called after the context destructor, since a context contains devices, but right now it is NOT, and that is what the existing issue (#17450) is about: the device destructor is called too early.

So moving the pools from the device to the context doesn't really fix anything; it's just that, since the UR device doesn't actually do much for the CUDA adapter, we get kind of lucky that the context can still clean up even after the devices were deleted. It's pretty hacky, but I suppose it's an okay workaround until we can properly fix the device destructor.


ldorau commented Mar 14, 2025

> So moving the pools from the device to the context doesn't really fix anything; it's just that, since the UR device doesn't actually do much for the CUDA adapter, we get kind of lucky that the context can still clean up even after the devices were deleted. It's pretty hacky, but I suppose it's an okay workaround until we can properly fix the device destructor.

@npmiller This PR is a workaround for the segfault.

There are a segfault and a memory leak in the current code when running the test:

  1. ur_device_handle_t_::~ur_device_handle_t_() destructor is called far too early (see issue #17450) and it ...
    a) destroys CUDA memory pools and
    b) destroys the CUDA device.

  2. sycl::~buffer destructor is called and it ...
    a) tries to free CUDA memory from the already destroyed CUDA memory pool (see 1a)
    b) freeing CUDA memory fails, because the CUDA device has already been destroyed (see 1b) - it is a memory leak
    c) calling umfPoolFree(pool, ptr) on the already destroyed CUDA memory pool causes a segfault.

  3. ur_context_handle_t_::~ur_context_handle_t_() destructor is called and it ...
    a) destroys the host memory pool

This pull request is a workaround for this segfault: it moves the CUDA device memory pools from ur_device_handle_t_ to ur_context_handle_t_, so only the memory leak caused by issue #17450 remains:

  1. ur_device_handle_t_::~ur_device_handle_t_() destructor is called far too early (see issue #17450) and it ...
    a) destroys the CUDA device.

  2. sycl::~buffer destructor is called and it ...
    a) tries to free CUDA memory
    b) freeing CUDA memory fails, because the CUDA device has already been destroyed - it is a memory leak
    c) calling umfPoolFree(pool, ptr) just fails.

  3. ur_context_handle_t_::~ur_context_handle_t_() destructor is called and it ...
    a) destroys CUDA memory pools and
    b) destroys the host memory pool

The ideal fix for issue #17450 should move the call of the ur_device_handle_t_::~ur_device_handle_t_() destructor to after the ur_context_handle_t_::~ur_context_handle_t_() destructor is called:

  1. sycl::~buffer destructor is called and it ...
    a) frees CUDA memory by calling umfPoolFree(pool, ptr) on the CUDA memory pool

  2. ur_context_handle_t_::~ur_context_handle_t_() destructor is called and it ...
    a) destroys the host memory pool

  3. ur_device_handle_t_::~ur_device_handle_t_() destructor is called at the end and it ...
    a) destroys CUDA memory pools and
    b) destroys the CUDA device.

So the most appropriate/correct location for the CUDA device memory pools is the current one (in ur_device_handle_t_), but because of issue #17450 it currently causes a segfault.
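For illustration, a minimal sketch of the teardown order this workaround relies on (hypothetical destructor body; the member names are the vectors quoted earlier in this thread):

  // Sketch: with the pools owned by the context, umfPoolDestroy() runs in the
  // context destructor, i.e. after the buffer destructor has already called
  // umfPoolFree() on a still-existing pool, so the segfault is gone and only
  // the leak caused by the premature device destruction remains.
  ur_context_handle_t_::~ur_context_handle_t_() {
    for (umf_memory_pool_handle_t Pool : MemoryDevicePools)
      umfPoolDestroy(Pool); // CUDA device memory pools
    for (umf_memory_pool_handle_t Pool : MemorySharedPools)
      umfPoolDestroy(Pool); // CUDA shared memory pools
  }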


ldorau commented Mar 14, 2025

@npmiller is it OK? Are you going to approve this PR?

@ldorau ldorau marked this pull request as draft March 14, 2025 11:33

ldorau commented Mar 14, 2025

Replaced with #17468 (the test has been disabled)
