[SYCL][UR][CUDA] Move CUDA device memory pools to the context #17411
Conversation
@igchor @bratpiorka @lukaszstolarczuk please review
// Create UMF CUDA memory pools for the device memory (UMF_MEMORY_TYPE_DEVICE)
// and the shared memory (UMF_MEMORY_TYPE_SHARED)
ur_result_t createDeviceMemoryPools() {
Is there a reason you can't just store a ur_usm_pool_handle_t_ in the context? The ctor of ur_usm_pool_handle_t_ does pretty much exactly the same thing as this and the createHostMemoryPool function.
ur_usm_pool_handle_t_::ur_usm_pool_handle_t_ stores user-created Disjoint Pools in the set of pools (std::set<ur_usm_pool_handle_t> PoolHandles), but I want to store the default Proxy Pools in these vectors:

std::vector<umf_memory_pool_handle_t> MemoryDevicePools;
std::vector<umf_memory_pool_handle_t> MemorySharedPools;
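
For illustration, a minimal sketch of how the context holds these per-device default pools; the member names come from this PR, but the surrounding class layout is an assumption:

#include <set>
#include <vector>
#include <umf/memory_pool.h> // umf_memory_pool_handle_t

struct ur_usm_pool_handle_t_; // user pool type, declared in the UR adapter
using ur_usm_pool_handle_t = ur_usm_pool_handle_t_ *;

struct ur_context_handle_t_ {
  // Default proxy pools, one per device of the context (this PR),
  // indexed by the device's position in the context's device list.
  std::vector<umf_memory_pool_handle_t> MemoryDevicePools;
  std::vector<umf_memory_pool_handle_t> MemorySharedPools;

  // User-created Disjoint Pools are tracked separately, as pointers.
  std::set<ur_usm_pool_handle_t> PoolHandles;
};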
Ok, so to me it looks like ur_usm_pool_handle_t_ is implemented incorrectly. Look at how urUSM*Alloc is implemented: https://github.com/intel/llvm/blob/sycl/unified-runtime/source/adapters/cuda/usm.cpp#L71 We are using the DevicePool without even looking at the hDevice param. The ur_usm_pool_handle_t_ should consist of a vector of memory pools for device and shared allocations, just as you implemented in the context right now. I know fixing this will require a bit more work, but it would make the code much simpler if we could just store a ur_usm_pool_handle_t_ in the context (not to mention that right now, custom memory pools don't seem to work correctly). I won't block this PR, but this should be fixed sooner rather than later.
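
To illustrate the suggested rework, a rough sketch follows; the member names and the deviceIndex helper are hypothetical, not the adapter's actual API:

#include <vector>
#include <umf/memory_pool.h>

struct ur_device_handle_t_;
using ur_device_handle_t = ur_device_handle_t_ *;

// Placeholder for however the adapter maps a device handle to its
// position in the context's device list (hypothetical helper).
size_t deviceIndex(ur_device_handle_t hDevice);

struct ur_usm_pool_handle_t_ {
  // One pool per device instead of a single DevicePool member, so the
  // allocation path can pick the pool matching hDevice.
  std::vector<umf_memory_pool_handle_t> DeviceMemPools;
  std::vector<umf_memory_pool_handle_t> SharedMemPools;

  umf_memory_pool_handle_t getDevicePool(ur_device_handle_t hDevice) {
    return DeviceMemPools[deviceIndex(hDevice)];
  }
};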
I'll echo this. I have a patch locally that creates a pool for different sets of allocation flags (write-combined, non-coherent, etc.) and that gets the provider from the pool where needed using umfPoolGetMemoryProvider. I'm happy for this to go in as-is for now, though, as I can build on top of this work.
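
For reference, a minimal sketch of the pattern being described, recovering a pool's provider via UMF's umfPoolGetMemoryProvider (error handling simplified):

#include <umf/memory_pool.h>
#include <umf/memory_provider.h>

// Recover the memory provider from a pool when only the pool is stored.
umf_memory_provider_handle_t providerOf(umf_memory_pool_handle_t hPool) {
  umf_memory_provider_handle_t hProvider = nullptr;
  if (umfPoolGetMemoryProvider(hPool, &hProvider) != UMF_RESULT_SUCCESS)
    return nullptr; // caller decides how to handle the failure
  return hProvider;
}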
umf_memory_provider_handle_t memorySharedProvider = nullptr;
UR_CHECK_ERROR(umf::createDeviceMemoryProviders(
    Device, &memoryDeviceProvider, &memorySharedProvider));

auto UmfDeviceParamsHandle = getUmfParamsHandle(
    DisjointPoolConfigs.Configs[usm::DisjointPoolMemType::Device]);
DeviceMemPool = umf::poolMakeUniqueFromOpsProviderHandle(
Hm, just looking at this code: aren't you overwriting the same member variable (DeviceMemPool) multiple times here? I think DeviceMemPool should actually be an array, just as you did in context.cpp (addressed by device id). The same is true for SharedMemPool.
It is not my code and I have not changed it; it was done this way before. These two pools are added by Context->addPool(this); to the set of pools.
Okay, yeah, I see they are added to the set of pools, but it looks like there is no copy performed here: we're just adding a ur_usm_pool_handle_t (that is, a pointer to ur_usm_pool_handle_t_) to the std::set, and in the next iterations we overwrite the values, no? As I mentioned in the previous comment, I think this implementation is incorrect and should be fixed eventually.
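
To make the suspected problem concrete, a simplified illustration (initPools, makePoolFor, and makeSharedPoolFor are placeholders, not the actual adapter code):

// Simplified sketch of the suspected bug: each loop iteration assigns to
// the same members, so only the pools created for the last device survive.
void ur_usm_pool_handle_t_::initPools() { // hypothetical helper
  for (ur_device_handle_t Device : Context->getDevices()) {
    DeviceMemPool = makePoolFor(Device);       // overwrites previous value
    SharedMemPool = makeSharedPoolFor(Device); // overwrites previous value
  }
  Context->addPool(this); // stores only this pointer; the pools are not copied
}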
From now on disjoint_pool is part of libumf, instead of being a separate library. Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>
I'm approving the PR, but IMHO the current implementation (not touched by this PR) of user-defined pools is incorrect. Once it's fixed, the code could be simplified even further.
@kbenzie @npmiller @frasercrmck @omarahmed1111 please review
Is there a reason to move it over? Given how CUDA works, I think the device is a better place for the device memory pools.
Yes, there is a reason to move it over. There is an issue in the SYCL/UR/CUDA adapter; I will submit an issue for it soon, once I have collected all the traces.
Oh I see, that's a good catch! But I'm still a little confused, do you know why the context destructor doesn't also trigger this issue? I would expect the device destructor to always be called after the context destructor since a context contains devices.
Yes, the device destructor should always be called after the context destructor since a context contains devices, but currently it is NOT, and this is what the existing issue (#17450) is about: the device destructor is called too early.
This actually seems to align fairly well with some changes I have locally. I'll echo the pools-vs-provider discussion here, though: see the comment below.
So moving the pools from the device to the context doesn't really fix anything; it's just that, since the UR device doesn't actually do much for the CUDA adapter, we get kind of lucky that the context can still clean up even after the devices were deleted. It's pretty hacky, but I suppose it's an okay workaround until we can properly fix the device destructor.
@npmiller This PR is a workaround for the segfault: there are a segfault and a memory leak in the current code when running the test. This pull request works around the segfault by moving the CUDA device memory pools from the device to the context. The ideal fix for the #17450 issue should move the device destructor call so that it runs after the context destructor; until that happens, the most appropriate/correct location for the CUDA device memory pools is the current one (in the context).
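
A minimal sketch of the lifetime argument, assuming the shape of the workaround (the vector members come from this PR; the destructor body is an illustration, not the actual code):

// With the context owning the default pools, they are released in the
// context destructor, so nothing is freed through a device handle that
// may already have been destroyed too early (#17450).
ur_context_handle_t_::~ur_context_handle_t_() {
  for (umf_memory_pool_handle_t hPool : MemoryDevicePools)
    umfPoolDestroy(hPool);
  for (umf_memory_pool_handle_t hPool : MemorySharedPools)
    umfPoolDestroy(hPool);
}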
@npmiller is it OK? Are you going to approve this PR?
Replaced with #17468 (the test has been disabled).
Move CUDA device memory pools to the context.
This PR works around a segfault caused by the issue described in #17450.
Ref: #17450