Skip to content

[UR][L0] Avoid calls to destroy interop data structures given loader instability #17543

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Mar 21, 2025

Conversation

nrspruit
Copy link
Contributor

@nrspruit nrspruit commented Mar 19, 2025

  • When an application transfers ownership of a L0 handle to the UR, this does not prevent the OS from releasing the memory backing the handle when the interop library is torn down. This leads to a situation where the UR is trying to destroy interop data structures that have already been destroyed by the OS.
  • UR has never known the lifetime of the handle, but with the static L0 Loader, the workaround in the loader to handle this situation and "pretend" the memory was freed is impossible to use.
  • To fix this issue, avoid calls to destroy interop data structures when the runtime is being torn down.
  • This avoids calling a function that already did not perform the intended operation and prevents segfaults when the UR tries to destroy interop data structures that have already been destroyed.

…loader

- When an application transfers ownership of a L0 handle to the UR, this
  does not prevent the OS from releasing the memory backing the
  handle when the interop library is torn down. This leads to a
  situation where the UR is trying to destroy interop data structures
  that have already been destroyed by the OS.
- UR has never known the lifetime of the handle, but with the static
  L0 Loader, the workaround in the loader to handle this situation and
  "pretend" the memory was freed is impossible to use.
- To fix this issue, avoid calls to destroy interop data structures
  when the runtime is being torn down when using the static loader.
- This avoids calling a function that already did not perform the
  intended operation and prevents segfaults when the UR tries to
  destroy interop data structures that have already been destroyed.

Signed-off-by: Neil R. Spruit <neil.r.spruit@intel.com>
@igchor
Copy link
Member

igchor commented Mar 20, 2025

@nrspruit This seems like a more general problem, not only related to interop. As I understand, this check that we're doing in the loader now behaves differently for static loader: https://github.com/oneapi-src/level-zero/blob/3c938e21d827af014971d69dfd66759c2444e4d0/source/lib/ze_libapi.cpp#L220, correct?

Can't we move the destruction variable to somewhere in the dynamic loader so that we can track dynamic loader lifetime instead of the static loader?

@nrspruit
Copy link
Contributor Author

@nrspruit This seems like a more general problem, not only related to interop. As I understand, this check that we're doing in the loader now behaves differently for static loader: https://github.com/oneapi-src/level-zero/blob/3c938e21d827af014971d69dfd66759c2444e4d0/source/lib/ze_libapi.cpp#L220, correct?

Can't we move the destruction variable to somewhere in the dynamic loader so that we can track dynamic loader lifetime instead of the static loader?

That won't fix any existing library and it is stopping invalid behavior.

We should never be trying to free memory after destruction, this will never 100% work properly. Before we did not free the memory, but exited early.

Even if we added a call in the dynamic library, that would not fix any existing dynamic library so the issue would still exist in all loaders in the market until a new loader with the new workaround exists. We will not usually have the same static and dynamic loader so changing the loader dynamic to have another workaround will only fix newer loaders.

The teardown handling has always been a workaround due to invalid use cases.

An application cannot be using a handle allocated in another library and expect to be able to free it in that application after the parent application has exited. No library supports that, before it "pretended" that the memory was freed, but never actually could free any memory.

Preferrably, we just would never allow for destroying L0 interop handles if the application is in at_exit.

@igchor
Copy link
Member

igchor commented Mar 20, 2025

@nrspruit This seems like a more general problem, not only related to interop. As I understand, this check that we're doing in the loader now behaves differently for static loader: https://github.com/oneapi-src/level-zero/blob/3c938e21d827af014971d69dfd66759c2444e4d0/source/lib/ze_libapi.cpp#L220, correct?
Can't we move the destruction variable to somewhere in the dynamic loader so that we can track dynamic loader lifetime instead of the static loader?

That won't fix any existing library and it is stopping invalid behavior.

We should never be trying to free memory after destruction, this will never 100% work properly. Before we did not free the memory, but exited early.

Even if we added a call in the dynamic library, that would not fix any existing dynamic library so the issue would still exist in all loaders in the market until a new loader with the new workaround exists. We will not usually have the same static and dynamic loader so changing the loader dynamic to have another workaround will only fix newer loaders.

The teardown handling has always been a workaround due to invalid use cases.

An application cannot be using a handle allocated in another library and expect to be able to free it in that application after the parent application has exited. No library supports that, before it "pretended" that the memory was freed, but never actually could free any memory.

Preferrably, we just would never allow for destroying L0 interop handles if the application is in at_exit.

I completely agree, that the best way would be to just never allow calling L0 APIs after destructor is called. Is this something we can do?

What I don't quite understand is, why this worked before switching to the static loader. From what I can see, there was already logic based on ze_lib::destruction variable. My question is: why can't we leverage that logic (which is present in older loaders)?

@nrspruit
Copy link
Contributor Author

@nrspruit This seems like a more general problem, not only related to interop. As I understand, this check that we're doing in the loader now behaves differently for static loader: https://github.com/oneapi-src/level-zero/blob/3c938e21d827af014971d69dfd66759c2444e4d0/source/lib/ze_libapi.cpp#L220, correct?
Can't we move the destruction variable to somewhere in the dynamic loader so that we can track dynamic loader lifetime instead of the static loader?

That won't fix any existing library and it is stopping invalid behavior.
We should never be trying to free memory after destruction, this will never 100% work properly. Before we did not free the memory, but exited early.
Even if we added a call in the dynamic library, that would not fix any existing dynamic library so the issue would still exist in all loaders in the market until a new loader with the new workaround exists. We will not usually have the same static and dynamic loader so changing the loader dynamic to have another workaround will only fix newer loaders.
The teardown handling has always been a workaround due to invalid use cases.
An application cannot be using a handle allocated in another library and expect to be able to free it in that application after the parent application has exited. No library supports that, before it "pretended" that the memory was freed, but never actually could free any memory.
Preferrably, we just would never allow for destroying L0 interop handles if the application is in at_exit.

I completely agree, that the best way would be to just never allow calling L0 APIs after destructor is called. Is this something we can do?

What I don't quite understand is, why this worked before switching to the static loader. From what I can see, there was already logic based on ze_lib::destruction variable. My question is: why can't we leverage that logic (which is present in older loaders)?

This "worked" before because the call to zeModuleDestory etc exited early because the zelLoaderTeardown was called in the dynamic loader which set the ze_lib::destruction=true.

With the static loader, the L0 driver cannot load the "L0 application" to inform the application ie "UR" that the driver is being torndown. As the user of the static loader, it is up to us to avoid calling destroy of inherited handles after atexit has begun otherwise there is no guarantee that the handle will be valid, the OS will never provide that gurantee so we need to avoid calling teardown.

previously, the loader still did not allow the destroy to continue to actually destroy anything, it simply returned before trying to read the handle.

@nrspruit
Copy link
Contributor Author

What I don't quite understand is, why this worked before switching to the static loader. From what I can see, there was already logic based on ze_lib::destruction variable. My question is: why can't we leverage that logic (which is present in older loaders)?

this has been a workaround that only works if the application is linking the loader with dyanmic because the L0 driver would call the ze_lib.

Basically, in this situation, the ze_loader.dll is already gone ie:
loader dllmain destroy
loader context destroy
---> urKernelRelease
<--- urKernelRelease(.hKernel = 0000019ABA4BC2E0) -> UR_RESULT_SUCCESS;
---> urProgramRelease
<--- urProgramRelease(.hProgram = 0000019ABA4BC640) -> UR_RESULT_SUCCESS;
---> urKernelRelease

We are calling release on memory when the dyanmic loader is gone, no amount of refcnting can prevent this, loadlibrary and dlopen don't affect anything after atexit() has begun, the refcnt it not honored in linux or windows.

@pbalcer
Copy link
Contributor

pbalcer commented Mar 20, 2025

Will leak checking tools (like valgrind memcheck) report a leak on global interop handles that weren't released manually?

@nrspruit
Copy link
Contributor Author

Will leak checking tools (like valgrind memcheck) report a leak on global interop handles that weren't released manually?

valgrind wont if they did not before because the memory was never being released even before with the dynamic loader.

if we use the L0 free check using the validation layer, that may fail in these use cases because the "call" to destroy is missing, but I believe that should be accepted as an expected missing call in these situations.

@pbalcer
Copy link
Contributor

pbalcer commented Mar 20, 2025

if we use the L0 free check using the validation layer, that may fail in these use cases because the "call" to destroy is missing, but I believe that should be accepted as an expected missing call in these situations.

Yes, I agree.

Ok, thanks, lgtm in that case.

@nrspruit nrspruit force-pushed the address_interop_teardown_static branch from 43ee23a to 1190e80 Compare March 20, 2025 21:09
Signed-off-by: Neil R. Spruit <neil.r.spruit@intel.com>
@nrspruit nrspruit force-pushed the address_interop_teardown_static branch from 1190e80 to 2ebc226 Compare March 20, 2025 21:12
@nrspruit nrspruit marked this pull request as ready for review March 20, 2025 21:17
@nrspruit nrspruit requested a review from a team as a code owner March 20, 2025 21:17
Signed-off-by: Neil R. Spruit <neil.r.spruit@intel.com>
@nrspruit nrspruit changed the title [UR][L0] Avoid calls to destroy interop data structures given static … [UR][L0] Avoid calls to destroy interop data structures Mar 20, 2025
@nrspruit nrspruit marked this pull request as draft March 21, 2025 05:11
@nrspruit
Copy link
Contributor Author

I have a potential patch that will workaround and detect a defunct L0 loader. I am testing it now. If so, I can avoid the memory leak in this case.

Signed-off-by: Neil R. Spruit <neil.r.spruit@intel.com>
@nrspruit nrspruit force-pushed the address_interop_teardown_static branch from 70713ae to 62aef33 Compare March 21, 2025 05:33
@nrspruit nrspruit changed the title [UR][L0] Avoid calls to destroy interop data structures [UR][L0] Avoid calls to destroy interop data structures given loader instability Mar 21, 2025
Signed-off-by: Neil R. Spruit <neil.r.spruit@intel.com>
@nrspruit nrspruit marked this pull request as ready for review March 21, 2025 05:50
@nrspruit
Copy link
Contributor Author

I have a potential patch that will workaround and detect a defunct L0 loader. I am testing it now. If so, I can avoid the memory leak in this case.

I found a method that can check during teardown to determine if the loader is stable. This resolves the known issues while enabling release of memory when the loader is stable.

@nrspruit
Copy link
Contributor Author

@intel/llvm-gatekeepers , please merge when available to address this issue seen by users.

@sarnex sarnex merged commit 25e1b55 into intel:sycl Mar 21, 2025
42 of 43 checks passed
KornevNikita pushed a commit that referenced this pull request May 27, 2025
…instability (#17543)

- When an application transfers ownership of a L0 handle to the UR, this
does not prevent the OS from releasing the memory backing the handle
when the interop library is torn down. This leads to a situation where
the UR is trying to destroy interop data structures that have already
been destroyed by the OS.
- UR has never known the lifetime of the handle, but with the static L0
Loader, the workaround in the loader to handle this situation and
"pretend" the memory was freed is impossible to use.
- To fix this issue, avoid calls to destroy interop data structures when
the runtime is being torn down.
- This avoids calling a function that already did not perform the
intended operation and prevents segfaults when the UR tries to destroy
interop data structures that have already been destroyed.

---------

Signed-off-by: Neil R. Spruit <neil.r.spruit@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants