-
Notifications
You must be signed in to change notification settings - Fork 769
[UR][L0] Avoid calls to destroy interop data structures given loader instability #17543
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…loader - When an application transfers ownership of a L0 handle to the UR, this does not prevent the OS from releasing the memory backing the handle when the interop library is torn down. This leads to a situation where the UR is trying to destroy interop data structures that have already been destroyed by the OS. - UR has never known the lifetime of the handle, but with the static L0 Loader, the workaround in the loader to handle this situation and "pretend" the memory was freed is impossible to use. - To fix this issue, avoid calls to destroy interop data structures when the runtime is being torn down when using the static loader. - This avoids calling a function that already did not perform the intended operation and prevents segfaults when the UR tries to destroy interop data structures that have already been destroyed. Signed-off-by: Neil R. Spruit <neil.r.spruit@intel.com>
@nrspruit This seems like a more general problem, not only related to interop. As I understand, this check that we're doing in the loader now behaves differently for static loader: https://github.com/oneapi-src/level-zero/blob/3c938e21d827af014971d69dfd66759c2444e4d0/source/lib/ze_libapi.cpp#L220, correct? Can't we move the |
That won't fix any existing library and it is stopping invalid behavior. We should never be trying to free memory after destruction, this will never 100% work properly. Before we did not free the memory, but exited early. Even if we added a call in the dynamic library, that would not fix any existing dynamic library so the issue would still exist in all loaders in the market until a new loader with the new workaround exists. We will not usually have the same static and dynamic loader so changing the loader dynamic to have another workaround will only fix newer loaders. The teardown handling has always been a workaround due to invalid use cases. An application cannot be using a handle allocated in another library and expect to be able to free it in that application after the parent application has exited. No library supports that, before it "pretended" that the memory was freed, but never actually could free any memory. Preferrably, we just would never allow for destroying L0 interop handles if the application is in at_exit. |
I completely agree, that the best way would be to just never allow calling L0 APIs after destructor is called. Is this something we can do? What I don't quite understand is, why this worked before switching to the static loader. From what I can see, there was already logic based on |
This "worked" before because the call to zeModuleDestory etc exited early because the zelLoaderTeardown was called in the dynamic loader which set the ze_lib::destruction=true. With the static loader, the L0 driver cannot load the "L0 application" to inform the application ie "UR" that the driver is being torndown. As the user of the static loader, it is up to us to avoid calling destroy of inherited handles after atexit has begun otherwise there is no guarantee that the handle will be valid, the OS will never provide that gurantee so we need to avoid calling teardown. previously, the loader still did not allow the destroy to continue to actually destroy anything, it simply returned before trying to read the handle. |
this has been a workaround that only works if the application is linking the loader with dyanmic because the L0 driver would call the ze_lib. Basically, in this situation, the ze_loader.dll is already gone ie: We are calling release on memory when the dyanmic loader is gone, no amount of refcnting can prevent this, loadlibrary and dlopen don't affect anything after atexit() has begun, the refcnt it not honored in linux or windows. |
Will leak checking tools (like valgrind memcheck) report a leak on global interop handles that weren't released manually? |
valgrind wont if they did not before because the memory was never being released even before with the dynamic loader. if we use the L0 free check using the validation layer, that may fail in these use cases because the "call" to destroy is missing, but I believe that should be accepted as an expected missing call in these situations. |
Yes, I agree. Ok, thanks, lgtm in that case. |
43ee23a
to
1190e80
Compare
Signed-off-by: Neil R. Spruit <neil.r.spruit@intel.com>
1190e80
to
2ebc226
Compare
Signed-off-by: Neil R. Spruit <neil.r.spruit@intel.com>
I have a potential patch that will workaround and detect a defunct L0 loader. I am testing it now. If so, I can avoid the memory leak in this case. |
Signed-off-by: Neil R. Spruit <neil.r.spruit@intel.com>
70713ae
to
62aef33
Compare
Signed-off-by: Neil R. Spruit <neil.r.spruit@intel.com>
I found a method that can check during teardown to determine if the loader is stable. This resolves the known issues while enabling release of memory when the loader is stable. |
@intel/llvm-gatekeepers , please merge when available to address this issue seen by users. |
…instability (#17543) - When an application transfers ownership of a L0 handle to the UR, this does not prevent the OS from releasing the memory backing the handle when the interop library is torn down. This leads to a situation where the UR is trying to destroy interop data structures that have already been destroyed by the OS. - UR has never known the lifetime of the handle, but with the static L0 Loader, the workaround in the loader to handle this situation and "pretend" the memory was freed is impossible to use. - To fix this issue, avoid calls to destroy interop data structures when the runtime is being torn down. - This avoids calling a function that already did not perform the intended operation and prevents segfaults when the UR tries to destroy interop data structures that have already been destroyed. --------- Signed-off-by: Neil R. Spruit <neil.r.spruit@intel.com>
Uh oh!
There was an error while loading. Please reload this page.