-
Notifications
You must be signed in to change notification settings - Fork 730
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[UR][L0] Propagate errors from USMAllocationMakeResident
#11312
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @0x12CC . Please move this PR to the UR repo, since code for L0 is now there. instructions for contribution to follow but please check what we are doing here oneapi-src/unified-runtime#901
USMAllocationMakeResident
This change ensures that USM allocation APIs don't return `UR_RESULT_SUCCESS` when an error occurs within `USMAllocationMakeResident`. Signed-off-by: Michael Aziz <michael.aziz@intel.com>
5e661fd
to
fcf7d02
Compare
Signed-off-by: Michael Aziz <michael.aziz@intel.com>
Signed-off-by: Michael Aziz <michael.aziz@intel.com>
Signed-off-by: Michael Aziz <michael.aziz@intel.com>
Signed-off-by: Michael Aziz <michael.aziz@intel.com>
Signed-off-by: Michael Aziz <michael.aziz@intel.com>
Signed-off-by: Michael Aziz <michael.aziz@intel.com>
@jandres742, @intel/dpcpp-l0-pi-reviewers, could anyone please review this PR? |
thanks @0x12CC . Please confirm whether the failures are related to your changes or not. |
Signed-off-by: Michael Aziz <michael.aziz@intel.com>
@jandres742, @kbenzie, there are two sets of failures:
The |
thanks @0x12CC . for the 25 failures, then two things to do:
|
I was not able to reproduce the failures in a Arc system. So to confirm where the error might be coming from, I created draft PRs for the updates on each of the UR commits between current and the one we are trying here: testing update in #11587 commit 4954850f338241d72d506ebcb7162d61afef2924
Merge: 0677296 e5d6a91
Author: Kenneth Benzie (Benie) <k.benzie@codeplay.com>
Date: Tue Oct 10 11:46:13 2023 +0100
Merge pull request #906 from 0x12CC/l0_usm_error_checking
[UR][L0] Propagate errors from `USMAllocationMakeResident` testing update in #11586 commit 484cd0716682b0f7c39ca5c740e19d0e44568342
Author: Nicolas Miller <nicolas.miller@codeplay.com>
Date: Tue Oct 10 11:42:52 2023 +0100
Add code owners for the CUDA and HIP adapters testing update in #11585 commit 0677296778fe29058a4a0dff6d0e30456555c2d1
Merge: b38855e 49770f5
Author: Kenneth Benzie (Benie) <k.benzie@codeplay.com>
Date: Tue Oct 10 11:38:07 2023 +0100
Merge pull request #909 from kbenzie/benie/codeowners
Add CODEOWNERS file testing update in #11590 commit 6180a78e8bff683c8d944686823b45d3fe939468
Author: Andrey Alekseenko <al42and@gmail.com>
Date: Mon Oct 2 16:26:36 2023 +0200
[UR][L0] Fix minor code duplication
- Second return has no effect
- Checking the same flag twice has no effect testing update in #11588 commit 55dccfce3f7371d64cbef7ddc100be39bb5e4c3b
Author: Hugh Delaney <hugh.delaney@codeplay.com>
Date: Tue Oct 10 10:33:02 2023 +0100
Re add extended deleters testing update in #11589 commit 49770f51ac69c485f7ed5c2bc36732e54f8def07
Author: Kenneth Benzie (Benie) <k.benzie@codeplay.com>
Date: Fri Sep 29 14:21:23 2023 +0100
Add CODEOWNERS file
Adapter implementations are being moved into the repo we should mirror
the code ownership of those implementations. This patch adds a
CODEOWNERS file which matches the pre adapter move ownership to be used
as a baseline for future changes for each individual adapter. this is current baseline commit b38855ed815ffd076bfde5e5e06170ca4f723dc1
Merge: e6343f4 6a2c548
Author: Piotr Balcer <piotr.balcer@intel.com>
Date: Thu Oct 5 12:15:42 2023 +0200
Merge pull request #920 from jsji/localcopy
[UR][L0] Copy prebuilt L0 to avoid leaking shared folder path so lets see which PR fails, so at least we know what the culprit is. |
It does appear to be this change that is causing the regression. I think in order to be able to start making progress getting UR changes into intel/llvm we will need to revert this in UR. |
So we tried making the revert and using that in #11344 but the regression persists: https://github.com/intel/llvm/actions/runs/6574062724/job/17858660006?pr=11344 Could we XFAIL this test and investigate it later? In addition to #11344, there's in now also removing the OpenCL adapter and two more UR PR's we've not merged yet that are blocked by this issue. |
@kbenzie : if we have only 1 failure with #11344, then yes, we could ask for that to the @intel/llvm-gatekeepers (@smaslov-intel are the @intel/llvm-gatekeepers the correct to determine if we can mark a test as known failure?). |
Turns out that #11344 was not using the latest changes from the sycl branch. After pulling in those the remaining regression went away so we won't need to XFAIL the test. I'll merge the UR revert oneapi-src/unified-runtime#875 shortly. |
Thanks for your help, @kbenzie, @jandres742. I'm not able to reproduce these failures locally so I'll need to spend some time investigating the issue. I'm going to close this PR until the failures have been analyzed and I can propose a solution. |
thanks @0x12CC . I was not able either on my side. So one theory here would be that we have a data race that is causing this issue, and we might need to try on different types of setups to be able to reproduce it. |
Update the UR tag to include a bug fix in L0.