Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[UR][L0] Propagate errors from USMAllocationMakeResident #11312

Closed
wants to merge 8 commits into from

Conversation

0x12CC
Copy link
Contributor

@0x12CC 0x12CC commented Sep 26, 2023

Update the UR tag to include a bug fix in L0.

@0x12CC 0x12CC requested a review from a team as a code owner September 26, 2023 20:14
@0x12CC 0x12CC temporarily deployed to WindowsCILock September 26, 2023 20:15 — with GitHub Actions Inactive
@0x12CC 0x12CC temporarily deployed to WindowsCILock September 26, 2023 20:43 — with GitHub Actions Inactive
Copy link
Contributor

@jandres742 jandres742 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @0x12CC . Please move this PR to the UR repo, since code for L0 is now there. instructions for contribution to follow but please check what we are doing here oneapi-src/unified-runtime#901

@0x12CC 0x12CC changed the title [SYCL][L0] Propagate errors from USMAllocationMakeResident [UR][L0] Propagate errors from USMAllocationMakeResident Sep 28, 2023
@0x12CC 0x12CC marked this pull request as draft September 28, 2023 20:12
This change ensures that USM allocation APIs don't return
`UR_RESULT_SUCCESS` when an error occurs within
`USMAllocationMakeResident`.

Signed-off-by: Michael Aziz <michael.aziz@intel.com>
@0x12CC 0x12CC temporarily deployed to WindowsCILock September 28, 2023 20:16 — with GitHub Actions Inactive
@0x12CC 0x12CC temporarily deployed to WindowsCILock September 28, 2023 20:44 — with GitHub Actions Inactive
Signed-off-by: Michael Aziz <michael.aziz@intel.com>
@0x12CC 0x12CC temporarily deployed to WindowsCILock September 29, 2023 13:52 — with GitHub Actions Inactive
@0x12CC 0x12CC temporarily deployed to WindowsCILock September 29, 2023 14:23 — with GitHub Actions Inactive
Signed-off-by: Michael Aziz <michael.aziz@intel.com>
@0x12CC 0x12CC temporarily deployed to WindowsCILock September 29, 2023 15:11 — with GitHub Actions Inactive
@0x12CC 0x12CC temporarily deployed to WindowsCILock September 29, 2023 16:16 — with GitHub Actions Inactive
Signed-off-by: Michael Aziz <michael.aziz@intel.com>
@0x12CC 0x12CC marked this pull request as ready for review October 10, 2023 14:45
Signed-off-by: Michael Aziz <michael.aziz@intel.com>
@0x12CC 0x12CC temporarily deployed to WindowsCILock October 10, 2023 15:00 — with GitHub Actions Inactive
@0x12CC 0x12CC temporarily deployed to WindowsCILock October 10, 2023 15:24 — with GitHub Actions Inactive
@0x12CC 0x12CC closed this Oct 16, 2023
@0x12CC 0x12CC reopened this Oct 16, 2023
Signed-off-by: Michael Aziz <michael.aziz@intel.com>
Signed-off-by: Michael Aziz <michael.aziz@intel.com>
@0x12CC 0x12CC temporarily deployed to WindowsCILock October 16, 2023 20:37 — with GitHub Actions Inactive
@0x12CC
Copy link
Contributor Author

0x12CC commented Oct 16, 2023

@jandres742, @intel/dpcpp-l0-pi-reviewers, could anyone please review this PR?

@jandres742
Copy link
Contributor

thanks @0x12CC . Please confirm whether the failures are related to your changes or not.

@0x12CC 0x12CC temporarily deployed to WindowsCILock October 16, 2023 21:16 — with GitHub Actions Inactive
Signed-off-by: Michael Aziz <michael.aziz@intel.com>
@0x12CC 0x12CC temporarily deployed to WindowsCILock October 17, 2023 14:39 — with GitHub Actions Inactive
@0x12CC 0x12CC temporarily deployed to WindowsCILock October 17, 2023 15:10 — with GitHub Actions Inactive
@kbenzie
Copy link
Contributor

kbenzie commented Oct 17, 2023

thanks @0x12CC . Please confirm whether the failures are related to your changes or not.

This is also affecting #11344 when bumps UR to a newer commit than this PR. Any idea what's causing this? Was the issue somehow hidden before?

@0x12CC
Copy link
Contributor Author

0x12CC commented Oct 17, 2023

@jandres742, @kbenzie, there are two sets of failures:

Failed Tests (25):
  SYCL :: Matrix/Legacy/XMX8/element_wise_all_ops_bf16.cpp
  SYCL :: Matrix/Legacy/XMX8/element_wise_all_ops_half.cpp
  SYCL :: Matrix/Legacy/XMX8/element_wise_irreg_sum_rows.cpp
  SYCL :: Matrix/Legacy/XMX8/element_wise_ops.cpp
  SYCL :: Matrix/Legacy/XMX8/joint_matrix_bf16.cpp
  SYCL :: Matrix/Legacy/XMX8/joint_matrix_bfloat16.cpp
  SYCL :: Matrix/Legacy/XMX8/joint_matrix_half.cpp
  SYCL :: Matrix/Legacy/XMX8/joint_matrix_ss_int8.cpp
  SYCL :: Matrix/Legacy/XMX8/joint_matrix_su_int8.cpp
  SYCL :: Matrix/Legacy/XMX8/joint_matrix_us_int8.cpp
  SYCL :: Matrix/Legacy/XMX8/joint_matrix_uu_int8.cpp
  SYCL :: Matrix/XMX8/element_wise_abc.cpp
  SYCL :: Matrix/XMX8/element_wise_all_ops_half.cpp
  SYCL :: Matrix/XMX8/element_wise_ops.cpp
  SYCL :: Matrix/XMX8/get_coord_float_matC.cpp
  SYCL :: Matrix/XMX8/get_coord_int8_matA.cpp
  SYCL :: Matrix/XMX8/joint_matrix_all_sizes.cpp
  SYCL :: Matrix/XMX8/joint_matrix_apply_bf16.cpp
  SYCL :: Matrix/XMX8/joint_matrix_bfloat16.cpp
  SYCL :: Matrix/XMX8/joint_matrix_bfloat16_array.cpp
  SYCL :: Matrix/XMX8/joint_matrix_half.cpp
  SYCL :: Matrix/XMX8/joint_matrix_ss_int8.cpp
  SYCL :: Matrix/XMX8/joint_matrix_su_int8.cpp
  SYCL :: Matrix/XMX8/joint_matrix_us_int8.cpp
  SYCL :: Matrix/XMX8/joint_matrix_uu_int8.cpp

Failed Tests (1):
  SYCL :: DeviceLib/built-ins/ext_native_math.cpp

The ext_native_math failure is related to #11562. I'm not sure what's causing the matrix test failures. I don't think they're related to my change in UR but they could be caused by any other commit included in this version update.

@jandres742
Copy link
Contributor

@jandres742, @kbenzie, there are two sets of failures:

Failed Tests (25):
  SYCL :: Matrix/Legacy/XMX8/element_wise_all_ops_bf16.cpp
  SYCL :: Matrix/Legacy/XMX8/element_wise_all_ops_half.cpp
  SYCL :: Matrix/Legacy/XMX8/element_wise_irreg_sum_rows.cpp
  SYCL :: Matrix/Legacy/XMX8/element_wise_ops.cpp
  SYCL :: Matrix/Legacy/XMX8/joint_matrix_bf16.cpp
  SYCL :: Matrix/Legacy/XMX8/joint_matrix_bfloat16.cpp
  SYCL :: Matrix/Legacy/XMX8/joint_matrix_half.cpp
  SYCL :: Matrix/Legacy/XMX8/joint_matrix_ss_int8.cpp
  SYCL :: Matrix/Legacy/XMX8/joint_matrix_su_int8.cpp
  SYCL :: Matrix/Legacy/XMX8/joint_matrix_us_int8.cpp
  SYCL :: Matrix/Legacy/XMX8/joint_matrix_uu_int8.cpp
  SYCL :: Matrix/XMX8/element_wise_abc.cpp
  SYCL :: Matrix/XMX8/element_wise_all_ops_half.cpp
  SYCL :: Matrix/XMX8/element_wise_ops.cpp
  SYCL :: Matrix/XMX8/get_coord_float_matC.cpp
  SYCL :: Matrix/XMX8/get_coord_int8_matA.cpp
  SYCL :: Matrix/XMX8/joint_matrix_all_sizes.cpp
  SYCL :: Matrix/XMX8/joint_matrix_apply_bf16.cpp
  SYCL :: Matrix/XMX8/joint_matrix_bfloat16.cpp
  SYCL :: Matrix/XMX8/joint_matrix_bfloat16_array.cpp
  SYCL :: Matrix/XMX8/joint_matrix_half.cpp
  SYCL :: Matrix/XMX8/joint_matrix_ss_int8.cpp
  SYCL :: Matrix/XMX8/joint_matrix_su_int8.cpp
  SYCL :: Matrix/XMX8/joint_matrix_us_int8.cpp
  SYCL :: Matrix/XMX8/joint_matrix_uu_int8.cpp

Failed Tests (1):
  SYCL :: DeviceLib/built-ins/ext_native_math.cpp

The ext_native_math failure is related to #11562. I'm not sure what's causing the matrix test failures. I don't think they're related to my change in UR but they could be caused by any other commit included in this version update.

thanks @0x12CC . for the 25 failures, then two things to do:

  1. could you run one of the test with and without your changes in the UR, on top of the commit you are trying to update here in the CMakeLists? that would help us knowing whether the failures are specific to your changes.
  2. if they are not specific to your changes, I would suggest then to test with different commits between the prev version update and the one being updated there, to find which UR update is causing the problems.

@jandres742
Copy link
Contributor

jandres742 commented Oct 19, 2023

@kbenzie

I was not able to reproduce the failures in a Arc system. So to confirm where the error might be coming from, I created draft PRs for the updates on each of the UR commits between current and the one we are trying here:

testing update in #11587

commit 4954850f338241d72d506ebcb7162d61afef2924
Merge: 0677296 e5d6a91
Author: Kenneth Benzie (Benie) <k.benzie@codeplay.com>
Date:   Tue Oct 10 11:46:13 2023 +0100

    Merge pull request #906 from 0x12CC/l0_usm_error_checking

    [UR][L0] Propagate errors from `USMAllocationMakeResident`

testing update in #11586

commit 484cd0716682b0f7c39ca5c740e19d0e44568342
Author: Nicolas Miller <nicolas.miller@codeplay.com>
Date:   Tue Oct 10 11:42:52 2023 +0100

    Add code owners for the CUDA and HIP adapters

testing update in #11585

commit 0677296778fe29058a4a0dff6d0e30456555c2d1
Merge: b38855e 49770f5
Author: Kenneth Benzie (Benie) <k.benzie@codeplay.com>
Date:   Tue Oct 10 11:38:07 2023 +0100

    Merge pull request #909 from kbenzie/benie/codeowners

    Add CODEOWNERS file

testing update in #11590

commit 6180a78e8bff683c8d944686823b45d3fe939468
Author: Andrey Alekseenko <al42and@gmail.com>
Date:   Mon Oct 2 16:26:36 2023 +0200

    [UR][L0] Fix minor code duplication

    - Second return has no effect
    - Checking the same flag twice has no effect

testing update in #11588

commit 55dccfce3f7371d64cbef7ddc100be39bb5e4c3b
Author: Hugh Delaney <hugh.delaney@codeplay.com>
Date:   Tue Oct 10 10:33:02 2023 +0100

    Re add extended deleters

testing update in #11589

commit 49770f51ac69c485f7ed5c2bc36732e54f8def07
Author: Kenneth Benzie (Benie) <k.benzie@codeplay.com>
Date:   Fri Sep 29 14:21:23 2023 +0100

    Add CODEOWNERS file

    Adapter implementations are being moved into the repo we should mirror
    the code ownership of those implementations. This patch adds a
    CODEOWNERS file which matches the pre adapter move ownership to be used
    as a baseline for future changes for each individual adapter.

this is current baseline

commit b38855ed815ffd076bfde5e5e06170ca4f723dc1
Merge: e6343f4 6a2c548
Author: Piotr Balcer <piotr.balcer@intel.com>
Date:   Thu Oct 5 12:15:42 2023 +0200

    Merge pull request #920 from jsji/localcopy

    [UR][L0] Copy prebuilt L0 to avoid leaking shared folder path

so lets see which PR fails, so at least we know what the culprit is.

@kbenzie
Copy link
Contributor

kbenzie commented Oct 19, 2023

testing update in #11587

commit 4954850f338241d72d506ebcb7162d61afef2924
Merge: 0677296 e5d6a91
Author: Kenneth Benzie (Benie) <k.benzie@codeplay.com>
Date:   Tue Oct 10 11:46:13 2023 +0100

    Merge pull request #906 from 0x12CC/l0_usm_error_checking

    [UR][L0] Propagate errors from `USMAllocationMakeResident`

It does appear to be this change that is causing the regression. I think in order to be able to start making progress getting UR changes into intel/llvm we will need to revert this in UR.

@kbenzie
Copy link
Contributor

kbenzie commented Oct 19, 2023

It does appear to be this change that is causing the regression. I think in order to be able to start making progress getting UR changes into intel/llvm we will need to revert this in UR.

So we tried making the revert and using that in #11344 but the regression persists: https://github.com/intel/llvm/actions/runs/6574062724/job/17858660006?pr=11344

Could we XFAIL this test and investigate it later?

In addition to #11344, there's in now also removing the OpenCL adapter and two more UR PR's we've not merged yet that are blocked by this issue.

@jandres742
Copy link
Contributor

It does appear to be this change that is causing the regression. I think in order to be able to start making progress getting UR changes into intel/llvm we will need to revert this in UR.

So we tried making the revert and using that in #11344 but the regression persists: https://github.com/intel/llvm/actions/runs/6574062724/job/17858660006?pr=11344

Could we XFAIL this test and investigate it later?

In addition to #11344, there's in now also removing the OpenCL adapter and two more UR PR's we've not merged yet that are blocked by this issue.

@kbenzie : if we have only 1 failure with #11344, then yes, we could ask for that to the @intel/llvm-gatekeepers (@smaslov-intel are the @intel/llvm-gatekeepers the correct to determine if we can mark a test as known failure?).

@kbenzie
Copy link
Contributor

kbenzie commented Oct 20, 2023

@kbenzie : if we have only 1 failure with #11344, then yes, we could ask for that to the @intel/llvm-gatekeepers (@smaslov-intel are the @intel/llvm-gatekeepers the correct to determine if we can mark a test as known failure?).

Turns out that #11344 was not using the latest changes from the sycl branch. After pulling in those the remaining regression went away so we won't need to XFAIL the test.

I'll merge the UR revert oneapi-src/unified-runtime#875 shortly.

@0x12CC
Copy link
Contributor Author

0x12CC commented Oct 20, 2023

Thanks for your help, @kbenzie, @jandres742. I'm not able to reproduce these failures locally so I'll need to spend some time investigating the issue. I'm going to close this PR until the failures have been analyzed and I can propose a solution.

@0x12CC 0x12CC closed this Oct 20, 2023
@jandres742
Copy link
Contributor

Thanks for your help, @kbenzie, @jandres742. I'm not able to reproduce these failures locally so I'll need to spend some time investigating the issue. I'm going to close this PR until the failures have been analyzed and I can propose a solution.

thanks @0x12CC . I was not able either on my side. So one theory here would be that we have a data race that is causing this issue, and we might need to try on different types of setups to be able to reproduce it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants