GitLab CI testing expansion #1644

Merged
merged 44 commits into develop on Jul 12, 2024

Conversation

@long58 (Contributor) commented May 13, 2024

Summary

  • This PR is a feature
  • It does the following:
    • Adds SYCL CI checks on corona using the custom intel compiler
    • Adds instructions for building and installing the custom intel compiler
    • Adds OpenMP Target offload CI checks on lassen using ibm clang
    • Fixes GitLab cross-project testing for RAJAPerf on changes to the RAJA develop branch.

To accomplish this, I needed to make changes to the radiuss_spack_configs GitHub project to add new compilers, and to modify the raja and camp package.py files. I plan to open another pull request there to integrate those changes, but for now, my branch of radiuss_spack_configs (used here) is up to date with the latest radiuss_spack_configs/develop.

Also, some (~9) of the OpenMP Target tests on lassen are flaky: they pass on some runs and fail on others, and I'm not entirely sure why.

This feature completes all objectives of GitHub issue #1611.

@adrienbernede (Member) commented May 15, 2024

@long58 I started reading your PR.

First, thank you for fixing the multi-project! I was actually looking at that today, and couldn’t see what was wrong. I think GitLab changed the syntax at some point to remove those quotes. So, thanks for the help.
Now, I see in the RAJAPerf build_and_test.sh script that the multi-project trigger will import RAJA@develop regardless of the branch actually passed as "MP_BRANCH". This should be fixed to support any reference (let's start with the branch, but I think supporting a commit hash would be useful too). I can work on that, but in this PR we should at least use the branch passed with "MP_BRANCH".
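
A minimal sketch (hypothetical job layout, project path, and variable name, not the actual RAJA configuration) of how the trigger job could forward the ref under test, so the downstream RAJAPerf build_and_test.sh has something other than a hardcoded RAJA@develop to check out:

# Hypothetical trigger job in .gitlab-ci.yml; names and paths are illustrative only.
trigger-rajaperf:
  stage: multi-project
  variables:
    RAJA_REF: ${CI_COMMIT_REF_NAME}   # forwarded to the downstream pipeline
  trigger:
    project: path/to/RAJAPerf         # placeholder project path
    branch: develop
    strategy: depend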

We can see that the status reports a failure named "skipped-draft-pr". I think this will go away if you merge "develop" into your branch.

I’ll complete the review tomorrow.

long58 self-assigned this on May 15, 2024
@rhornung67 (Member) commented:

@long58 good work on this so far. I'll review in more detail tomorrow. For now, I suggest that we remove the OpenMP target testing from this branch. Then, we can work on the OpenMP target stuff in a separate PR. It will take some effort to try various compilers to figure out what works. We've always had difficulty with compiler support for that back-end.

@adrienbernede I think it would be best to test RAJAPerf against the PR branch to be merged. However, I think it would be sufficient to see whether RAJAPerf passes or fails (so we can fix breakages in RAJAPerf) without requiring that check to pass before merging the RAJA branch. Is that what you have in mind?

@adrienbernede (Member) commented:

@rhornung67 the new data point I find in your comment is that you don't want to bind the RAJA PR status to the RAJAPerf status. I am fine with that, and it can be achieved by removing strategy: depend from the trigger job (see .gitlab-ci.yml in RAJA).
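
A minimal sketch, with the same hypothetical names as above, of the trigger job without strategy: depend; by default GitLab marks a trigger job successful once the downstream pipeline is created, so the RAJA PR status would no longer mirror the RAJAPerf result:

# Hypothetical variant: dropping strategy: depend decouples the RAJA status from RAJAPerf.
trigger-rajaperf:
  stage: multi-project
  trigger:
    project: path/to/RAJAPerf   # placeholder project path
    branch: develop
    # strategy: depend          # removed: upstream no longer waits on the downstream status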

@adrienbernede (Member) commented:

For the record, my impression so far is that there is a bug in Spack's detection of the compiler @long58 added. Clang@19.0.0 does not seem to be an official release. Testing suggests that naming the compiler clang@19.0.0 would be fine (Spack would extrapolate), but clang@19.0.0.something is too much.
I haven't looked further than that yet. If changing the spec to clang@19.0.0 is a reasonable workaround, I suggest you try it.
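
As a sketch of the workaround, a Spack compilers.yaml entry using the plain clang@19.0.0 version string might look like the following (paths, operating system, and target are placeholders, not the actual corona setup):

# Hypothetical compilers.yaml entry for the custom intel compiler; values are illustrative.
compilers:
- compiler:
    spec: clang@19.0.0                        # plain X.Y.Z version sidesteps the detection issue
    paths:
      cc: /path/to/custom-intel/bin/clang     # placeholder path
      cxx: /path/to/custom-intel/bin/clang++  # placeholder path
      f77: null
      fc: null
    operating_system: rhel8                   # placeholder
    target: x86_64
    modules: []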

@long58 (Contributor, Author) commented Jun 27, 2024

That does seem to work based on my testing. I'm fine with doing that for the sake of having it work, though I do think it would be a lot clearer if the extra version suffix were there 🤷🏻‍♀️. Since it's a Spack bug, we can always rename the compiler once they fix it. I've updated the spec, as well as the compiler definition in the radiuss_spack_configs repo.

@rhornung67 (Member) commented:

@long58 The build started but timed out. We should bump up the allocation time. Also, there are a lot of compiler warnings (unused variables, etc.). When this PR passes checks and is good to merge, can you start a new PR branch and try to resolve the compiler warnings? Many appear easy to fix.

Also, I think I found a path forward for adding an OpenMP target offload testing pipeline on lassen. It's suboptimal, but it will at least let us check compilation. I'll start a PR for this.

@long58 (Contributor, Author) commented Jun 28, 2024

@rhornung67 I don't think it actually timed out; I think it failed for some other reason. It failed after 14 min 30 sec, but the timeout limit is 3 hours. Unfortunately, because of all the compiler warnings, GitLab is truncating the output, so I can't see what the problem is. I might make a branch off of this one that fixes some of the compiler warnings, just so I can see what is killing the job in CI.

@rhornung67 (Member) commented:

@long58 sounds like a good path

@adrienbernede (Member) commented:

@long58 GitLab is down at the moment, but I have some information for you:

Logs too long to display:

When a job's log is too long to display, you can still see it by hitting the "raw" button at the top of the job log. This shows the full, raw log.

Time out:

There are several ways for a job to time out in GitLab:

  • Timing out waiting for a runner
  • Timing out because the job lasted too long (the 3 hours you mention)
  • Timing out because the allocation itself timed out.

Rich was referring to the latter. In our pipelines, all the jobs on a given machine run within the same shared allocation defined in .gitlab/custom-jobs-and-variables.yml. For corona, it looks like:

CORONA_SHARED_ALLOC: "--exclusive --time-limit=60m --nodes=1 -o per-resource.count=2"

The time limit is 60 minutes, shared between all the jobs, and 2 jobs can run at the same time (per-resource.count).
So, if the jobs that ran before the failing one had already consumed 45 min 30 sec of the allocation, it is possible that the job timed out because of the allocation.

Time out messages:
On corona and tioga, flux is pretty explicit. On lassen, lrun would give you an obscure type 2 error. On ruby and poodle, srun should be rather clear too.
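
As a sketch of the allocation-side fix (assuming the variable is defined as shown above; the new value is illustrative, not a recommendation), bumping the shared allocation in .gitlab/custom-jobs-and-variables.yml would look like:

variables:
  # Hypothetical bump of the corona shared allocation from 60m to 90m for more headroom.
  CORONA_SHARED_ALLOC: "--exclusive --time-limit=90m --nodes=1 -o per-resource.count=2"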

@rhornung67 (Member) commented:

@long58 @adrienbernede note that LC is doing maintenance on CZ GitLab and is expected to be down until Monday morning (7/8)

@long58 (Contributor, Author) commented Jul 8, 2024

@adrienbernede I had tried the complete log before, but it still doesn't show the entire output. It shows the log from the beginning, but ends with "Job's log exceeded limit of 32866304 bytes. Job execution will continue but no more output will be collected."
As for the timeout, I'll try upping the time limit and try again.

@long58 (Contributor, Author) commented Jul 8, 2024

Merging develop into this branch seemed to fix the CI build issues.

@rhornung67 (Member) commented:

Isn't this fun?!?

Review comment on .gitlab-ci.yml (outdated):
@@ -91,7 +91,7 @@ stages:
 trigger-rajaperf:
   stage: multi-project
   rules:
-    - if: $CI_COMMIT_BRANCH == $MP_BRANCH || $MULTI_PROJECT == "ON" #run only if ...
+    - if: $CI_COMMIT_BRANCH == $MP_BRANCH || $MULTI_PROJECT == "ON" || $CI_MERGE_REQUEST_TARGET_BRANCH_NAME == "develop" #run only if ...
@adrienbernede (Member) commented Jul 10, 2024

@long58 I think the CI_MERGE_REQUEST... part should be reverted before merging

@adrienbernede (Member) commented:

@long58 @rhornung67 I am seeing intermittent failures. I restarted the jobs and they passed. I tried the clang@19 SYCL job locally and it passed.
Would it be possible to ignore warnings in the first build attempt of this particular job? (Or at least a certain class of warnings?)

@rhornung67 (Member) left a comment:

Thank you @adrienbernede

@long58 (Contributor, Author) commented Jul 11, 2024

@adrienbernede I know how to disable warnings for this specific job, but I don't know how to disable them only on the first build attempt. I'll look through the GitLab docs and see if I can find anything like that, unless you already know how to do it?

@adrienbernede (Member) commented Jul 11, 2024

@long58 good point. Now that I think about it, I don't know how to do that either.
I was thinking about not ignoring the warnings in the second build attempt in the build_and_test script:

$cmake_exe --build . --verbose -j 1

But that would imply overriding the flags passed at the Spack spec level, and I have no idea how to do that in a clean way.
Maybe we could just ignore those warnings for that job, period?

@long58 (Contributor, Author) commented Jul 11, 2024

That's what I was thinking: just ignore the warnings for this job until I finish my branch that actually fixes them. Simply adding -w to the Spack SPEC seems to do the trick. I'll push that.

This is only temporary until the underlying causes can be fixed.
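
A minimal sketch of the idea, assuming a radiuss-style job definition on corona; the job name, variants, and job template are hypothetical, and cxxflags=-w is just one way to pass -w through the Spack spec:

# Hypothetical corona SYCL job with warnings silenced via the spec (temporary measure).
clang_19_0_0_sycl:
  variables:
    SPEC: "+sycl %clang@19.0.0 cxxflags=-w"   # hypothetical variants; -w silences compiler warnings
  extends: .job_on_corona                     # hypothetical job template name
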
@adrienbernede (Member) commented Jul 12, 2024

Looks really good to me.

@rhornung67 (Member) commented:

If this is good to merge, @long58 you have the honor.

@long58 (Contributor, Author) commented Jul 12, 2024

Yay! 🥳

long58 merged commit 25b4f0d into develop on Jul 12, 2024
24 checks passed