
Move PowerPC and s390x CI jobs to a separate GitHub account #1678

Closed
briansmith opened this issue Oct 2, 2023 · 13 comments

Comments

@briansmith
Owner

I have for the time being added various PowerPC and s390x jobs to the GitHub Actions configuration for this repo. However, this has greatly inflated the number of jobs that each commit and merge needs to run. I had to severely cut back on other aspects of the test matrix (Rust channels, debug vs. release configurations, etc.) to add these jobs because:

  1. GitHub imposes a limit of 256 jobs per matrix.
  2. It was taking well over an hour (sometimes multiple hours, I believe) for each push's jobs to complete in CI.

I am happy to support these targets in CI, but I would appreciate it if @runlevel5 @ecnelises @q66 @erichte-ibm @uweigand could come up with a plan for moving all the jobs for these targets somewhere else. Perhaps you could set up a separate repo that simply tracks the upstream (this repo's) main branch and runs all the jobs for your targets, perhaps on bare metal instead of in QEMU?
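
For illustration only, a minimal sketch of what such a tracking setup might look like, assuming a hypothetical mirror repository under a different account (all names below are placeholders, not an actual setup):

# Hypothetical periodic sync of a CI-only mirror with this repo's main branch;
# the mirror's own GitHub Actions workflows would then run the PowerPC/s390x jobs
# against that account's quota.
git clone https://github.com/briansmith/ring.git
cd ring
git remote add ci-mirror https://github.com/EXAMPLE-ORG/ring-ci-mirror.git   # placeholder URL
git push ci-mirror main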

Please send me an email if you want to discuss privately.

@briansmith
Owner Author

More specifically, I suggest we:

@runlevel5

This is indeed a valid concern. AFAIK there are no GitHub-managed Actions runners for ppc64le. That leaves us two options:

  • Cross-compilation on an x86_64 host
  • Compilation under a ppc64le QEMU VM

Both options are slower than a self-hosted ppc64le GitHub Actions runner.
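
For reference, a rough sketch of what both approaches look like on an x86_64 host, assuming Debian/Ubuntu cross-toolchain package names and user-mode QEMU; paths and package names are assumptions, not a tested recipe:

# Cross-compile ring for ppc64le on an x86_64 host (assumed Debian/Ubuntu packages).
sudo apt-get install gcc-powerpc64le-linux-gnu qemu-user
rustup target add powerpc64le-unknown-linux-gnu
export CC_powerpc64le_unknown_linux_gnu=powerpc64le-linux-gnu-gcc
export CARGO_TARGET_POWERPC64LE_UNKNOWN_LINUX_GNU_LINKER=powerpc64le-linux-gnu-gcc
cargo build --target powerpc64le-unknown-linux-gnu
# Run the cross-built test binaries under user-mode QEMU rather than a full VM.
export CARGO_TARGET_POWERPC64LE_UNKNOWN_LINUX_GNU_RUNNER="qemu-ppc64le -L /usr/powerpc64le-linux-gnu"
cargo test --target powerpc64le-unknown-linux-gnu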

I would like to invite Timothy Pearson of RaptorCS (@madscientist159) into this conversation. I believe we could work out a sponsored ppc64le host for GitHub Actions runners.

Alternatively, we could create a separate repository that basically mirrors the main briansmith/ring repo to run CI jobs.

@briansmith
Owner Author

#1687 removes these from test-features, where they really weren't doing anything useful. I moved powerpc-unknown-linux-gnu into test, where s390x and powerpc64le already were.

Alternatively, we could create a separate repository that basically mirrors the main briansmith/ring repo to run CI jobs.

This wouldn't help because resource limits and throttling are per-account and I think ToS prohibits creating accounts just to get around resource limits.

@erichte-ibm
Contributor

I will discuss with my team and see what our options are.

@runlevel5

This wouldn't help because resource limits and throttling are per-account and I think ToS prohibits creating accounts just to get around resource limits

How about creating a totally new account?

@ecnelises
Contributor

Thanks for the proposal. I will also discuss with my team after our local vacation (ends Oct 7).

But I want to confirm: is the proposed/ideal solution to run CI on downstream runners and collect the results in upstream runners/actions?

@runlevel5

runlevel5 commented Oct 3, 2023

@briansmith FYI, you could apply for long-term access to POWER9 VMs or containers from the orgs listed at https://openpowerfoundation.org/hub/

Hosting GitHub Actions runners on those VMs would be the most performant way to go.

@uweigand
Contributor

uweigand commented Oct 3, 2023

Hi @briansmith, on the s390x side I'd be happy to work with my team to set up extra testing if required. Let me try to summarize the options as I currently see them:

  1. Run GitHub actions integrated with the main repository, using standard GitHub runners
  2. Run GitHub actions integrated with the main repository, using external runners hosted elsewhere
  3. Run tests that are not directly integrated with the main repository

Option 1 is what you've been doing so far, and would seem to be the preferable solution, in the absence of the scaling problems you describe. One question is whether there's a way to improve that scaling without significantly compromising the usefulness of the testing. I understand the main problem is the combinatorial explosion of testing "all architectures" x "many configuration options" x "many Rust compiler versions", each with a full build & test cycle, where the test in particular uses qemu on all non-x86 architectures (causing extra time overhead).

The question is whether this matrix can reasonably be reduced. I believe it is true that the full matrix does in fact detect problems visible only in one specific combination (this architecture + these features + that Rust version) - but I would expect that in the vast majority of cases those would be build problems. So one proposal might be to run the full matrix, but only for build, and then run selected subsets for testing (e.g. one test per architecture - this could be what you're already doing for coverage).

This would help with the time spent (far fewer qemu executions), but would still leave the problem of the sheer number of action invocations. But one solution here might be to simply combine them - maybe run the full cross-build matrix sequentially in a single action. (Or spread it out a bit, e.g. one action per architecture, to avoid having to install multiple cross toolchains.) This would not only get you below the maximum number of actions, but would also save the time spent waiting to acquire and set up all those separate action runner instances.
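
As a rough illustration of combining the cross-builds into a single job, a sequential build-only sweep could look something like this; the target list is just an example, and it assumes the matching cross toolchains are already installed and configured:

# Build-only sweep over several targets in one job; actual testing would run
# separately on a reduced subset (e.g. one configuration per architecture).
for target in powerpc-unknown-linux-gnu powerpc64le-unknown-linux-gnu s390x-unknown-linux-gnu; do
  rustup target add "$target"
  cargo build --release --target "$target" || exit 1
done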

Alternatively (or in addition), you might also set up an action (or set of actions) that continues to run the full matrix but wouldn't automatically run on every PR, only when manually triggered (e.g. if you suspect that a particular PR is likely to affect other architectures).


If this doesn't work for you, we can look into setting up external action runners somewhere else (option 2). These can still be integrated in the main repository so they would e.g. trigger automatically on PR submission. However, someone would have to set up and maintain the host systems for those runners. This could either be you, on machines provided by us, e.g. via the Community Cloud I mentioned earlier - that would allow you to keep full control over systems that are integrated into your main repo. Or else this could be us - which would make life easier for you by taking away that additional administrative overhead.

The main benefit of option 2 over option 3 would be that testing is still fully integrated into the PR / CI process. A benefit over option 1 would be that these tests would run natively on the runner host architecture, so we would not need qemu.

The one open question I have with this approach is whether it would actually fix the maximum-number-of-actions problem. I'm not sure whether or not actions triggered on external runners count towards that limit - I'd hope not, but I'm not certain. If they do count, this approach doesn't appear to help much over option 1.


Finally, option 3 would be to just run testing on our architectures completely outside of your main repository. There'd of course be many ways to do that - and in fact I'm not sure that explicitly setting up a "mirror" repo is necessary or even useful to do so. [ In fact, I rather don't want to see a "s390x specific" ring repo anywhere public - that could cause people to think that there's some special s390x code in there and/or that when running ring on s390x they should pull from that repo instead of yours; both of which are impressions I'd really prefer to avoid. ]

For other projects (including the Rust compiler itself), we're doing something similar, but there we simply have an automated job on IBM internal machines that regularly (e.g. once a day) checks out the official repo, and runs a build & test cycle locally, sending out emails in case of any failures. The advantage is that this doesn't affect the main repo at all (except for one extra pull a day), and the build & test would again run natively.
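
A minimal sketch of such a daily job, assuming a cron-driven shell script and a placeholder notification address (both hypothetical):

# Hypothetical daily regression script, e.g. run from cron at 03:00:
#   0 3 * * * /home/ci/ring-daily.sh
rm -rf ring
git clone https://github.com/briansmith/ring.git
cd ring || exit 1
if ! cargo test --release > ../ring-test.log 2>&1; then
  # Placeholder address; any local notification mechanism would do.
  mail -s "ring regression" ci-team@example.com < ../ring-test.log
fi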

We could certainly do the same for ring. The main drawback would be that testing would be decoupled from the PR process; if a PR were to break s390x, we'd only know a day after it was merged, and presumably I (or someone on the team) would then open an issue in the main repo describing the regression.


Let me know what you think makes most sense, or if you're seeing any other options I might have missed.

@briansmith
Owner Author

I took powerpc64le and s390x out of the test matrix completely, leaving them in coverage. I had to do this immediately because I've already used 100% of my GitHub Actions resources for the month, so from today until the 31st I am paying per job.

@briansmith
Owner Author

I found that powerpc-unknown-linux-gnu jobs are regularly the slowest in CI by far, especially powerpc-unknown-linux-gnu, --release, 1.61.0. Perhaps the most immediate help would be to investigate why this is so slow and make it faster.

Recent changes in the CI configuration have made things more bearable (and cheaper) for me, but I also expect to add another axis or two of testing for each target, such that there will be almost 2x as many jobs as there are currently.

@briansmith
Owner Author

@uweigand Thanks for the amazing contribution to the discussion. Sorry to respond to it with such latency.

I would rather not add any non-GitHub runners to this repository's CI. The main reason is security: GitHub themselves recommend against adding any runners that you haven't fully secured yourself, and securing runners is very hard.

I very much do want as much testing on each PR as possible for these targets. Especially since BoringSSL upstream is somewhat big-endian-unfriendly, endianness issues are a huge hazard, and these targets are the ones that catch the vast majority of endianness hazards. Further, if there is any hope that in the future we may have architecture-specific optimizations for these targets, then we'll need to have them supported in CI at the same level as other targets. Especially as we're in the middle of a huge refactoring project for the next several months that would frequently break architecture-specific codepaths otherwise.

[W]e can look into setting up external action runners somewhere else (option 2). These can still be integrated in the main repository so they would e.g. trigger automatically on PR submission. However, someone would have to set up and maintain the host systems for those runners. [....] Or else this could be us - which would make life easier for you by taking away that additional administrative overhead.

I would definitely love some help with the administration issues, as much as I can get. So, I would like to steer things towards having IBM manage these runners, and not adding them to briansmith/ring account.

In terms of the number of jobs, I expect to add at least one or two more dimensions to the matrix: constant-time verification and memory-safety verification in the manner that BoringSSL upstream does constant-time validation--using valgrind. This requires running the tests in a much slower configuration than currently. Again, if we have any hope of adding architecture-specific optimizations for these targets, then we'll need to do this validation for these architectures. I suspect that running valgrind within QEMU in CI is a non-starter for performance.
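
For context, one way to plumb valgrind into cargo test on a native host is Cargo's per-target runner setting; a hedged sketch for x86_64 follows. This only covers the plumbing of running test binaries under valgrind, not the secret-poisoning annotations that BoringSSL's constant-time validation relies on.

# Run the test binaries under valgrind on a native x86_64 host (sketch only;
# other architectures would substitute their own target triple).
export CARGO_TARGET_X86_64_UNKNOWN_LINUX_GNU_RUNNER="valgrind --error-exitcode=1"
cargo test --release --target x86_64-unknown-linux-gnu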

In the very shortest term, we should probably reorder the build matrix so that test jobs run sorted by Rust version (in this order: MSRV, stable, nightly) and then by architecture, so that all architectures are tested using Rust 1.61 (the current MSRV) before any Rust stable tests are run. I believe this would stop 32-bit PowerPC using Rust 1.61 from being the long pole in CI, as it currently is.
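
To illustrate only the intended ordering (not how the Actions matrix itself is declared), a toolchain-major sweep would enumerate the jobs like this; the target list is just an example:

# Intended ordering: every architecture on the MSRV toolchain first, then stable,
# then nightly, so the slow 32-bit powerpc + 1.61.0 job starts as early as possible.
for toolchain in 1.61.0 stable nightly; do
  for target in powerpc-unknown-linux-gnu powerpc64le-unknown-linux-gnu s390x-unknown-linux-gnu; do
    echo "cargo +$toolchain test --target $target"
  done
done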

@uweigand
Contributor

I would rather not add any non-GitHub runners to this repository's CI. The main reason is security: GitHub themselves recommend against adding any runners that you haven't fully secured yourself, and securing runners is very hard.

I can certainly understand that. However ...

I very much do want as much testing on each PR as possible for these targets.

... I'm not sure how to achieve that goal without integrating (external) runners into this repo.

[W]e can look into setting up external action runners somewhere else (option 2). These can still be integrated in the main repository so they would e.g. trigger automatically on PR submission. However, someone would have to set up and maintain the host systems for those runners. [....] Or else this could be us - which would make life easier for you by taking away that additional administrative overhead.

I would definitely love some help with the administration issues, as much as I can get. So, I would like to steer things towards having IBM manage these runners, and not adding them to briansmith/ring account.

I think we can do that; however, this would only work as per-PR CI if the external runners we maintain are given access to your repo (which you said above you don't want to do ...).

For now, we've set up a daily regression test on an IBM internal system (option 3 above) which checks out "ring" once a day and runs:

cargo test
cargo test --release
cargo +1.61.0 test
cargo +1.61.0 test --release
cargo test --release --no-default-features
cargo test --release --features=std,slow_tests
cargo +nightly test --release --no-default-features
cargo +nightly test --release --features=std,slow_tests
cargo +1.61.0 test --release --no-default-features
cargo +1.61.0 test --release --features=std,slow_tests

which should match the current CI matrix in your repo.

So far these have been running without any problems - in case this detects any failures, I will raise an issue accordingly.

@briansmith
Owner Author

Closing this.

First, I think the solution I was hoping for is unlikely to materialize.
Second, I have been optimizing the way the GitHub Actions are defined here, so the performance is not so terrible now.
Third, I think any effort is better spent on optimizing the platform-neutral code and/or tests that are the slowest.
