Move PowerPC and s390x CI jobs to a separate GitHub account #1678
More specifically, I suggest we :
It is indeed a valid concern. AFAIK there are no GitHub-managed Actions runners for ppc64le. That leaves us two options:
Both options are slower than a self-hosted ppc64le GitHub Actions runner. I would like to invite Timothy Pearson of RaptorCS (@madscientist159) into this conversation; I believe we could work out a sponsored ppc64le host for GitHub Actions runners. Alternatively, we could create a separate repository which basically mirrors the main
#1687 removes these from
This wouldn't help, because resource limits and throttling are per-account, and I think the ToS prohibits creating accounts just to get around resource limits.
I will discuss with my team and see what our options are.
How about creating a totally new account?
Thanks for the proposal; I will also discuss with my team after a local vacation (which ends Oct 7). But I want to confirm: is the proposed/ideal solution to run CI on downstream runners and collect the results in upstream runners/actions?
@briansmith FYI, you could apply for long-term access to POWER9 VMs or containers from the orgs listed in https://openpowerfoundation.org/hub/. Hosting a GitHub Actions runner on those VMs would be the most performant way to go.
Hi @briansmith, on the s390x side I'd be happy to work with my team to set up extra testing if required. Let me try to summarize the options as I currently see them:
Option 1 is what you've been doing so far, and would seem to be the preferable solution, in the absence of the scaling problems you describe. One question is whether there's a way to improve that scaling without significantly compromising the usefulness of the testing.

I understand the main problem is the combinatorial explosion of testing "all architectures" x "many configuration options" x "many Rust compiler versions", each with a full build & test cycle, where the test in particular uses qemu on all non-x86 architectures (causing extra time overhead). The question is whether this matrix can reasonably be reduced. I believe it is true that the full matrix does in fact detect problems visible only in one specific combination (this architecture + these features + that Rust version), but I would expect that in the vast majority of cases those would be build problems.

So one proposal might be to run the full matrix, but only for build, and then run selected subsets for testing (e.g. one test per architecture - this could be what you're already doing for coverage). This would help with the time spent (much fewer qemu executions), but would still leave the problem of the sheer number of action invocations. One solution here might be to simply combine them - maybe run the full cross-build matrix sequentially in a single action. (Or spread it out a bit, e.g. one action per architecture, to avoid having to install multiple cross toolchains.) This would not only get you below the maximum number of actions, but would also save the time spent waiting to acquire and set up all those separate action runner instances.

In the alternative (or in addition), you might also set up an action (or set of actions) that continues to run the full matrix, but doesn't automatically run on every PR - only when manually triggered (e.g. if you suspect that a particular PR is likely to affect other architectures).
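The "full matrix for build, reduced subset for test" idea could look something like this as a GitHub Actions workflow fragment. This is only a sketch under assumptions: the job names, target triples, toolchain list, and the existence of a qemu-wired test runner are all illustrative, not the repo's actual configuration.

```yaml
# Hypothetical workflow sketch; all names and versions are illustrative.
jobs:
  # Full matrix, build only: cheap, and catches most combination-specific
  # breakage without any qemu execution.
  cross-build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        target: [powerpc64le-unknown-linux-gnu, s390x-unknown-linux-gnu]
        rust: [stable, beta, nightly]
    steps:
      - uses: actions/checkout@v4
      - run: rustup toolchain install ${{ matrix.rust }} --target ${{ matrix.target }}
      - run: cargo +${{ matrix.rust }} build --target ${{ matrix.target }}

  # Reduced matrix, full test cycle: one configuration per architecture,
  # with the test binaries executed under qemu-user by a configured runner.
  cross-test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        target: [powerpc64le-unknown-linux-gnu, s390x-unknown-linux-gnu]
    steps:
      - uses: actions/checkout@v4
      - run: cargo test --target ${{ matrix.target }}
```

Collapsing the build matrix into sequential steps of a single job, as suggested above, would trade wall-clock parallelism for far fewer runner acquisitions.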
If this doesn't work for you, we can look into setting up external action runners somewhere else (option 2). These can still be integrated into the main repository, so they would e.g. trigger automatically on PR submission. However, someone would have to set up and maintain the host systems for those runners. This could either be you, on machines provided by us, e.g. via the Community Cloud I mentioned earlier - that would allow you to keep full control over systems that are integrated into your main repo. Or else this could be us - which would make life easier for you by taking away that additional administrative overhead.

The main benefit of option 2 over option 3 would be that testing is still fully integrated into the PR / CI process. A benefit over option 1 would be that these tests would run natively on the runner host architecture, so we would not need qemu. The one open question I have with this approach is whether it would actually fix the maximum-number-of-actions problem. I'm not sure whether or not actions triggered on external runners count towards that limit - I'd hope not, but I'm not definite on that. If they do count, this approach doesn't appear to help much over option 1.

Finally, option 3 would be to just run testing on our architectures completely outside of your main repository. There'd of course be many ways to do that - and in fact I'm not sure that explicitly setting up a "mirror" repo is necessary or even useful to do so. [In fact, I rather don't want to see an "s390x specific" ring repo anywhere public - that could cause people to think that there's some special s390x code in there and/or that when running ring on s390x they should pull from that repo instead of yours; both of which are impressions I'd really prefer to avoid.] For other projects (including the Rust compiler itself), we're doing something similar, but there we simply have an automated job on IBM internal machines that regularly (e.g. once a day) checks out the official repo and runs a build & test cycle locally, sending out emails in case of any failures. The advantage is that this doesn't affect the main repo at all (except for one extra pull a day), and the build & test would again run natively. We could certainly do the same for ring. The main drawback would be that testing would be decoupled from the PR process; if a PR were to break s390x, we'd only know a day after it was merged, and presumably I (or someone on the team) would then open an issue in the main repo describing the regression.

Let me know what you think makes most sense, or if you're seeing any other options I might have missed.
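The option 3 style of job (periodic checkout, native build & test, email on failure) can be sketched as a small cron-driven script. Everything here is an assumption for illustration: the repo URL, work directory, notification address, and the use of `mail` stand in for whatever the internal setup actually does.

```shell
#!/usr/bin/env bash
# Sketch of a daily out-of-tree regression check (option 3).
# REPO_URL, WORKDIR, and MAILTO are hypothetical defaults, not a real setup.
set -euo pipefail

REPO_URL="${REPO_URL:-https://github.com/briansmith/ring}"
WORKDIR="${WORKDIR:-/var/tmp/ring-daily}"
MAILTO="${MAILTO:-maintainer@example.com}"   # hypothetical address

daily_check() {
  rm -rf "$WORKDIR"
  git clone --depth 1 "$REPO_URL" "$WORKDIR"
  cd "$WORKDIR"
  # Native build & test; no qemu needed on a real s390x/ppc64le host.
  if ! cargo test --release >"$WORKDIR/test.log" 2>&1; then
    mail -s "ring regression on $(uname -m)" "$MAILTO" <"$WORKDIR/test.log"
    return 1
  fi
}

# Intended to be invoked from cron, e.g.:
#   0 3 * * * /usr/local/bin/ring-daily-check.sh
if [ "${RUN_DAILY_CHECK:-0}" = "1" ]; then
  daily_check
fi
```

The guard variable keeps the script inert unless explicitly enabled, which makes it safe to source or lint outside the cron environment.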
I took powerpc64le and s390x out of the
I found that recent changes in the CI configuration have made things more bearable (and cheaper) for me, but I also expect to add another axis or two of testing for each target, such that there will be almost 2x as many jobs as there are currently.
@uweigand Thanks for the amazing contribution to the discussion, and sorry for the delayed response. I would rather not add any non-GitHub runners to this repository's CI. The main reason is security: GitHub themselves recommend against adding any runners that you haven't fully secured yourself, because of the security risks, and securing runners is very hard.

I very much do want as much testing on each PR as possible for these targets. Especially since BoringSSL upstream is somewhat big-endian-unfriendly, endianness issues are a huge hazard, and these targets are the ones that catch the vast majority of endianness hazards. Further, if there is any hope that in the future we may have architecture-specific optimizations for these targets, then we'll need to have them supported in CI at the same level as other targets, especially as we're in the middle of a huge refactoring project for the next several months that would otherwise frequently break architecture-specific codepaths.
I would definitely love some help with the administration issues, as much as I can get. So, I would like to steer things towards having IBM manage these runners, and not adding them to the briansmith/ring account.

In terms of the number of jobs, I expect to add at least one or two more dimensions to the matrix: constant-time verification and memory-safety verification, in the manner that BoringSSL upstream does constant-time validation, using valgrind. This requires running the tests in a much slower configuration than currently. Again, if we have any hope of adding architecture-specific optimizations for these targets, then we'll need to do this validation for these architectures. I suspect that running valgrind within QEMU in CI is a non-starter for performance. In the very shortest term, we should probably reorder the build matrix so that
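One plausible mechanism for the valgrind test dimension on native runners is Cargo's per-target `runner` key, which wraps every test executable in the configured command. This is a sketch; the target triple and valgrind flags are illustrative, not the project's actual configuration.

```toml
# .cargo/config.toml (sketch; triple and flags are illustrative)
# Every `cargo test` binary for this target is invoked through valgrind,
# failing the test run if valgrind reports any errors.
[target.x86_64-unknown-linux-gnu]
runner = ["valgrind", "--error-exitcode=1", "--quiet"]
```

On a native s390x or ppc64le host the same pattern would apply with that host's triple, which is exactly why running valgrind natively rather than inside QEMU matters for feasibility.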
I can certainly understand that. However ...
... I'm not sure how to achieve that goal without integrating (external) runners into this repo.
I think we can do that; however, this would only work as per-PR CI if the external runners we maintain are given access to your repo (which you said above you don't want to do...). For now, we've set up a daily regression test on an IBM internal system (option 3 above) which checks out "ring" once a day and runs:
which should match the current CI matrix in your repo. So far these have been running without any problems; in case this detects any failures, I will raise an issue accordingly.
Closing this. First, I think the solution I was hoping for is unlikely to materialize.
I have for the time being added various PowerPC and s390x jobs to the GitHub Actions configuration for this repo. However, this has greatly inflated the number of jobs that each commit and merge needs to run. I had to severely cut back on other aspects of the test matrix (Rust channels, debug vs. release configurations, etc.) to add these jobs because:
I am happy to support these targets in CI, but I would appreciate it if @runlevel5 @ecnelises @q66 @erichte-ibm @uweigand could come up with a plan for moving all the jobs for these targets somewhere else. Perhaps you could set up a separate repo that simply tracks the upstream (this repo's) main branch and runs all the jobs on your targets, perhaps on bare metal instead of in QEMU?
Please send me an email if you want to discuss privately.