
[CI] Don't limit runners working in parallel #8567


Merged
bader merged 1 commit into intel:sycl from ci-cuda-fix on Mar 8, 2023
Conversation

bader (Contributor) commented Mar 7, 2023

Currently we see a problem with llvm-test-suite checks on CUDA in the AWS cloud: it looks like an AWS machine is never allocated to execute the job. The current hypothesis is that we have a matrix spawning 6 jobs, but we limit the number of runners working in parallel to 4. This means that the llvm-test-suite check on CUDA might wait for 1 or 2 llvm-test-suite checks on other machines to complete before we can run llvm-test-suite on the CUDA machine.
This strategy conflicts with the rules established for the AWS machines, which wait for 1 hour after a machine is allocated for the job to start. If the job has not arrived, the machine is returned to AWS. Since the AWS allocation is specific to each workflow run, once the machine is returned to AWS, GitHub is not able to use it anymore. GitHub waits for 24 hours and then kills the workflow due to the missing runner.

This change removes the limit on the number of runners we can use in parallel to run llvm-test-suite. This should fix the problem with missing results on CUDA machines.
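For reference, the change boils down to dropping max-parallel from the matrix strategy of the llvm-test-suite job. A minimal sketch of what that looks like is below; the workflow name, job name, runner labels, and matrix entries are placeholders, not the repository's actual workflow contents.

```yaml
# Sketch only: names and labels below are illustrative placeholders.
name: llvm-test-suite-sketch
on: pull_request

jobs:
  llvm_test_suite:
    strategy:
      # max-parallel: 4   # removed: with a 6-entry matrix, up to 2 jobs
      #                   # (potentially the CUDA one on AWS) were held back
      matrix:
        include:
          - config: cuda
            runner: aws-cuda          # placeholder label for the AWS CUDA machine
          - config: other-target
            runner: other-runner      # placeholder label
          # ...remaining matrix entries (6 in total)...
    runs-on: ${{ matrix.runner }}
    steps:
      - name: Run llvm-test-suite
        run: echo "llvm-test-suite on ${{ matrix.config }}"   # stand-in for the real test step
```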

@bader bader requested a review from apstasen March 7, 2023 23:41
@bader bader requested a review from a team as a code owner March 7, 2023 23:41
bader (Contributor, Author) commented Mar 7, 2023

@alexbatashev, do you remember why you added max-parallel in 495c300?

@bader bader temporarily deployed to aws March 8, 2023 02:32 with GitHub Actions
@bader bader merged commit 6077e8e into intel:sycl Mar 8, 2023
@bader bader deleted the ci-cuda-fix branch March 8, 2023 04:52
alexbatashev (Contributor) commented
> @alexbatashev, do you remember why you added max-parallel in 495c300?

We had limited resources, and not adding max-parallel could have jammed subsequent pull requests if one of the PRs caused tests to hang.

bader (Contributor, Author) commented Mar 8, 2023

> @alexbatashev, do you remember why you added max-parallel in 495c300?
>
> We had limited resources, and not adding max-parallel could have jammed subsequent pull requests if one of the PRs caused tests to hang.

Sorry, I don't get this. max-parallel was set to 4, and that was exactly the size of the matrix, so effectively this setting matched the default behavior. It only played a role when we increased the matrix size.

What kind of problems do you mean by "jammed subsequent pull requests"? According to my understanding, if one of the runners is stuck on hanging tests, the other runners can pick up jobs. The stuck job will be killed within 6 hours, and an enqueued job can sit in the queue for up to 24 hours.

I'm not sure I understand the problem you were trying to solve with this configuration, so I'm trying to figure out whether removing max-parallel is the right approach or I'm missing something.

alexbatashev (Contributor) commented
> @alexbatashev, do you remember why you added max-parallel in 495c300?
>
> We had limited resources, and not adding max-parallel could have jammed subsequent pull requests if one of the PRs caused tests to hang.
>
> Sorry, I don't get this. max-parallel was set to 4, and that was exactly the size of the matrix, so effectively this setting matched the default behavior. It only played a role when we increased the matrix size.
>
> What kind of problems do you mean by "jammed subsequent pull requests"? According to my understanding, if one of the runners is stuck on hanging tests, the other runners can pick up jobs. The stuck job will be killed within 6 hours, and an enqueued job can sit in the queue for up to 24 hours.
>
> I'm not sure I understand the problem you were trying to solve with this configuration, so I'm trying to figure out whether removing max-parallel is the right approach or I'm missing something.

I think at some point there were more than 4 jobs (or at least we were going to add more). If tests occupy all the available machines and hang, that means all PRs are stuck for at least 6 hours. That could hurt developer experience when submitting PRs. If you now have more machines available than test jobs, it is absolutely safe to remove this limit.

bader (Contributor, Author) commented Mar 8, 2023

> @alexbatashev, do you remember why you added max-parallel in 495c300?
>
> We had limited resources, and not adding max-parallel could have jammed subsequent pull requests if one of the PRs caused tests to hang.
>
> Sorry, I don't get this. max-parallel was set to 4, and that was exactly the size of the matrix, so effectively this setting matched the default behavior. It only played a role when we increased the matrix size.
> What kind of problems do you mean by "jammed subsequent pull requests"? According to my understanding, if one of the runners is stuck on hanging tests, the other runners can pick up jobs. The stuck job will be killed within 6 hours, and an enqueued job can sit in the queue for up to 24 hours.
> I'm not sure I understand the problem you were trying to solve with this configuration, so I'm trying to figure out whether removing max-parallel is the right approach or I'm missing something.
>
> I think at some point there were more than 4 jobs (or at least we were going to add more). If tests occupy all the available machines and hang, that means all PRs are stuck for at least 6 hours. That could hurt developer experience when submitting PRs. If you now have more machines available than test jobs, it is absolutely safe to remove this limit.

If I understand it correctly, max-parallel delays the enqueuing of jobs. Let's assume we have 4 runners and they are broken so that they hang all jobs for 6 hours. In my opinion, the developer experience can be even worse with the max-parallel setting.

  1. If a job is not enqueued due to the max-parallel configuration, GitHub logs are not available and it's not possible to understand why the job does not start. When the job is in the queue, the logs show that it's waiting for a runner.
  2. If a runner is provided within 24 hours, the job will fail after 6 hours in both cases. Otherwise, GitHub will kill the job whether it's in the queue or waiting for other jobs to finish first. There is no difference in developer experience.
  3. This configuration impacts only the jobs spawned from a single PR workflow, so another PR can spawn another max-parallel set of jobs and get the same "undesirable developer experience". It's not clear how max-parallel really helps in this case.

How does max-parallel help in this case?

alexbatashev (Contributor) commented
@bader that is not the case I had in mind. Initially we had 5-ish machines available. If a developer submits a PR that contains some sort of a deadlock, and there are at least as many test configs as runners, this PR stalls all other PRs. There's no way to "unhang" the CI from the UI. AFAIK, GitHub will not send SIGKILL to the running process when clicking "cancel". So, there must be a living person with access to the machine to go and reboot it. If all the machines are faulty (e.g., due to a driver fault), max-parallel does not help at all.

bader (Contributor, Author) commented Mar 8, 2023

> @bader that is not the case I had in mind. Initially we had 5-ish machines available. If a developer submits a PR that contains some sort of a deadlock, and there are at least as many test configs as runners, this PR stalls all other PRs. There's no way to "unhang" the CI from the UI. AFAIK, GitHub will not send SIGKILL to the running process when clicking "cancel". So, there must be a living person with access to the machine to go and reboot it. If all the machines are faulty (e.g., due to a driver fault), max-parallel does not help at all.

According to my understanding, max-parallel changes the stage at which a job waits for available resources. By default, it's the queue, and max-parallel blocks the job even earlier, but neither approach can block for more than 24 hours. max-parallel will move the blocked jobs to the queue after GitHub kills the "deadlocked" job (i.e. one that has been running for 6h).

How does max-parallel help developers? In 99% of cases, the developer will not notice a "deadlock" in the tests before GitHub finishes or kills all the jobs and sends the results back.

alexbatashev (Contributor) commented
> How does max-parallel help developers? In 99% of cases, the developer will not notice a "deadlock" in the tests before GitHub finishes or kills all the jobs and sends the results back.

It doesn't help the one who caused the hang. But it should help everyone else.

bader (Contributor, Author) commented Mar 8, 2023

> But it should help everyone else.

Sorry for asking the same question, but I still don't understand how this setting helps other developers.
I'll probably leave it at the default for now and think about it again when we hit this problem. Right now I don't understand how setting max-parallel helps. 😕

bader (Contributor, Author) commented Mar 9, 2023

Talked offline with @aelovikov-intel and he gave me the following interpretation of the original intention behind the max-parallel use.

I'm going to use an abstract model to simplify the example and demonstrate the use case.
Let's assume the following setup: two runners and one workflow with two jobs running in parallel. We have a PR (let's call it the "bad PR") that has a "deadlock" bug in each job we run in CI, and because both jobs run in parallel, both runners will spin in the "deadlock". All PRs launched after our "bad PR" will have to wait for GitHub to free the runners (i.e. ~6h).
If we set max-parallel to 1 in the workflow and allow one PR to occupy only part of the available runners (1 of 2), other PRs will have a chance to continue execution even before GitHub kills the hung tasks.
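A minimal sketch of that model, using made-up workflow and runner names rather than the project's real configuration:

```yaml
# Abstract model: two runners share one label, the matrix has two jobs, and
# max-parallel: 1 lets a single PR occupy at most one runner at a time.
name: example-precommit
on: pull_request

jobs:
  test:
    strategy:
      max-parallel: 1                 # at most one job from this PR's workflow runs at a time
      matrix:
        config: [job-a, job-b]        # two illustrative jobs
    runs-on: [self-hosted, example-pool]   # illustrative label shared by the two runners
    steps:
      - name: Run tests
        run: echo "running ${{ matrix.config }}"   # stand-in for the real test step
```

With this, even if both jobs of a "bad PR" deadlock, they tie up only one runner at a time, leaving the other runner free for jobs from other PRs.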

@alexbatashev, did I get it correctly that the max-parallel setting is intended to guarantee that if one PR blocks its allocated runners, there will be other runners available to independently process jobs from "good" PRs?

alexbatashev (Contributor) commented
@bader you're right

bader (Contributor, Author) commented Mar 9, 2023

@alexbatashev, thanks for confirming.

@bader bader temporarily deployed to aws March 9, 2023 12:24 with GitHub Actions