
[CI] Don't limit runners working in parallel #8567


Merged
bader merged 1 commit into intel:sycl from ci-cuda-fix on Mar 8, 2023
Conversation

bader (Contributor) commented Mar 7, 2023

Currently we see a problem with llvm-test-suite checks on CUDA in the AWS cloud: it looks like an AWS machine is never allocated to execute the job. The current hypothesis is that we have a matrix spawning 6 jobs, but we limit the number of runners working in parallel to 4. This means that the llvm-test-suite check on CUDA might wait for 1 or 2 llvm-test-suite checks on other machines to complete before we can run llvm-test-suite on the CUDA machine.
This strategy conflicts with the rules established for the AWS machines, which wait for 1 hour after a machine is allocated for the job to start. If the job has not arrived, the machine is returned to AWS. Since the AWS allocation is specific to each workflow run, once the machine is returned to AWS, GitHub is not able to use it anymore. GitHub waits for 24 hours and then kills the workflow due to the missing runner.

This change removes the limit on the number of runners we can use in parallel to run llvm-test-suite. This should fix the problem with missing results on CUDA machines.
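For reference, the change boils down to dropping max-parallel from the matrix strategy of the llvm-test-suite job. A minimal sketch of what that looks like is below; the workflow name, job name, runner labels, and matrix entries are placeholders, not the repository's actual workflow contents.

```yaml
# Sketch only: names and labels below are illustrative placeholders.
name: llvm-test-suite-sketch
on: pull_request

jobs:
  llvm_test_suite:
    strategy:
      # max-parallel: 4   # removed: with a 6-entry matrix, up to 2 jobs
      #                   # (potentially the CUDA one on AWS) were held back
      matrix:
        include:
          - config: cuda
            runner: aws-cuda          # placeholder label for the AWS CUDA machine
          - config: other-target
            runner: other-runner      # placeholder label
          # ...remaining matrix entries (6 in total)...
    runs-on: ${{ matrix.runner }}
    steps:
      - name: Run llvm-test-suite
        run: echo "llvm-test-suite on ${{ matrix.config }}"   # stand-in for the real test step
```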

@bader bader requested a review from apstasen March 7, 2023 23:41
@bader bader requested a review from a team as a code owner March 7, 2023 23:41
bader (Contributor, Author) commented Mar 7, 2023

@alexbatashev, do you remember why you added max-parallel in 495c300?

@bader bader temporarily deployed to aws March 8, 2023 02:32 with GitHub Actions
@bader bader merged commit 6077e8e into intel:sycl Mar 8, 2023
@bader bader deleted the ci-cuda-fix branch March 8, 2023 04:52
alexbatashev (Contributor) commented
> @alexbatashev, do you remember why you added max-parallel in 495c300?

We had limited resources, and not adding max-parallel could have jammed subsequent pull requests if one of the PRs caused tests to hang.

bader (Contributor, Author) commented Mar 8, 2023

> @alexbatashev, do you remember why you added max-parallel in 495c300?
>
> We had limited resources, and not adding max-parallel could have jammed subsequent pull requests if one of the PRs caused tests to hang.

Sorry, I don't get this. max-parallel was set to 4, and that was exactly the size of the matrix, so effectively this setting matched the default behavior. It only played a role when we increased the matrix size.

What kind of problems do you mean by "jammed subsequent pull requests"? According to my understanding, if one of the runners is stuck on hanging tests, the other runners can pick up jobs. The stuck job will be killed within 6 hours, and an enqueued job can sit in the queue for up to 24 hours.

I'm not sure I understand the problem you were trying to solve with this configuration, so I'm trying to figure out whether removing max-parallel is the right approach or I'm missing something.

alexbatashev (Contributor) commented
> @alexbatashev, do you remember why you added max-parallel in 495c300?
>
> We had limited resources, and not adding max-parallel could have jammed subsequent pull requests if one of the PRs caused tests to hang.
>
> Sorry, I don't get this. max-parallel was set to 4, and that was exactly the size of the matrix, so effectively this setting matched the default behavior. It only played a role when we increased the matrix size.
>
> What kind of problems do you mean by "jammed subsequent pull requests"? According to my understanding, if one of the runners is stuck on hanging tests, the other runners can pick up jobs. The stuck job will be killed within 6 hours, and an enqueued job can sit in the queue for up to 24 hours.
>
> I'm not sure I understand the problem you were trying to solve with this configuration, so I'm trying to figure out whether removing max-parallel is the right approach or I'm missing something.

I think at some point there were more than 4 jobs (or at least we were going to add more). If tests occupy all the available machines and hang, that means all PRs are stuck for at least 6 hours. That could hurt developer experience when submitting PRs. If you now have more machines available than test jobs, it is absolutely safe to remove this limit.

bader (Contributor, Author) commented Mar 8, 2023

> @alexbatashev, do you remember why you added max-parallel in 495c300?
>
> We had limited resources, and not adding max-parallel could have jammed subsequent pull requests if one of the PRs caused tests to hang.
>
> Sorry, I don't get this. max-parallel was set to 4, and that was exactly the size of the matrix, so effectively this setting matched the default behavior. It only played a role when we increased the matrix size.
> What kind of problems do you mean by "jammed subsequent pull requests"? According to my understanding, if one of the runners is stuck on hanging tests, the other runners can pick up jobs. The stuck job will be killed within 6 hours, and an enqueued job can sit in the queue for up to 24 hours.
> I'm not sure I understand the problem you were trying to solve with this configuration, so I'm trying to figure out whether removing max-parallel is the right approach or I'm missing something.
>
> I think at some point there were more than 4 jobs (or at least we were going to add more). If tests occupy all the available machines and hang, that means all PRs are stuck for at least 6 hours. That could hurt developer experience when submitting PRs. If you now have more machines available than test jobs, it is absolutely safe to remove this limit.

If I understand it correctly, max-parallel delays the enqueuing of jobs. Let's assume we have 4 runners and they are broken so that they hang all jobs for 6 hours. In my opinion, the developer experience can be even worse with the max-parallel setting.

  1. If a job is not enqueued due to the max-parallel configuration, GitHub logs are not available and it's not possible to understand why the job does not start. When the job is in the queue, the logs show that it's waiting for a runner.
  2. If a runner is provided within 24 hours, the job will fail after 6 hours in both cases. Otherwise, GitHub will kill the job whether it's in the queue or waiting for other jobs to finish first. There is no difference in developer experience.
  3. This configuration impacts only the jobs spawned from a single PR workflow, so another PR can spawn another max-parallel set of jobs and get the same "undesirable developer experience". It's not clear how max-parallel really helps in this case.

How does max-parallel help in this case?

alexbatashev (Contributor) commented
@bader that is not the case I had in mind. Initially we had 5-ish machines available. If a developer submits a PR that contains some sort of a deadlock, and there are at least as many test configs as runners, this PR stalls all other PRs. There's no way to "unhang" the CI from the UI. AFAIK, GitHub will not send SIGKILL to the running process when clicking "cancel". So, there must be a living person with access to the machine to go and reboot it. If all the machines are faulty (e.g., due to a driver fault), max-parallel does not help at all.

bader (Contributor, Author) commented Mar 8, 2023

> @bader that is not the case I had in mind. Initially we had 5-ish machines available. If a developer submits a PR that contains some sort of a deadlock, and there are at least as many test configs as runners, this PR stalls all other PRs. There's no way to "unhang" the CI from the UI. AFAIK, GitHub will not send SIGKILL to the running process when clicking "cancel". So, there must be a living person with access to the machine to go and reboot it. If all the machines are faulty (e.g., due to a driver fault), max-parallel does not help at all.

According to my understanding, max-parallel changes the stage at which a job waits for available resources. By default, it's the queue, and max-parallel blocks the job even earlier, but neither approach can block for more than 24 hours. max-parallel will move the blocked jobs to the queue after GitHub kills the "deadlocked" job (i.e. one that has been running for 6h).

How does max-parallel help developers? In 99% of cases, the developer will not notice a "deadlock" in the tests before GitHub finishes or kills all the jobs and sends the results back.

alexbatashev (Contributor) commented
> How does max-parallel help developers? In 99% of cases, the developer will not notice a "deadlock" in the tests before GitHub finishes or kills all the jobs and sends the results back.

It doesn't help the one who caused the hang. But it should help everyone else.

bader (Contributor, Author) commented Mar 8, 2023

> But it should help everyone else.

Sorry for asking the same question, but I still don't understand how this setting helps other developers.
I'll probably leave it at the default for now and think about it again when we hit this problem. Right now I don't understand how setting max-parallel helps. 😕

bader (Contributor, Author) commented Mar 9, 2023

Talked offline with @aelovikov-intel and he gave me the following interpretation of the original intention behind the max-parallel use.

I'm going to use an abstract model to simplify the example and demonstrate the use case.
Let's assume the following setup: two runners and one workflow with two jobs running in parallel. We have a PR (let's call it the "bad PR") that has a "deadlock" bug in each job we run in CI, and because both jobs run in parallel, both runners will spin in the "deadlock". All PRs launched after our "bad PR" will have to wait for GitHub to free the runners (i.e. ~6h).
If we set max-parallel to 1 in the workflow and allow one PR to occupy only part of the available runners (1 of 2), other PRs will have a chance to continue execution even before GitHub kills the hung tasks.
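A minimal sketch of that model, using made-up workflow and runner names rather than the project's real configuration:

```yaml
# Abstract model: two runners share one label, the matrix has two jobs, and
# max-parallel: 1 lets a single PR occupy at most one runner at a time.
name: example-precommit
on: pull_request

jobs:
  test:
    strategy:
      max-parallel: 1                 # at most one job from this PR's workflow runs at a time
      matrix:
        config: [job-a, job-b]        # two illustrative jobs
    runs-on: [self-hosted, example-pool]   # illustrative label shared by the two runners
    steps:
      - name: Run tests
        run: echo "running ${{ matrix.config }}"   # stand-in for the real test step
```

With this, even if both jobs of a "bad PR" deadlock, they tie up only one runner at a time, leaving the other runner free for jobs from other PRs.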

@alexbatashev, did I get it correctly that the max-parallel setting is intended to guarantee that if one PR blocks its allocated runners, there will be other runners available to independently process jobs from "good" PRs?

alexbatashev (Contributor) commented
@bader you're right

bader (Contributor, Author) commented Mar 9, 2023

@alexbatashev, thanks for confirming.

@bader bader temporarily deployed to aws March 9, 2023 12:24 with GitHub Actions