[CI] Don't limit runners working in parallel #8567
Conversation
Currently we see a problem with llvm-test-suite checks on CUDA in the AWS cloud: it looks like an AWS machine is never allocated to execute the job. The current hypothesis is that we have a matrix job spawning 6 jobs, but we limit the number of runners working in parallel to 4. This means that the llvm-test-suite check on CUDA might have to wait for 1 or 2 llvm-test-suite checks on other machines to complete before we can run llvm-test-suite on the CUDA machine.

This strategy conflicts with the rules established for AWS machines, which wait for 1 hour after a machine is allocated for the job to start. If the job has not arrived, the machine is returned to AWS. Since AWS allocation is specific to each workflow run, once the machine is returned to AWS, GitHub is not able to use it anymore. GitHub waits for 24 hours and then kills the workflow due to the missing runner.

This change removes the limit on the number of runners we can use in parallel to run llvm-test-suite. This should fix the problem with missing results on CUDA machines.
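For reference, the change amounts to dropping the `max-parallel` cap from the matrix `strategy` block of the workflow. A minimal sketch of what that looks like, with made-up job names, matrix entries, and runner labels (the real workflow in this repository differs):

```yaml
jobs:
  llvm_test_suite:                    # job name is illustrative
    strategy:
      # max-parallel: 4    # the cap removed by this change; with a 6-entry
      #                    # matrix it forced up to 2 jobs to wait for earlier
      #                    # ones, which can exceed the 1-hour AWS idle window
      matrix:
        config: [gen9, gen12, cpu, esimd, cuda-aws, hip]   # 6 entries, names made up
    runs-on: [self-hosted, "${{ matrix.config }}"]         # labels are illustrative
    steps:
      - run: echo "run llvm-test-suite on ${{ matrix.config }}"
```

Without `max-parallel`, GitHub may start all matrix jobs at once, limited only by the number of available runners.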
@alexbatashev, do you remember why you added max-parallel in 495c300?
We had limited resources, and not adding max-parallel could have jammed subsequent pull requests if one of the PRs caused tests to hang.
Sorry, I don't get this. max-parallel was set to 4, which was exactly the size of the matrix, so effectively this setting matched the default behavior. It would only play a role once we increased the matrix size. What kind of problems do you mean by "jammed subsequent pull requests"? As I understand it, if one of the runners is stuck on hanging tests, the other runners can pick up the jobs. The stuck job will be killed within 6 hours, and a queued job can sit in the queue for up to 24 hours. I'm not sure I understand the problem you were trying to solve with this configuration, so I'm trying to figure out whether removing max-parallel is the right approach or I'm missing something.
I think at some point there were more than 4 jobs (or at least we were going to add more). If tests occupy all the available machines and hang, that means all PRs are stuck for at least 6 hours. That could hurt the developer experience when submitting PRs. If you now have more machines available than test jobs, it is absolutely safe to remove this limit.
If I understand it correctly, max-parallel delays the enqueuing of jobs. Let's assume we have 4 runners and they are broken in a way that hangs all jobs for 6 hours. In my opinion, the developer experience can be even worse with the max-parallel setting in place.
How does max-parallel help in this case?
@bader that is not the case I had in mind. Initially we had 5-ish machines available. If a developer submits a PR that contains some sort of a deadlock, and there are at least as many test configs as runners, this PR stalls all other PRs. There's no way to "unhang" the CI from the UI. AFAIK, GitHub will not send SIGKILL to the running process when clicking "cancel". So there must be a living person with access to the machine to go and reboot it. If all the machines are faulty (e.g., due to a driver fault), max-parallel does not help at all.
According to my understanding, max-parallel changes the stage at which a job waits for available resources. By default it's the queue; max-parallel blocks the job even earlier, but neither approach can block for more than 24 hours. max-parallel will move the blocked jobs into the queue after GitHub kills the "deadlocked" job (i.e., one running for 6h). How does max-parallel help developers? In 99% of cases, a developer will not notice a "deadlock" in the tests before GitHub finishes or kills all jobs and sends the results back.
It doesn't help the one who caused the hang. But it should help everyone else. |
Sorry for asking the same question, but I still don't understand how this setting helps other developers.
Talked offline with @aelovikov-intel and he gave me the following interpretation of the original intention behind the max-parallel use. I'm going to use an abstract model to simplify the example and demonstrate the usage case. @alexbatashev, did I get it correctly that the max-parallel setting is intended to guarantee that if one PR blocks its allocated runners, there will still be other runners to independently process jobs from "good" PRs?
@bader you're right |
@alexbatashev, thanks for confirming. |