build-scaleway-x64-ubuntu-16-04-2 "installer" machine is build pipeline bottleneck #952
Comments
We need more machines capable of performing this function - one machine shared across docker and installers is not appropriate. For the docker work, only docker is required on the machine, so that should be easy to offload.
System being created -
Docker images are struggling to connect to external systems, therefore this is not currently working on the machine ...
@cmdc0de Do you know why GoDaddy-provisioned machines appear to have issues with external connectivity from docker containers? We've seen this in issue 721 as well.
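For illustration only (the image and target host below are arbitrary choices, not anything from this thread), one way to confirm the symptom is to compare an outbound request from inside a throwaway container with the same request made directly on the host:

```sh
# Example-only check: if this fails inside the container but the same
# lookup/download works directly on the host, the problem is the
# container networking on this machine rather than the network itself.
docker run --rm alpine:3 sh -c 'nslookup archive.ubuntu.com && wget -q -O /dev/null http://archive.ubuntu.com/ && echo "outbound OK"'
```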
A proposed fix is to spin up the ubuntu image by running
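The exact command has not been preserved in this comment. Purely as a hedged illustration, workarounds of this kind usually amount to passing an extra networking flag to `docker run`, for example host networking or an explicit DNS server:

```sh
# Illustrative only - the actual flag proposed above is not preserved here.
docker run --network=host -it ubuntu:16.04 /bin/bash   # share the host's network stack
docker run --dns 8.8.8.8  -it ubuntu:16.04 /bin/bash   # or force a known-good DNS server
```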
Considering switching away from GoDaddy for this purpose since I'd rather have a provider that works out of the box - will investigate...
@Haroon-Khel Do you know if that option can be set as the default so that we don't need to update the scripts to make it work properly?
@sxa555 I've looked through the documentation, but I can't seem to find a solution which sets that variable globally / as a default. There's a way to do it using Docker Compose files, but I think that would be slight overkill. Updating our existing build scripts would be our best bet, though this issue affects only our GoDaddy machines, yes?
Correct ... I suppose it would depend on how many places we needed to make the change in. It may be good to get @dinogun involved at this point to see if adding that option to each docker command is feasible and/or whether he knows of a way to default it globally.
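For reference, if the option in question is one the Docker daemon itself understands (DNS servers, MTU, bridge settings and so on), it can be defaulted host-wide in /etc/docker/daemon.json instead of on every docker command; per-run flags such as `--network=host` cannot be defaulted this way. A minimal sketch, assuming root access and treating the values as examples only:

```sh
# Sketch: set daemon-level defaults so individual docker commands need no changes.
# The DNS servers and MTU below are examples, not values from this thread.
sudo tee /etc/docker/daemon.json <<'EOF'
{
  "dns": ["8.8.8.8", "8.8.4.4"],
  "mtu": 1450
}
EOF
sudo systemctl restart docker
```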
Adding Top priority to this as we're holding up pipelines.
@karianna Is it still holding pipelines up? The original problem was the docker builds chewing all the resources on the machine, and the machine now has two executors to prevent that.
It's still only one host that we're relying on though, right? I think we should get rid of the single point of failure in that case.
Yes absolutely, but it's not currently holding up pipelines.
Yesterday, two parallel Docker jobs blocked that machine for hours. There's a thread on Slack in #infrastructure started by Simon.
OK thanks - I hadn't seen the system in a state where two docker jobs were running on it. That job should be single-threaded, I suspect, as I'm not sure it's safe to run it in parallel. @dinogun can you comment/confirm? [EDIT: Just checked and openjdk_build_docker_multiarch is set not to allow concurrent builds]
We had looked at moving this to another machine at GoDaddy, but the GoDaddy servers have unresolved issues with their networking in docker images. See also adoptium/temurin-build#1044, where we have logged a few single points of failure that exist in the build systems today.
If two docker jobs of the same type were running in parallel, that would be very strange (and ideally should never happen). Wondering if, for some reason, the
Incorrect for this case - that machine has two executors and therefore allows two jobs to run in parallel (which is why I said that despite being locked to this machine, the docker build shouldn't hold up other things, as they'll run on the second executor).
Hmm, can we somehow limit all docker-related jobs to only one executor then?
I've stopped the multiarch job for now. These are not designed to run together, as they periodically clean up all docker images on the box and so would cause both to fail. We need a way to restrict the docker jobs to only one executor.
OK - that hasn't been the case for some time - the builds will likely have been running together unless anything else has changed, although I don't recall anyone reporting it until this week.
Looking at setting up one or two more machines for this. It will potentially destabilise daily docker image creation on the Linux/x64 machine until we have clear setup instructions for the docker jobs. I've got a setup with the docker keys available and am running a test job on it at the moment. For obvious reasons this will take a while ;-) For future reference, the server used for this appears to require at least 8GB of RAM (4GB without swap wasn't enough - I might also try 4GB with a swapfile just to check).
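For reference, the 4GB-with-a-swapfile variant mentioned above would be the standard Ubuntu setup; a minimal sketch, assuming root access (the size is an example):

```sh
# Minimal swapfile setup on Ubuntu 16.04 (size is an example).
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Make it persist across reboots:
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
```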
The docker job ran successfully on docker-aws-ubuntu1604-x64-1. It also completed in 6h08m, instead of 31 hours for the last completed run on build-scaleway-x64-ubuntu-16-04-2. The manifest job triggered after that build ran on docker-aws-ubuntu1604-x64-2 and completed in 2h12m, where the previous runs on the scaleway box were up to 11 hours (maybe due to contention; the fastest of recent runs on that system was 5h29m). A follow-on job on docker-aws-ubuntu1604-x64-2 has also completed, slightly faster at 4h41m compared to the aforementioned 6h08m on the other new machine. I would not be surprised if these times dropped as the machines re-run the jobs and have more data cached locally. I have locked ("Keep this build forever") one multiarch and one manifest job from the old machine temporarily so we can compare output if needed.
@sxa555 This can happen if the auth is missing. Can you please check if the
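The file name being asked about has not been preserved in this comment. Assuming, purely as an illustration, that the missing auth is the Docker registry credential file, a quick check might look like:

```sh
# Hypothetical check - assumes the missing auth is the Docker Hub credential file.
ls -l ~/.docker/config.json   # is the credential file present under the expected name?
docker login                  # or re-authenticate and let docker rewrite it
```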
Fixed - I'd copied the file over with the wrong name onto that machine - apologies.
Can you do a quick check and see if
Yep, it's fine - I'm also re-running multiarch on x64 (we can restrict the jobs to specific combinations now instead of running all architectures!) to verify it.
I think it would be good if we can modify the scripts to return suitable non-zero exit codes in that situation (and others), in order to make it easier to understand the success or otherwise of the job from the Jenkins job status. I might have a go at that or assign one of my team to look at it.
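As a sketch of the kind of change meant here (the variable and directory names are placeholders, not the real build scripts), the image build loop could record failures and propagate a non-zero exit status so the Jenkins job result reflects them:

```sh
#!/bin/bash
# Sketch only: IMAGES_TO_BUILD and the dockerfiles/ layout are placeholders.
failed=()
for image in "${IMAGES_TO_BUILD[@]}"; do
  if ! docker build -t "$image" "dockerfiles/$image"; then
    failed+=("$image")
  fi
done

if [ "${#failed[@]}" -ne 0 ]; then
  echo "The following images failed to build: ${failed[*]}"
  exit 1   # non-zero exit so the Jenkins job status shows the failure
fi
```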
I have some upcoming changes that will fix this - in general, better reporting of failures and a summary of which specific docker images failed to build, if any.
Nightly builds are getting bottlenecked on this machine because it is the only machine capable of running the installer job, due to the required GPG keys. It is not helped by the fact that some very long-running jobs also run on it, e.g. Dockerfile builds taking 4+ hours and the openjdk_build_docker_multiarch docker & x64 jobs, which take 5-6 hours every day!