Dockerhub rate limit broke the www.jenkins.io CI build #4192
FYI it broke a https://github.com/jenkinsci/acceptance-test-harness PR build as well, but I was able to retry successfully about an hour and a half later.
About www.jenkins.io (I'll focus on ATH in a second pass):
=> it was prone to happen since we moved to NAT gateways a few months ago. Let us open a PR to run the …
@basil can you confirm that the rate limit issue with the ATH build was with the test "additional" Docker images (and not the …)? I'm asking in order to think about a possible ACP-like setup for the Docker Engine, with a "pull-through" cache as per https://docs.docker.com/docker-hub/mirror/, for ci.jenkins.io.
From a quick look it's likely the same issue, though the actual build logs of the Docker image aren't archived.
Yes, this was a rate limit error while fetching containers for use during tests. I didn't encounter any problems building or fetching the …
I've opened jenkinsci/acceptance-test-harness#1634 to set up an authenticated Docker Engine during tests.
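For context, authenticating the engine is a one-liner before the tests run; a minimal sketch, assuming `DOCKERHUB_USERNAME` and `DOCKERHUB_TOKEN` are provided by the CI environment (illustrative variable names, not the actual ATH secrets):

```bash
# Hedged sketch: authenticate the Docker Engine so pulls count against an
# account's rate limit rather than the shared NAT gateway IP.
# DOCKERHUB_USERNAME / DOCKERHUB_TOKEN are assumed, illustrative variables.
echo "$DOCKERHUB_TOKEN" | docker login --username "$DOCKERHUB_USERNAME" --password-stdin
```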
Closing as: …
Thanks folks!
Reopening as we saw a collection of …
A solution to limit this kind of impact would be for us to run registry pull-through caches (see https://docs.docker.com/docker-hub/mirror/) in the ci.jenkins.io agent networks (all VMs and Linux containers).
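As a sketch of what such a pull-through cache could look like (per the Docker docs linked above; the port, container name, and credential variables are illustrative assumptions):

```bash
# Run a registry:2 container as a Docker Hub pull-through cache, following
# https://docs.docker.com/docker-hub/mirror/. The proxy credentials are
# optional but avoid the anonymous per-IP rate limit.
docker run -d --restart=always --name registry-mirror -p 5000:5000 \
  -e REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io \
  -e REGISTRY_PROXY_USERNAME="$DOCKERHUB_USERNAME" \
  -e REGISTRY_PROXY_PASSWORD="$DOCKERHUB_TOKEN" \
  registry:2
```

Agents would then point their Docker Engine at `http://<mirror-host>:5000` via the engine's `registry-mirrors` setting.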
@basil @timja I'm continuing the discussion from jenkinsci/acceptance-test-harness#1640 (comment) here: I'm not sure how to identify the failure, and I would need help navigating the ATH build and test results. With that, I should be more autonomous in finding failures, understanding them, and providing solutions.
🤔 what is the reason to use nested containers? (BTW, DinD is a nightmare to configure regarding …)
I see https://github.com/jenkinsci/acceptance-test-harness/blob/4904fec29f49dedca64214757f8a7898ffa9a329/ath-container.sh#L37 and it looks like it is not DinD (i.e. a nested container engine) but DonD (Docker on Docker, i.e. sharing the socket). Is my understanding correct?
Yes, your understanding is correct. I'm not sure either would need testing.
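For illustration, a minimal DonD invocation (not the exact `ath-container.sh` contents): the container reuses the host's engine through its socket, so no nested engine runs and the host's registry configuration applies to all pulls:

```bash
# DonD ("Docker on Docker"): mount the host's Docker socket so the docker
# CLI inside the container drives the host engine instead of a nested one.
docker run --rm -it \
  -v /var/run/docker.sock:/var/run/docker.sock \
  docker:cli docker ps
```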
If it is DonD, then the ACR will be a good solution, as the pull-through cache setup is on the engine side \o/
But I fail to see the relation between these errors and the rate limit :|
It might be this fix that was just pushed: jenkinsci/acceptance-test-harness@04f64ef
Oh yeah, this change might fix it!
Thanks @basil for the details. In order to tackle these HTTP/429 errors, I propose the following course of action: …
=> once this setup is in place, we'll look at the results
@dduportal and I got the ACR option working and have tested it on ci.jenkins.io. @dduportal is going to finish off the Terraform automation and update the JCasC config. It looks like our users aren't rate limited but were probably hitting some anti-abuse protection; this should help with that and is expected to get rid of any rate-limiting issues. It will also mean that anything on ci.jenkins.io Azure doesn't need to log in anymore, as the Docker daemons are going to have a registry mirror set to point them at the ACR cache.
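A sketch of that engine-side setting, assuming the mirror hostname used later in this thread (the config file path and restart command vary by platform):

```bash
# Point the Docker Engine at the ACR pull-through cache; the engine falls
# back to Docker Hub automatically if the mirror is unreachable.
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "registry-mirrors": ["https://dockerhubmirror.azurecr.io"]
}
EOF
sudo systemctl restart docker
```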
… inside the Jenkins Azure infrastructure (#794)

Related to jenkins-infra/helpdesk#4192. Fixup of 91cf2dc.

Reference Azure documentation: https://learn.microsoft.com/en-us/azure/container-registry/container-registry-artifact-cache?pivots=development-environment-azure-portal

This PR introduces an Azure Container Registry set up as a DockerHub mirror using a "Cache Rule" which mirrors `docker.io/*` to `*` (note: it forbids us from using other caching mechanisms!). This registry has the following properties:

- Only available in the "sponsorship" subscription
- Anonymous pull access (a constraint due to the Docker pull-through cache - moby/moby#30880)
- Private network only: since we have an anonymous pull policy (see above), we restrict access to only a subset of private networks. It uses ["Azure Private Endpoints"](https://learn.microsoft.com/en-us/azure/private-link/private-endpoint-overview) for this
  - Note: this implies using Private DNS zones linked to networks. These zones might need to be reused in the future for other private links if required

The registry is available to the following (heavy DockerHub user) services (I've only set up the Azure ephemeral VM agent subnets for now) through a combination of a private endpoint with a NIC in the subnet, a private DNS zone with automatic records, and inbound and outbound NSG rules:

- ci.jenkins.io
- cert.ci.jenkins.io
- trusted.jenkins.io
- infra.jenkins.io

Azure makes it mandatory to log in to DockerHub for such a mirror system. As such, we use a distinct token stored in an Azure Key Vault which is "Public Images Read Only" and associated with the `jenkinsciinfra` organization, to avoid the "application" rate limit (e.g. 5k pulls / day / IP) and only have the DockerHub anti-abuse system as the upper limit (which seems to be a combination of request count and amount of data).

----

*Testing and approving*

This PR is expected to have no changes in the plan, as it was applied manually:

- End-to-end testing was done on each controller by:
  - Starting an Azure ephemeral VM agent using a pipeline replay with the correct label
  - The pipeline tries to resolve the DNS name `dockerhubmirror.azurecr.io`, which should resolve to an IP local to the VM subnet
  - Once the VM is up, checking the connectivity in the Azure UI portal (`Network Watcher` -> `Connection troubleshoot`)
    - Source VM is the agent VM, whose name is retrieved from the build log
    - Destination is `https://dockerhubmirror.azurecr.io`
- The bootstrap must be done in 2 `terraform apply` commands as documented, because the ACR component `CredentialSet` is not supported by Terraform yet (see comments in the TF code).

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
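For reference, a hedged sketch of the manual bootstrap step that Terraform could not cover at the time (the `az acr credential-set` / `az acr cache` flags are written from the Azure CLI artifact-cache documentation and should be verified against the current CLI; resource names and Key Vault secret URIs are illustrative):

```bash
# Create the DockerHub credential set (backed by Key Vault secrets), then
# the wildcard cache rule mirroring docker.io/* to * on the registry.
az acr credential-set create \
  --registry dockerhubmirror --name dockerhub-creds \
  --login-server docker.io \
  --username-secret-id "$USERNAME_SECRET_URI" \
  --password-secret-id "$PASSWORD_SECRET_URI"

az acr cache create \
  --registry dockerhubmirror --name dockerhub-mirror-rule \
  --source-repo 'docker.io/*' --target-repo '*' \
  --cred-set dockerhub-creds
```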
Update:
We can now roll back the …
All changes have been reverted, but I'll keep this issue open until the 13th in case we see other issues.
Update:
My only concern is that some images or tags are still absent unless we explicitly pull them first. Does it make sense, @timja? Have you already seen this behavior in your own infrastructure?
Hmm, not sure; we use it slightly differently and explicitly use the cached version. It won't show up in the cache unless one pull has been completed. But if it's increasing in size, it's definitely caching something.
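A hypothetical way to confirm the cache is filling up: pull one image explicitly through the mirror, then list what the registry holds (the `alpine` image and tag are arbitrary examples; `dockerhubmirror` is the ACR name from this thread):

```bash
# Warm the cache with one explicit pull, then inspect the cached repositories.
docker pull dockerhubmirror.azurecr.io/library/alpine:3.20
az acr repository list --name dockerhubmirror --output table
```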
Closing, as we did not have any more errors. Feel free to reopen if you see some.
Service(s)
ci.jenkins.io
Summary
The ci.jenkins.io job that builds the www.jenkins.io web site failed its most recent build with a Docker Hub rate-limit error message.
I've restarted the build in hopes that it will not hit the rate limit.
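For future triage, Docker documents a way to check the remaining anonymous pulls without consuming one; a sketch (requires `curl` and `jq`):

```bash
# Query the documented rate-limit probe image and read the ratelimit headers
# (see https://docs.docker.com/docker-hub/download-rate-limit/).
TOKEN=$(curl -fsS "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" \
  | jq -r .token)
curl -fsS --head -H "Authorization: Bearer $TOKEN" \
  "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest" \
  | grep -i ratelimit
```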
Reproduction steps