image-promotion hits 429 quota limits #1271

Open
chrischdi opened this issue Apr 5, 2024 · 16 comments
Labels
area/release-eng Issues or PRs related to the Release Engineering subproject kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. sig/release Categorizes an issue or PR as relevant to SIG Release.

Comments

@chrischdi
Member

What happened:

  • The image promotion job ran
  • Image promotion failed due to unexpected status code 429 Too Many Requests

See https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/post-k8sio-image-promo/1776261613632884736

What you expected to happen:

  • Image promotion to succeed

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

This issue already occurred in the past and was mistakenly reported at

Ben pointed out that:

The image promoter makes a really high amount of API calls because of the approach to image signatures.
We have not changed the quotas in the infrastructure projects.

So there may be potential to optimise promo-tools so that it needs fewer API calls and stays under the limit.

Environment:

See the prowjob :-)

  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Others:
@chrischdi chrischdi added area/release-eng Issues or PRs related to the Release Engineering subproject kind/bug Categorizes issue or PR as related to a bug. sig/release Categorizes an issue or PR as relevant to SIG Release. labels Apr 5, 2024
@chrischdi
Member Author

chrischdi commented Apr 8, 2024

I looked through the code a bit:

  • kpromo normally uses a rate-limiter when using the crane library
  • when using sigs.k8s.io/release-sdk/sign, e.g. in signAndReplicate (here), kpromo does not set the transport that adds the rate-limiter, because release-sdk does not allow us to.

@chrischdi
Member Author

Instead of adding rate-limiting, another possibility would be to look into release-sdk and/or cosign and reduce the number of API calls made.

@xmudrii
Member

xmudrii commented Apr 8, 2024

This is a known issue and we're planning a larger refactor of the promo-tools code base, see other issues in this repo for more information.

@sbueringer
Member

sbueringer commented Apr 8, 2024

This is a known issue and we're planning a larger refactor of the promo-tools code base, see other issues in this repo for more information.

What is the recommended action when our image promotions are failing with this error? I'm wondering how our users will be affected.

@xmudrii
Member

xmudrii commented Apr 8, 2024

What is the recommended action when our image promotions are failing with this error? I'm wondering how our users will be affected.

If promotion fails with an error such as:

run `cip run`: promote images: signing images: replicating signatures: copying signature ...

It's generally safe to ignore it. If it fails with any other error, the job should be restarted. You can ping Release Managers in the #release-management Slack channel to restart the job for you.

It shouldn't affect the ability to consume images, but signatures might not work properly or at all if this error happens. Unfortunately, there's not much we can do at this point, but we hope we'll be able to kick off the promo-tools refactor efforts soon.

@cahillsf
Member

similar failures in the patch release and minor releases for CAPI today. one patch release failing at the signing stage: https://prow.k8s.io/log?job=post-k8sio-image-promo&id=1780295493562142720

time="18:09:05.150" level=fatal msg="run `cip run`: promote images: signing images: replicating signatures: copying signature us-west2-docker.pkg.dev/k8s-artifacts-prod/images/cluster-api/clusterctl:sha256-e35d576ae8922459d284077fed7b2a49447b4cb835c69312327c52d75dafa8a4.sig to southamerica-west1-docker.pkg.dev/k8s-artifacts-prod/images/cluster-api/clusterctl:sha256-e35d576ae8922459d284077fed7b2a49447b4cb835c69312327c52d75dafa8a4.sig: PUT https://southamerica-west1-docker.pkg.dev/v2/k8s-artifacts-prod/images/cluster-api/clusterctl/manifests/sha256-e35d576ae8922459d284077fed7b2a49447b4cb835c69312327c52d75dafa8a4.sig: TOOMANYREQUESTS: Quota exceeded for quota metric 'Requests per project per user' and limit 'Requests per project per user per minute per user' of service 'artifactregistry.googleapis.com' for consumer 'project_number:388270116193'. (and 1 more errors)" diff=4.378s
{"component":"entrypoint","error":"wrapped process failed: exit status 

and the minor release job failing at filtering edges: https://prow.k8s.io/log?job=post-k8sio-image-promo&id=1780297426096099328

time="18:10:24.256" level=fatal msg="run `cip run`: promote images: filtering edges: filtering promotion edges: reading registries: getting tag list: GET https://us-central1-docker.pkg.dev/v2/token?scope=repository%3Ak8s-artifacts-prod%2Fimages%2Fcluster-api%2Fclusterctl%3Apull&service=: TOOMANYREQUESTS: Quota exceeded for quota metric 'Requests per project per user' and limit 'Requests per project per user per minute per user' of service 'artifactregistry.googleapis.com' for consumer 'project_number:388270116193'." diff=28ms
{"component":"entrypoint","error":"wrapped process failed: exit status 

@xmudrii
Member

xmudrii commented Apr 16, 2024

The first failure can be ignored, the second job should be restarted. Can you please send a link to the job so that we can restart it?

@cahillsf
Member

cahillsf commented Apr 16, 2024

@xmudrii
Member

xmudrii commented Apr 16, 2024

@cahillsf Restarted the job and now it's green https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/post-k8sio-image-promo/1780300931636662272

@cahillsf
Member

thanks for your help @xmudrii !

@BenTheElder
Member

possibly related: #842

@BenTheElder
Member

hit this with v1.30 release kubernetes/kubernetes#126170

also the initial promo job didn't report failure, I think? but we didn't have all regions synced

@BenTheElder
Member

time="19:28:06.925" level=info msg="Registry: gcr.io/k8s-staging-scheduler-plugins Image: controller Got: gcr.io/k8s-staging-scheduler-plugins/controller" diff=141ms
time="19:28:07.077" level=fatal msg="run cip run: promote images: filtering edges: filtering promotion edges: reading registries: getting tag list: GET https://us-west1-docker.pkg.dev/v2/token?scope=repository%3Ak8s-artifacts-prod%2Fimages%2Fsig-storage%2Fsnapshot-controller%3Apull&service=: TOOMANYREQUESTS: Quota exceeded for quota metric 'Requests per project per region' and limit 'Requests per project per region per minute per region' of service 'artifactregistry.googleapis.com' for consumer 'project_number:388270116193'." diff=152ms

I'm guessing there is a gap in using the rate-limit aware client.
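
Note that the log line above names the quota metric 'Requests per project per region', whereas the earlier failures in this thread hit 'Requests per project per user'. When comparing failures, it can help to pull the metric out of the message. The following is a rough sketch; quotaMetric and quotaMetricRE are hypothetical names, and the pattern assumes the message format seen in the logs in this issue.

```go
package main

import "regexp"

// quotaMetricRE matches the quota metric name in messages shaped like the
// TOOMANYREQUESTS errors quoted in this issue (format assumed from the logs).
var quotaMetricRE = regexp.MustCompile(`quota metric '([^']+)'`)

// quotaMetric returns the quota metric named in msg, or "" if the message
// does not contain one.
func quotaMetric(msg string) string {
	m := quotaMetricRE.FindStringSubmatch(msg)
	if m == nil {
		return ""
	}
	return m[1]
}
```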

@BenTheElder
Member

#842 ?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 15, 2024
@BenTheElder
Member

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 16, 2024
7 participants