spire-server too high CPU usage #2827
That is suspicious, particularly with no load! If you turn on debug logs, is there anything in the logs that might indicate what the server is doing? Another option here is to capture a CPU profile when this is reproduced. You can do so by setting the following configurables:
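A minimal sketch of those settings in server.conf, assuming the standard SPIRE profiling options (the port value here is just an example):

```hcl
server {
    # ... existing server settings ...

    # Serve Go pprof endpoints so a CPU profile can be captured
    profiling_enabled = true
    profiling_port    = 9999
}
```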
Then you can use something like the following to capture the profile:
You can then attach that to the issue or analyze it yourself using |
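Assuming the profiling port serves the standard Go pprof HTTP handlers, capturing and inspecting a profile looks roughly like this (namespace, pod name, port, and file names are illustrative):

```sh
# Port-forward to the server pod if it runs inside the cluster
kubectl -n spire port-forward spire-server-0 9999:9999 &

# Capture a 30-second CPU profile while the CPU spike is happening
curl -o cpu.pprof "http://localhost:9999/debug/pprof/profile?seconds=30"

# Inspect the hottest call paths locally
go tool pprof -top cpu.pprof
```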
@azdagron I reproduced it again. I linked a log file and two different capture files that show a very similar state: |
Interesting. Looks like the k8sbundle notifier plugin is spinning trying to update the bundle after a server CA rotation. We need to add some more logging to figure out what's happening. I'll see if I can't replicate this locally. |
If you use the config I referred to, you will most likely be able to reproduce it. I managed to do so in several environments. |
Awesome, I can send a patch easily. Also, we can speed up time-to-reproduce by turning down the CA TTL, since this happens on CA rotation. Maybe set the |
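The setting being referred to is presumably the server's CA TTL; a sketch of the tweak in server.conf (the 1h value is only an example to force a quicker rotation):

```hcl
server {
    # Shorter CA TTL so a CA rotation (and the bug) happens sooner
    ca_ttl = "1h"
}
```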
I'm still working on reproducing in my environment, but here is a branch on my fork with extra logging: https://github.com/azdagron/spire/tree/add-k8sbundle-logging |
It's on top of the v1.1.0 release, so you don't have to worry about other changes. |
I set |
Oh, was this with an existing deployment? If so, your CA is probably still valid. |
small clarification: CA TTL changes don't take effect until SPIRE rotates into a new CA |
As I see it, the CA TTL has 24h as its default value. I don't understand why this happens within an hour if it depends on CA rotation. |
Oh, yes, you are right. How long was the cluster alive before the repro? Is it possible it had been running for a few days, enough that it would have stale CAs in the bundle that needed to be purged? I have a local environment up with the patched image and am just waiting for a repro myself. |
I reproduced the problem and it seems like the watch channel for the validation webhook is being closed unexpectedly causing an infinite loop on the select. Need to figure out how the channel is being closed. |
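For illustration only (not the SPIRE code itself), this standalone Go snippet shows why a closed channel turns a select loop into a busy loop: once the channel is closed, every receive succeeds immediately, so the loop never blocks:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	events := make(chan string)
	close(events) // simulate the API server closing the watch channel

	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()

	for i := 0; i < 3; i++ {
		select {
		case ev := <-events:
			// A receive on a closed channel returns immediately with the zero
			// value, so this case wins every iteration and the loop spins
			// instead of blocking -- a busy loop that burns a full core.
			fmt.Printf("spurious event %q from closed channel\n", ev)
		case <-ticker.C:
			fmt.Println("tick")
		}
	}
	// The loop is bounded here only so the example terminates; the real
	// notifier loops forever, hence the sustained CPU usage. Using the
	// two-value receive (ev, ok := <-events) and re-opening the watch when
	// ok is false avoids the spin.
}
```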
Good news that you have also seen it. I usually tried it on freshly created kind clusters that were definitely not running for days. As I wrote in the first comment, it usually happens after 50-60 mins. However, today I observed it after 34 and 35 minutes. Did you use the configuration I linked or your own? |
I used your config. It took somewhere between 30-60 minutes, but I wasn't paying complete attention when it started happening because I was in a meeting; eventually I noticed that my fans were exploding 😆 |
Hmm, we never explicitly stop the watcher, so it must be that an error has occurred on the watcher. Unfortunately my logs were too noisy and I lost the reason for the failure. I'll repro again with some slight changes to help capture it. |
|
The watch on the spiffe.io/webhook ValidatingWebhookConfiguration just closes without any sort of error being sent or anything. Guess I need to dig into the apimachinery repo to figure out why this would happen. |
I suspect what is happening is that the k8sbundle notifier uses the low-level apimachinery clients to do the watch. The API server has a request timeout of 30m by default, after which the request is closed and the watch channel is closed on the client side. Higher-level clients implement retries. We should probably switch to the high-level clients; alternatively, we need to implement our own retries. |
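A sketch of the "implement our own retries" alternative, with hypothetical package and helper names (startWatch stands in for whatever opens the watch on the ValidatingWebhookConfiguration):

```go
package k8sbundleretry

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/watch"
)

// watchWithRetry re-opens the watch whenever the event channel is closed,
// instead of spinning on a closed channel.
func watchWithRetry(ctx context.Context, startWatch func(context.Context) (<-chan watch.Event, error)) {
	for {
		events, err := startWatch(ctx)
		if err != nil {
			select {
			case <-time.After(5 * time.Second): // brief backoff before retrying
			case <-ctx.Done():
				return
			}
			continue
		}
	inner:
		for {
			select {
			case ev, ok := <-events:
				if !ok {
					// The API server closed the watch (it times these out
					// after ~30m by default, per the comment above); open a
					// fresh watch instead of spinning.
					break inner
				}
				_ = ev // react to the webhook change here
			case <-ctx.Done():
				return
			}
		}
	}
}
```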
I think we need to refactor the k8sbundle notifier to leverage informers from client-go instead of the raw watches. They will handle retries along with other transient conditions and keep things in sync. They will incur a small overhead by caching the resources in memory, but I suspect that to be negligible in normal circumstances. |
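For illustration, the informer-based shape would be roughly the following; the package name, function name, and resync interval are illustrative rather than the eventual SPIRE implementation:

```go
package k8sbundleinformer

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// startWebhookInformer watches ValidatingWebhookConfigurations through a
// shared informer, which re-establishes the underlying watch on failures
// and keeps a local cache in sync with the API server.
func startWebhookInformer(client kubernetes.Interface, stopCh <-chan struct{}) {
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute) // resync interval is illustrative
	informer := factory.Admissionregistration().V1().ValidatingWebhookConfigurations().Informer()

	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			// Re-apply the SPIRE trust bundle here when the webhook's
			// CA bundle changes (e.g. after a CA rotation).
		},
	})

	factory.Start(stopCh)
	cache.WaitForCacheSync(stopCh, informer.HasSynced)
}
```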
@faisal-memon Sure, I will do so and come back with my feedback today or tomorrow at the latest. |
@faisal-memon Something is wrong with the image. I get this error when I try to switch to it in my test environment: What architecture was the image created for? |
ARM, I have an M1 Mac. Let me see if I can get you an x86 build. |
If @faisal-memon submits a PR, the CI/CD pipeline will build a container that you can download from the archived artifacts on the action and import into docker. |
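For reference, once the image artifact is downloaded from the workflow run, importing it into Docker is roughly (file names are illustrative):

```sh
unzip spire-server-image.zip          # GitHub Actions artifacts download as a zip
docker load -i spire-server-image.tar # import the image tarball into the local Docker daemon
```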
Ok, WIP PR opened. That's a lot easier than trying to emulate x86 on ARM. |
Unfortunately, there is a GH action outage right now 😰 |
@azdagron @faisal-memon We are seeing this in NSM... and are hoping to get our NSM v1.3 release out (release candidate 2022-03-28 Release 2022-04-04). What's the ETA on getting this bug fixed? |
@edwarnicke @szvincze I pushed an x86 image, same tag |
For an ETA: I should be able to remove the WIP status by the end of this week, and then the code review process can begin.
|
@faisal-memon Now my test is running. I will come back with the results during the day. |
@faisal-memon Running for 3 hours, fans are silent, CPU is cold, spire-server is not on the screen in |
@faisal-memon Sounds like the fix works! When can we expect a release containing it? |
The PR is under active review, but it should be merged well in advance of (and included in) our 1.2.2 release, which should go out early to mid April. |
@azdagron Got it, so we will need to look at intermediate alternatives for our Apr 4 v1.3 release :) (like spinning a custom container). We'll be back on a release version as soon as we can manage :) |
I suspect the PR will be cleanly merge-able, so you may be able to get away with cherry-picking the resulting commit onto 1.2.1 :) |
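If you go that route, the backport would look roughly like this, with the commit hash as a placeholder for whatever the fix lands as:

```sh
git checkout -b k8sbundle-fix-1.2.1 v1.2.1   # branch off the v1.2.1 tag
git cherry-pick <fix-commit-sha>             # apply the k8sbundle fix commit
```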
Upgraded spire components to 1.3.0 to avoid spire-server too high CPU usage spiffe/spire#2827
spire-server is running with no agent and without any load on it. CPU consumption is under 1% for ~50-60 minutes, then suddenly it starts consuming ~150-160% CPU, and this persists until the spire-server is shut down. The same happens with all replicas.
Different environments show different results, but the CPU consumption always jumps much higher after a similar amount of time (~50-60 mins).
I observed the above-mentioned ~150-160% on a kind cluster, 95-100% on minikube, 100-110% on kvm/qemu, and 45-50% on a non-virtual k8s environment.
I tried the same with spire-server 1.0.1, 1.1.0 and 1.2.0 images and got the same results.
I also tried a configuration without the k8s-workload-registrar, but it did not help.
It most probably depends on the configuration I use, since I have not managed to reproduce it with the reference configuration.
Can you please tell me what is wrong in this configuration? What is the culprit and how can I fix it?
Thanks in advance,
Szilard