Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proxy reset with pagination and reconciliation requeue #926

Merged
merged 23 commits into from
Jul 18, 2024

Conversation

videlov
Copy link
Collaborator

@videlov videlov commented Jul 8, 2024

Description

Changes proposed in this pull request:

  • Proxy reset with pagination and reconciliation requeue

Pre-Merge Checklist

  • As a PR reviewer, verify code coverage and evaluate if it is acceptable.

Related issues

#155

@kyma-bot kyma-bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 8, 2024
@kyma-bot
Copy link
Contributor

kyma-bot commented Jul 8, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@kyma-bot kyma-bot added the cla: yes Indicates the PR's author has signed the CLA. label Jul 8, 2024
@kyma-bot kyma-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 8, 2024
@kyma-bot kyma-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 9, 2024
@videlov videlov marked this pull request as ready for review July 11, 2024 11:11
@videlov videlov requested review from a team as code owners July 11, 2024 11:11
@kyma-bot kyma-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 11, 2024
@werdes72
Copy link
Collaborator

I just tested this code using 8 deployments with 22 replicas (pods) each.
The first run seems to have "joined" containers that belong to the same deployment:

2024-07-17T08:55:17.974314Z	info	controlleruntime	Running proxy sidecar reset	expected image=europe-docker.pkg.dev/kyma-project/prod/external/istio/proxyv2:1.22.2-distroless
2024-07-17T08:55:18.156964Z	info	controlleruntime	Got running pods for proxy restart	number of pods=100 has more pods=true
2024-07-17T08:55:18.157156Z	info	controlleruntime	Filtered pods with Istio sidecar	number of pods=31
2024-07-17T08:55:18.191781Z	info	controlleruntime	Got running pods for proxy restart	number of pods=100 has more pods=true
2024-07-17T08:55:18.192035Z	info	controlleruntime	Filtered pods with Istio sidecar	number of pods=100
2024-07-17T08:55:18.192076Z	info	controlleruntime	Pods to restart	number of pods=30 has more pods=true
2024-07-17T08:55:18.195720Z	info	controlleruntime	Roll out pod due to proxy restart	name=httpbin-1 namespace=test
2024-07-17T08:55:18.284005Z	info	controlleruntime	Roll out pod due to proxy restart	name=httpbin-2 namespace=test
2024-07-17T08:55:18.362620Z	info	controlleruntime	Proxy reset only partially completed
2024-07-17T08:55:18.377810Z	info	controlleruntime	Reconcile requeued

The problem is that in the next reconciliation it once again restarts httpbin-1 deployment:

2024-07-17T08:55:34.189453Z	info	controlleruntime	Running proxy sidecar reset	expected image=europe-docker.pkg.dev/kyma-project/prod/external/istio/proxyv2:1.22.2-distroless
2024-07-17T08:55:34.284951Z	info	controlleruntime	Got running pods for proxy restart	number of pods=100 has more pods=true
2024-07-17T08:55:34.285264Z	info	controlleruntime	Filtered pods with Istio sidecar	number of pods=34
2024-07-17T08:55:34.338815Z	info	controlleruntime	Got running pods for proxy restart	number of pods=100 has more pods=true
2024-07-17T08:55:34.339109Z	info	controlleruntime	Filtered pods with Istio sidecar	number of pods=100
2024-07-17T08:55:34.339148Z	info	controlleruntime	Pods to restart	number of pods=30 has more pods=true
2024-07-17T08:55:34.344185Z	info	controlleruntime	Roll out pod due to proxy restart	name=httpbin-1 namespace=test
2024-07-17T08:55:34.502469Z	info	controlleruntime	Roll out pod due to proxy restart	name=httpbin-2 namespace=test
2024-07-17T08:55:34.581657Z	info	controlleruntime	Roll out pod due to proxy restart	name=httpbin-3 namespace=test
2024-07-17T08:55:34.596141Z	info	controlleruntime	Proxy reset only partially completed

And then 3rd restart of the same deployment:

2024-07-17T08:55:55.088211Z	info	controlleruntime	Running proxy sidecar reset	expected image=europe-docker.pkg.dev/kyma-project/prod/external/istio/proxyv2:1.22.2-distroless
2024-07-17T08:55:55.189955Z	info	controlleruntime	Got running pods for proxy restart	number of pods=100 has more pods=true
2024-07-17T08:55:55.190329Z	info	controlleruntime	Filtered pods with Istio sidecar	number of pods=36
2024-07-17T08:55:55.248841Z	info	controlleruntime	Got running pods for proxy restart	number of pods=100 has more pods=true
2024-07-17T08:55:55.249220Z	info	controlleruntime	Filtered pods with Istio sidecar	number of pods=93
2024-07-17T08:55:55.249346Z	info	controlleruntime	Pods to restart	number of pods=30 has more pods=true
2024-07-17T08:55:55.256246Z	info	controlleruntime	Roll out pod due to proxy restart	name=httpbin-1 namespace=test
2024-07-17T08:55:55.323544Z	info	controlleruntime	Roll out pod due to proxy restart	name=httpbin-2 namespace=test
2024-07-17T08:55:55.410893Z	info	controlleruntime	Roll out pod due to proxy restart	name=httpbin-3 namespace=test
2024-07-17T08:55:55.479599Z	info	controlleruntime	Roll out pod due to proxy restart	name=httpbin-4 namespace=test
2024-07-17T08:55:55.566509Z	info	controlleruntime	Proxy reset only partially completed

Also, since not restarts were done, at the same time there are 4 deployments being restarted (in this case 88 pods). In the example below there are even 5 of them:

2024-07-17T08:57:36.970464Z	info	controlleruntime	Running proxy sidecar reset	expected image=europe-docker.pkg.dev/kyma-project/prod/external/istio/proxyv2:1.22.2-distroless
2024-07-17T08:57:37.172276Z	info	controlleruntime	Got running pods for proxy restart	number of pods=100 has more pods=true
2024-07-17T08:57:37.172496Z	info	controlleruntime	Filtered pods with Istio sidecar	number of pods=43
2024-07-17T08:57:37.246750Z	info	controlleruntime	Got running pods for proxy restart	number of pods=100 has more pods=true
2024-07-17T08:57:37.247062Z	info	controlleruntime	Filtered pods with Istio sidecar	number of pods=91
2024-07-17T08:57:37.265578Z	info	controlleruntime	Got running pods for proxy restart	number of pods=24 has more pods=false
2024-07-17T08:57:37.265681Z	info	controlleruntime	Filtered pods with Istio sidecar	number of pods=24
2024-07-17T08:57:37.265721Z	info	controlleruntime	Pods to restart	number of pods=30 has more pods=false
2024-07-17T08:57:37.270746Z	info	controlleruntime	Roll out pod due to proxy restart	name=httpbin-5 namespace=test
2024-07-17T08:57:37.290869Z	info	controlleruntime	Roll out pod due to proxy restart	name=httpbin-6 namespace=test
2024-07-17T08:57:37.380357Z	info	controlleruntime	Roll out pod due to proxy restart	name=httpbin-7 namespace=test
2024-07-17T08:57:37.512677Z	info	controlleruntime	Roll out pod due to proxy restart	name=httpbin-8 namespace=test
2024-07-17T08:57:37.554794Z	info	controlleruntime	Proxy reset completed

It would be good to have some kind of wait/check mechanism to prevent multiple restarts of the same deployment.

@werdes72
Copy link
Collaborator

After adding one minute wait, the process looks much better, with 20 pods per deployment it restarts 40 pods at the same time:

2024-07-17T11:06:28.480447Z	info	controlleruntime	Got running pods for proxy restart	number of pods=100 has more pods=true
2024-07-17T11:06:28.480629Z	info	controlleruntime	Filtered pods with Istio sidecar	number of pods=49
2024-07-17T11:06:28.480860Z	info	controlleruntime	Pods to restart	number of pods=30 has more pods=true
2024-07-17T11:06:28.484596Z	info	controlleruntime	Roll out pod due to proxy restart	name=httpbin-1 namespace=test
2024-07-17T11:06:28.577881Z	info	controlleruntime	Roll out pod due to proxy restart	name=httpbin-2 namespace=test
---
2024-07-17T11:07:37.458548Z	info	controlleruntime	Got running pods for proxy restart	number of pods=100 has more pods=true
2024-07-17T11:07:37.458710Z	info	controlleruntime	Filtered pods with Istio sidecar	number of pods=49
2024-07-17T11:07:37.498941Z	info	controlleruntime	Got running pods for proxy restart	number of pods=100 has more pods=true
2024-07-17T11:07:37.499178Z	info	controlleruntime	Filtered pods with Istio sidecar	number of pods=100
2024-07-17T11:07:37.499309Z	info	controlleruntime	Pods to restart	number of pods=30 has more pods=true
2024-07-17T11:07:37.502948Z	info	controlleruntime	Roll out pod due to proxy restart	name=httpbin-3 namespace=test
2024-07-17T11:07:37.598163Z	info	controlleruntime	Roll out pod due to proxy restart	name=httpbin-4 namespace=test
---
2024-07-17T11:08:47.089517Z	info	controlleruntime	Got running pods for proxy restart	number of pods=100 has more pods=true
2024-07-17T11:08:47.089721Z	info	controlleruntime	Filtered pods with Istio sidecar	number of pods=48
2024-07-17T11:08:47.150624Z	info	controlleruntime	Got running pods for proxy restart	number of pods=100 has more pods=true
2024-07-17T11:08:47.151435Z	info	controlleruntime	Filtered pods with Istio sidecar	number of pods=100
2024-07-17T11:08:47.152158Z	info	controlleruntime	Pods to restart	number of pods=30 has more pods=true
2024-07-17T11:08:47.156164Z	info	controlleruntime	Roll out pod due to proxy restart	name=httpbin-5 namespace=test
2024-07-17T11:08:47.265287Z	info	controlleruntime	Roll out pod due to proxy restart	name=httpbin-6 namespace=test
---
2024-07-17T11:09:56.572912Z	info	controlleruntime	Got running pods for proxy restart	number of pods=100 has more pods=true
2024-07-17T11:09:56.573076Z	info	controlleruntime	Filtered pods with Istio sidecar	number of pods=48
2024-07-17T11:09:56.611809Z	info	controlleruntime	Got running pods for proxy restart	number of pods=100 has more pods=true
2024-07-17T11:09:56.612045Z	info	controlleruntime	Filtered pods with Istio sidecar	number of pods=100
2024-07-17T11:09:56.621784Z	info	controlleruntime	Got running pods for proxy restart	number of pods=12 has more pods=false
2024-07-17T11:09:56.621833Z	info	controlleruntime	Filtered pods with Istio sidecar	number of pods=12
2024-07-17T11:09:56.621855Z	info	controlleruntime	Pods to restart	number of pods=30 has more pods=false
2024-07-17T11:09:56.624963Z	info	controlleruntime	Roll out pod due to proxy restart	name=httpbin-7 namespace=test
2024-07-17T11:09:56.742104Z	info	controlleruntime	Roll out pod due to proxy restart	name=httpbin-8 namespace=test
2024-07-17T11:09:56.975210Z	info	controlleruntime	Proxy reset completed
---
2024-07-17T11:10:31.029368Z	info	controlleruntime	Got running pods for proxy restart	number of pods=100 has more pods=true
2024-07-17T11:10:31.029571Z	info	controlleruntime	Filtered pods with Istio sidecar	number of pods=47
2024-07-17T11:10:31.075916Z	info	controlleruntime	Got running pods for proxy restart	number of pods=100 has more pods=true
2024-07-17T11:10:31.076110Z	info	controlleruntime	Filtered pods with Istio sidecar	number of pods=90
2024-07-17T11:10:31.090040Z	info	controlleruntime	Got running pods for proxy restart	number of pods=16 has more pods=false
2024-07-17T11:10:31.090109Z	info	controlleruntime	Filtered pods with Istio sidecar	number of pods=13
2024-07-17T11:10:31.090300Z	info	controlleruntime	Pods to restart	number of pods=2 has more pods=false
2024-07-17T11:10:31.094559Z	info	controlleruntime	Roll out pod due to proxy restart	name=httpbin-8 namespace=test

There was some other trigger for reconcile process which made httpbin-8 restart twice.

@videlov videlov requested a review from werdes72 July 17, 2024 11:38
werdes72
werdes72 previously approved these changes Jul 17, 2024
@kyma-bot kyma-bot added the lgtm Looks good to me! label Jul 17, 2024
docs/contributor/04-10-technical-design.md Outdated Show resolved Hide resolved
docs/contributor/04-10-technical-design.md Outdated Show resolved Hide resolved
docs/contributor/04-10-technical-design.md Outdated Show resolved Hide resolved
docs/contributor/04-10-technical-design.md Outdated Show resolved Hide resolved
docs/contributor/04-10-technical-design.md Outdated Show resolved Hide resolved
docs/release-notes/1.9.0.md Outdated Show resolved Hide resolved
Co-authored-by: Natalia Sitko <80401180+nataliasitko@users.noreply.github.com>
@kyma-bot kyma-bot removed the lgtm Looks good to me! label Jul 18, 2024
videlov and others added 4 commits July 18, 2024 09:29
Co-authored-by: Natalia Sitko <80401180+nataliasitko@users.noreply.github.com>
Co-authored-by: Natalia Sitko <80401180+nataliasitko@users.noreply.github.com>
Co-authored-by: Natalia Sitko <80401180+nataliasitko@users.noreply.github.com>
Co-authored-by: Natalia Sitko <80401180+nataliasitko@users.noreply.github.com>
Co-authored-by: Natalia Sitko <80401180+nataliasitko@users.noreply.github.com>
@kyma-bot kyma-bot added the lgtm Looks good to me! label Jul 18, 2024
@kyma-bot kyma-bot merged commit 4d26ebd into kyma-project:main Jul 18, 2024
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes Indicates the PR's author has signed the CLA. lgtm Looks good to me! size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants