-
Notifications
You must be signed in to change notification settings - Fork 333
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(kds): fix the case when webhook/db reject resource #10315
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
lukidzi
added
the
ci/run-full-matrix
PR: Runs all possible e2e test combination (expensive use carefully)
label
May 24, 2024
we shouldn't be holding up all types when only one fails |
Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
lukidzi
requested review from
jijiechen,
michaelbeaumont,
lobkovilya and
jakubdyszkiewicz
and removed request for
a team
May 28, 2024 12:53
jakubdyszkiewicz
requested changes
May 28, 2024
… into fix/kds-resource-not-synced
Test fails because it uses kuma 2.6.6 which also has this bug and needs backport |
jakubdyszkiewicz
requested changes
May 29, 2024
Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
lukidzi
added
backport
and removed
ci/run-full-matrix
PR: Runs all possible e2e test combination (expensive use carefully)
labels
May 29, 2024
jakubdyszkiewicz
approved these changes
May 29, 2024
backporting to release-2.6 with action |
kumahq bot
pushed a commit
that referenced
this pull request
May 29, 2024
Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
kumahq bot
pushed a commit
that referenced
this pull request
May 29, 2024
Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
This was referenced May 29, 2024
kumahq bot
pushed a commit
that referenced
this pull request
May 29, 2024
Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
kumahq bot
pushed a commit
that referenced
this pull request
May 29, 2024
Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
This was referenced May 29, 2024
kumahq bot
pushed a commit
that referenced
this pull request
May 29, 2024
Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
lukidzi
added a commit
that referenced
this pull request
May 29, 2024
…10315) (#10351) * fix(kds): fix the case when webhook/db reject resource (#10315) Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com> * fix conflict Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com> --------- Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com> Co-authored-by: Lukasz Dziedziak <lukidzi@gmail.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Checklist prior to review
Context
Lately we can observe that one specific e2e test fails quite often.
It fails with specific errors which leads to situation where KDS resources were not synced correctly:
This case checks if we have 3 MeshTimeout policies (2 defaults and 1 created)
Root cause
After debugging, I've noticed that we can reproduce this situation on K8s or manually (code/db change). It happens because mesh policies have an owner reference on the Mesh object. Once the Mesh object is not available on the cluster, default policies are rejected. Then, the resource_retry_forcer sleeps for 5 seconds before forcing the snapshot to recalculate the state for the specific resource and send the changes again. Unfortunately, when more than one resource type fails, it recalculates and goes into another sleep, waiting for another 5 seconds. This sleep part is blocking and holds up all the discovery requests for all resource types. Because of this, the version refreshes after the first 5 seconds, but the callback is unlocked after 15 seconds, causing that snapshot refreshed at this time.
Current solution
Instead of waiting on
OnStreamDeltaRequest
, we will wait onOnStreamDeltaRequest
. This will allow us to have a watch that observes changes, ensuring that we don't miss any updates. Additionally, we will wait for a backoff on the stream as a whole, rather than for each individual type.Reproduction
branch with changes https://github.com/lukidzi/kuma/pull/new/reproduce-bug
KUMA_MODE=global ./kuma-cp run
KUMA_MODE=zone KUMA_DIAGNOSTICS_SERVER_PORT=15680 KUMA_API_SERVER_HTTP_PORT=15681 KUMA_MULTIZONE_ZONE_NAME=zone KUMA_DP_SERVER_PORT=15678 KUMA_API_SERVER_HTTPS_PORT=15682 KUMA_MONITORING_ASSIGNMENT_SERVER_PORT=15676 KUMA_INTER_CP_SERVER_PORT=15683 KUMA_MULTIZONE_ZONE_KDS_TLS_SKIP_VERIFY=true KUMA_MULTIZONE_ZONE_GLOBAL_ADDRESS=grpcs://localhost:5685 ./kuma-cp run
./kumactl config control-planes switch --name global && ./kumactl apply -f mt.yam
syscall.Mkfifo
have equivalent implementation on the other OS --ci/
labels to run additional/fewer testsUPGRADE.md
? --