Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(kds): fix the case when webhook/db reject resource #10315

Merged
merged 10 commits into from
May 29, 2024

Conversation

lukidzi
Copy link
Contributor

@lukidzi lukidzi commented May 24, 2024

Checklist prior to review

Context

Lately we can observe that one specific e2e test fails quite often.

  [FAIL] Test upgrading Multizone with Helm chart [It] should sync policies from Global to Zone [job-2]
  github.com/kumahq/kuma/test/e2e/helm/kuma_helm_upgrade_multizone.go:135

It fails with specific errors which leads to situation where KDS resources were not synced correctly:

This case checks if we have 3 MeshTimeout policies (2 defaults and 1 created)

  Expected
      <int>: 1
  to equal
      <int>: 3

Root cause

After debugging, I've noticed that we can reproduce this situation on K8s or manually (code/db change). It happens because mesh policies have an owner reference on the Mesh object. Once the Mesh object is not available on the cluster, default policies are rejected. Then, the resource_retry_forcer sleeps for 5 seconds before forcing the snapshot to recalculate the state for the specific resource and send the changes again. Unfortunately, when more than one resource type fails, it recalculates and goes into another sleep, waiting for another 5 seconds. This sleep part is blocking and holds up all the discovery requests for all resource types. Because of this, the version refreshes after the first 5 seconds, but the callback is unlocked after 15 seconds, causing that snapshot refreshed at this time.

Current solution

Instead of waiting on OnStreamDeltaRequest, we will wait on OnStreamDeltaRequest. This will allow us to have a watch that observes changes, ensuring that we don't miss any updates. Additionally, we will wait for a backoff on the stream as a whole, rather than for each individual type.

Reproduction

branch with changes https://github.com/lukidzi/kuma/pull/new/reproduce-bug

  1. Start KUMA_MODE=global ./kuma-cp run
  2. Prepare MeshTimeout policy and configure kumactl to global, prepare command
  3. Start KUMA_MODE=zone KUMA_DIAGNOSTICS_SERVER_PORT=15680 KUMA_API_SERVER_HTTP_PORT=15681 KUMA_MULTIZONE_ZONE_NAME=zone KUMA_DP_SERVER_PORT=15678 KUMA_API_SERVER_HTTPS_PORT=15682 KUMA_MONITORING_ASSIGNMENT_SERVER_PORT=15676 KUMA_INTER_CP_SERVER_PORT=15683 KUMA_MULTIZONE_ZONE_KDS_TLS_SKIP_VERIFY=true KUMA_MULTIZONE_ZONE_GLOBAL_ADDRESS=grpcs://localhost:5685 ./kuma-cp run
  4. Once you see the kids rejection run ./kumactl config control-planes switch --name global && ./kumactl apply -f mt.yam
  5. After 20 seconds check if zone has 3 policies if no you reproduced the bug, if yes restart everything
  • Link to relevant issue as well as docs and UI issues --
  • This will not break child repos: it doesn't hardcode values (.e.g "kumahq" as a image registry) and it will work on Windows, system specific functions like syscall.Mkfifo have equivalent implementation on the other OS --
  • Tests (Unit test, E2E tests, manual test on universal and k8s) --
    • Don't forget ci/ labels to run additional/fewer tests
  • Do you need to update UPGRADE.md? --
  • Does it need to be backported according to the backporting policy? (this GH action will add "backport" label based on these file globs, if you want to prevent it from adding the "backport" label use no-backport-autolabel label) --

Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
@lukidzi lukidzi added the ci/run-full-matrix PR: Runs all possible e2e test combination (expensive use carefully) label May 24, 2024
@slonka
Copy link
Contributor

slonka commented May 27, 2024

we shouldn't be holding up all types when only one fails

Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
@lukidzi lukidzi marked this pull request as ready for review May 28, 2024 12:53
@lukidzi lukidzi requested a review from a team as a code owner May 28, 2024 12:53
@lukidzi lukidzi requested review from jijiechen, michaelbeaumont, lobkovilya and jakubdyszkiewicz and removed request for a team May 28, 2024 12:53
Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
@lukidzi
Copy link
Contributor Author

lukidzi commented May 28, 2024

Test fails because it uses kuma 2.6.6 which also has this bug and needs backport

Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
@lukidzi lukidzi added backport and removed ci/run-full-matrix PR: Runs all possible e2e test combination (expensive use carefully) labels May 29, 2024
@lukidzi lukidzi merged commit cd189ee into kumahq:master May 29, 2024
15 checks passed
Copy link
Contributor

github-actions bot commented May 29, 2024

backporting to release-2.4 with action

backporting to release-2.5 with action
backporting to release-2.7 with action
backporting to release-2.3 with action

Copy link
Contributor

backporting to release-2.6 with action

kumahq bot pushed a commit that referenced this pull request May 29, 2024
Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
kumahq bot pushed a commit that referenced this pull request May 29, 2024
Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
kumahq bot pushed a commit that referenced this pull request May 29, 2024
Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
kumahq bot pushed a commit that referenced this pull request May 29, 2024
Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
kumahq bot pushed a commit that referenced this pull request May 29, 2024
Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
lukidzi added a commit that referenced this pull request May 29, 2024
…10315) (#10351)

* fix(kds): fix the case when webhook/db reject resource (#10315)

Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>

* fix conflict

Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>

---------

Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
Co-authored-by: Lukasz Dziedziak <lukidzi@gmail.com>
lukidzi added a commit that referenced this pull request May 31, 2024
…10315) (#10353)


Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
Co-authored-by: Lukasz Dziedziak <lukidzi@gmail.com>
lukidzi added a commit that referenced this pull request May 31, 2024
…10315) (#10352)



Signed-off-by: Lukasz Dziedziak <lukidzi@gmail.com>
Co-authored-by: Lukasz Dziedziak <lukidzi@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants