KDS delta sometimes drops resource kinds for a few seconds #9455

Closed
nicoche opened this issue Feb 29, 2024 · 4 comments
Labels: kind/bug, triage/rotten

Comments

@nicoche
Contributor

nicoche commented Feb 29, 2024

What happened?

Sometimes, when a zone's connection to the global CP is dropped, KDS detects that some resources have disappeared. Once the zone connection is re-established, KDS sees the resources as existing again.
In the meantime, however, KDS tells the other zones that those resources have been deleted, so the zonal CPs delete them from their own databases.

For example:

  • Stream cp-global <-> cp-zone1 is destroyed
  • Stream cp-global <-> cp-zone2: CP global noticed that Secrets X Y Z have been destroyed
  • Stream cp-global <-> cp-zone1 is back up
  • Stream cp-global <-> cp-zone2: CP global noticed that Secrets X Y Z have been created
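
The sequence above can be illustrated with a minimal sketch of a delta computation between two snapshots. This is not Kuma's implementation: the `snapshot` type, the `diff` function and the resource names are hypothetical, and the sketch simply assumes that the snapshot built for cp-zone2 is transiently missing resources while the cp-zone1 stream is down. The root cause of that is not established in this issue.

```go
package main

import (
	"fmt"
	"sort"
)

// snapshot maps a resource name to its version. It is a hypothetical,
// simplified stand-in for the per-type state a delta stream compares;
// it is not Kuma's actual data structure.
type snapshot map[string]string

// diff returns the names present in prev but missing from cur (removed)
// and the names that are new or have a different version in cur (upserted).
func diff(prev, cur snapshot) (removed, upserted []string) {
	for name := range prev {
		if _, ok := cur[name]; !ok {
			removed = append(removed, name)
		}
	}
	for name, version := range cur {
		if old, ok := prev[name]; !ok || old != version {
			upserted = append(upserted, name)
		}
	}
	// Sort so the result does not depend on map iteration order.
	sort.Strings(removed)
	sort.Strings(upserted)
	return removed, upserted
}

func main() {
	full := snapshot{"secret-x": "1", "secret-y": "1", "secret-z": "1"}
	// Assumption: while the cp-zone1 stream is down, the snapshot built
	// for cp-zone2 is transiently missing these resources.
	partial := snapshot{}

	removed, _ := diff(full, partial)
	fmt.Println("sent to cp-zone2 as removed:", removed)

	_, upserted := diff(partial, full)
	fmt.Println("sent to cp-zone2 as created:", upserted)
}
```

Under that assumption, any consumer of the delta has no way to distinguish a transiently incomplete snapshot from a real deletion, which would match the delete-then-recreate behaviour seen by the zonal CP.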

Here are some logs from a deployment with 6 zones. Timeline:

  • 13:46:30: was1 disconnects
  • 13:46:31: KDS logs "stream cancelled"
  • 13:46:32: KDS detects changes to Mesh in zones fra1 and sin1 (!) although nothing has changed
  • 13:46:34: was1 reconnects
  • Not included in the logs below: KDS later re-detects changes to Mesh for fra1 and sin1

2024-02-22T13:46:30.997Z        INFO    kds-delta-client        ZoneToGlobalSync rpc stream stopped     {"clientID": "was1"}
2024-02-22T13:46:30.997Z        INFO    kds-delta-client        GlobalToZoneSync rpc stream stopped     {"clientID": "was1"}
2024-02-22T13:46:31.000Z        INFO    kds-service     stream cancelled        {"rpc": "Stats", "clientID": "was1"}
2024-02-22T13:46:31.000Z        INFO    kds-service     stream cancelled        {"rpc": "Clusters", "clientID": "was1"}
2024-02-22T13:46:31.000Z        INFO    kds-service     stream cancelled        {"rpc": "XDS Config Dump", "clientID": "was1"}
2024-02-22T13:46:32.552Z        INFO    kds-delta-global        detected changes in the resources. Sending changes to the client.       {"streamID": 12, "nodeID": "fra1", "resourceType": "Mesh", "client": "fra1"}
2024-02-22T13:46:32.556Z        INFO    kds-delta-global        detected changes in the resources. Sending changes to the client.       {"streamID": 9, "nodeID": "sin1", "resourceType": "Mesh", "client": "sin1"}
2024-02-22T13:46:34.713Z        INFO    kds-service     Envoy Admin RPC stream started  {"rpc": "Stats", "clientID": "was1"}
2024-02-22T13:46:34.814Z        INFO    kds-delta-global        Global To Zone new session created      {"peer-id": "was1"}
2024-02-22T13:46:34.814Z        INFO    kds-service     Envoy Admin RPC stream started  {"rpc": "Clusters", "clientID": "was1"}
2024-02-22T13:46:34.917Z        INFO    kds-service     Envoy Admin RPC stream started  {"rpc": "XDS Config Dump", "clientID": "was1"}
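
To spot this pattern in a longer log file, a small script can correlate the two kinds of events. The sketch below is only an aid for triage and makes several assumptions: the log format is the one shown above (timestamp first, JSON fields last), the message substrings are the ones in these logs, and the 5 second window and the `suspicious` function name are arbitrary choices, not anything defined by Kuma.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"time"
)

// fields holds the JSON keys this sketch cares about; other keys are ignored.
type fields struct {
	ClientID     string `json:"clientID"`
	Client       string `json:"client"`
	ResourceType string `json:"resourceType"`
}

type event struct {
	at     time.Time
	fields fields
}

// parseLine extracts the leading timestamp and the trailing JSON object.
func parseLine(line string) (event, bool) {
	parts := strings.Fields(line)
	brace := strings.Index(line, "{")
	if len(parts) == 0 || brace < 0 {
		return event{}, false
	}
	at, err := time.Parse(time.RFC3339Nano, parts[0])
	if err != nil {
		return event{}, false
	}
	var f fields
	if err := json.Unmarshal([]byte(line[brace:]), &f); err != nil {
		return event{}, false
	}
	return event{at: at, fields: f}, true
}

// suspicious reports "detected changes" events sent to a zone other than the
// one whose stream stopped, when they occur within window after that stop.
func suspicious(lines []string, window time.Duration) []string {
	var stops []event
	var out []string
	for _, line := range lines {
		ev, ok := parseLine(line)
		if !ok {
			continue
		}
		switch {
		case strings.Contains(line, "rpc stream stopped"):
			stops = append(stops, ev)
		case strings.Contains(line, "detected changes in the resources"):
			for _, s := range stops {
				d := ev.at.Sub(s.at)
				if d >= 0 && d <= window && ev.fields.Client != s.fields.ClientID {
					out = append(out, fmt.Sprintf("%s change sent to %s, %s after %s stopped",
						ev.fields.ResourceType, ev.fields.Client, d, s.fields.ClientID))
					break // report each change event at most once
				}
			}
		}
	}
	return out
}

func main() {
	// Sample lines copied from the logs in this issue.
	sample := `2024-02-22T13:46:30.997Z INFO kds-delta-client GlobalToZoneSync rpc stream stopped {"clientID": "was1"}
2024-02-22T13:46:32.552Z INFO kds-delta-global detected changes in the resources. Sending changes to the client. {"streamID": 12, "nodeID": "fra1", "resourceType": "Mesh", "client": "fra1"}
2024-02-22T13:46:32.556Z INFO kds-delta-global detected changes in the resources. Sending changes to the client. {"streamID": 9, "nodeID": "sin1", "resourceType": "Mesh", "client": "sin1"}`
	for _, hit := range suspicious(strings.Split(sample, "\n"), 5*time.Second) {
		fmt.Println(hit)
	}
}
```

A hit is only a hint: a change detected shortly after another zone disconnects may also be legitimate, so each one still needs to be checked against what actually changed.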

logs-secret-destruction.txt

More details and logs here: https://kuma-mesh.slack.com/archives/CN2GN4HE1/p1708717249211629

Kuma version: 2.5.x

@nicoche added the kind/bug and triage/pending labels on Feb 29, 2024
@jakubdyszkiewicz added the triage/needs-reproducing label and removed the triage/pending label on Mar 4, 2024
@jakubdyszkiewicz self-assigned this on Mar 4, 2024
@lahabana
Contributor

lahabana commented Apr 11, 2024

@jakubdyszkiewicz didn't you mention a recent fix that may fix this?

@jakubdyszkiewicz
Contributor

jakubdyszkiewicz commented Apr 15, 2024

It may be related, but not necessarily. We had a problem where we retried a NACK only once.
Here is the PR: #9736

@jakubdyszkiewicz
Contributor

xref #10315

@jakubdyszkiewicz
Contributor

Triage: we were not able to reproduce this in 2.6.x. There have been changes in KDS that could help, so please try the newest version. A minimal reproduction would be useful.
Please let us know if this still happens with 2.6.x; we can reopen if needed.

@jakubdyszkiewicz closed this as not planned on Jun 10, 2024
@jakubdyszkiewicz added the triage/rotten label and removed the triage/needs-reproducing label on Jun 10, 2024