Backport of Fix xDS missing endpoint race condition. into release/1.16.x #19873

hc-github-team-consul-core

Backport

This PR is auto-generated from #19866 to be assessed for backporting due to the inclusion of the label backport/1.16.

🚨

Warning: automatic cherry-pick of commits failed. If the first commit failed,
you will see a blank no-op commit below. If at least one commit succeeded, you
will see the cherry-picked commits up to, not including, the commit where
the merge conflict occurred.

The person who merged in the original PR is:
@hashi-derek
This person should manually cherry-pick the original PR into a new backport PR,
and close this one when the manual backport PR is merged in.

merge conflict error: POST https://api.github.com/repos/hashicorp/consul/merges: 409 Merge conflict []

The below text is copied from the body of the original PR.


The following PR is mostly a clone of work done by @ksmiley with some minor tweaks. I would like to thank him for tracking down and describing this complicated situation in such great detail. His work is greatly appreciated.

See the following issues for more context:

#17640
#17641

This fixes the following race condition:

  • Send update endpoints
  • Send update cluster
  • Recv ACK endpoints
  • Recv ACK cluster

Prior to this fix, this sequence left the endpoints missing from Envoy. The cluster update implicitly clears the endpoints in Envoy, but we never re-sent the endpoint data to compensate for the loss, because we incorrectly recorded the old, now-invalid endpoint hash as ACKed. Since the endpoints' hash had not actually changed, they were never resent.

The fix is to clear out the invalid pending ACKs for child resources whenever the parent changes. This ensures that we do not store the child's hash as accepted when the race occurs.
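The race and the fix can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions: the `state` type, its fields, and the resource names are hypothetical and are not taken from the Consul source. It models a server that tracks a pending hash and an accepted hash per resource, and shows how dropping the child's pending ACK when the parent changes causes the endpoints to be resent.

```go
package main

import "fmt"

// state is a hypothetical, simplified model of per-stream xDS bookkeeping.
// None of these names come from the Consul code base.
type state struct {
	accepted map[string]string   // resource name -> hash recorded as ACKed
	pending  map[string]string   // resource name -> hash sent, awaiting ACK
	children map[string][]string // parent resource -> child resource names
}

func newState() *state {
	return &state{
		accepted: map[string]string{},
		pending:  map[string]string{},
		children: map[string][]string{},
	}
}

// send records that a resource version was sent and is awaiting an ACK.
func (s *state) send(name, hash string) {
	s.pending[name] = hash
}

// sendParent records a parent update. When clearChildren is true (the fixed
// behavior in this sketch), pending and accepted entries for the children
// are dropped, so a late ACK for a child cannot mark a stale hash as accepted.
func (s *state) sendParent(name, hash string, clearChildren bool) {
	s.pending[name] = hash
	if !clearChildren {
		return
	}
	for _, child := range s.children[name] {
		delete(s.pending, child)
		delete(s.accepted, child)
	}
}

// ack promotes a pending hash to accepted. An ACK with no pending entry is ignored.
func (s *state) ack(name string) {
	if hash, ok := s.pending[name]; ok {
		s.accepted[name] = hash
		delete(s.pending, name)
	}
}

// needsSend reports whether the resource must be sent again, that is,
// whether the desired hash differs from the accepted one.
func (s *state) needsSend(name, hash string) bool {
	return s.accepted[name] != hash
}

// runRace replays the four steps of the race and reports whether the
// endpoints would be resent afterwards.
func runRace(clearChildren bool) bool {
	s := newState()
	s.children["cluster:web"] = []string{"endpoints:web"}

	s.send("endpoints:web", "e1")                    // Send update endpoints
	s.sendParent("cluster:web", "c2", clearChildren) // Send update cluster
	s.ack("endpoints:web")                           // Recv ACK endpoints
	s.ack("cluster:web")                             // Recv ACK cluster

	return s.needsSend("endpoints:web", "e1")
}

func main() {
	fmt.Println("old behavior resends endpoints:", runRace(false))
	fmt.Println("fixed behavior resends endpoints:", runRace(true))
}
```

In this model the old behavior reports `false` (the stale endpoint hash is stored as accepted, so nothing is resent), while the fixed behavior reports `true`.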

An escape-hatch environment variable XDS_PROTOCOL_LEGACY_CHILD_RESEND was added so that users can revert to the legacy behavior if this change produces unexpected side effects. See the following thread for extra context on why certainty around these race conditions is difficult: envoyproxy/envoy#13009



hashicorp-cla commented Dec 8, 2023

CLA assistant check
All committers have signed the CLA.

Auto approved Consul Bot automated PR

@vercel vercel bot temporarily deployed to Preview – consul December 8, 2023 17:43 Inactive
@hashi-derek hashi-derek force-pushed the backport/derekm/NET-6565/resend-endpoints/probably-engaged-jay branch from e24dcb1 to 1a9a9fc Compare December 8, 2023 17:44
This fixes the following race condition:
- Send update endpoints
- Send update cluster
- Recv ACK endpoints
- Recv ACK cluster

Prior to this fix, it would have resulted in the endpoints NOT existing in
Envoy. This occurred because the cluster update implicitly clears the endpoints
in Envoy, but we would never re-send the endpoint data to compensate for the
loss, because we would incorrectly ACK the invalid old endpoint hash. Since the
endpoint's hash did not actually change, they would not be resent.

The fix for this is to effectively clear out the invalid pending ACKs for child
resources whenever the parent changes. This ensures that we do not store the
child's hash as accepted when the race occurs.

An escape-hatch environment variable `XDS_PROTOCOL_LEGACY_CHILD_RESEND` was
added so that users can revert back to the old legacy behavior in the event
that this produces unknown side-effects.

This bug report and fix was mostly implemented by @ksmiley with some minor
tweaks.

Co-authored-by: Keith Smiley <ksmiley@salesforce.com>
@hashi-derek hashi-derek force-pushed the backport/derekm/NET-6565/resend-endpoints/probably-engaged-jay branch from 1a9a9fc to b3b5b44 Compare December 8, 2023 17:50
@hashi-derek hashi-derek marked this pull request as ready for review December 8, 2023 18:16
@hashi-derek hashi-derek merged commit 2fd61e7 into release/1.16.x Dec 8, 2023
83 checks passed
@hashi-derek hashi-derek deleted the backport/derekm/NET-6565/resend-endpoints/probably-engaged-jay branch December 8, 2023 18:16