Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

keep ApiState pause/resume ref-counted, to make sure resume on the state is always called (always in pair). #11674

Closed
stevenzzzz opened this issue Jun 20, 2020 · 3 comments · Fixed by #12069
Assignees
Labels

Comments

@stevenzzzz
Copy link
Contributor

per Harvey, let's make the ApiState pause/resume operations always come in pair.
Currently there is no guarantee on when/whether the "resume" will be called after the ApiState has been "pause"d.

We should consider mechanism like RAII object which auto-resume the paused resource type when the object died.

@stevenzzzz
Copy link
Contributor Author

/assign stevenzzzz

@stevenzzzz
Copy link
Contributor Author

PR #11739 made an improvement on returning a Cleanup instance which resumes the paused resources on destruction.
This work well with most of the callsites in our codebase, i.e., the pause/resume happens in pair in a single main thread dispatcher event.
The only exception is in ClusterManagerImpl::updateClusterCount, this method is called by multiple functions in cluster manager impl, and the pause-resume may span across several main thread dispatcher events[1]. This is not ideal, as it is hard to tell when the pause happens and when will it be resumed. also a possible stuck eds response could stop CDS from been continued. we will sort out the right move there and see if we can achieve the "resume when out of pause scope" goal for all Pause callsites.

[1] https://github.com/envoyproxy/envoy/blob/master/source/common/upstream/cluster_manager_impl.cc#L806-L808

@htuch htuch removed the help wanted Needs help! label Jul 13, 2020
@htuch htuch assigned htuch and unassigned stevenzzzz Jul 13, 2020
@htuch
Copy link
Member

htuch commented Jul 13, 2020

I'll take this on in the context of #11877. I realized that we need to pause/resume the type URL undergoing update there, which means we definitely need ref counting (since we have existing pause/resumes nested inside config updates).

htuch added a commit to htuch/envoy that referenced this issue Jul 14, 2020
To fix envoyproxy#11877, we need to handle safely the case where two
watches point at the same resource, and a WatchMap onConfigUpdate() causes one watch to
remove the other watch during its invoked onConfigUpdate().

While working on this, it made sense to fix envoyproxy#11674,
avoiding spurious ClusterLoadAssignment discovery requests in the regression integration test.

Risk level: Medium (this has xDS wire-level implications).
Testing: New unit tests for pause/resume, regression unit and integration tests for watch map
  removal behaviors.

Fixes envoyproxy#11877 envoyproxy#11674

Signed-off-by: Harvey Tuch <htuch@google.com>
Co-authored-by: Sebastian Schepens <sebastian.schepens@mercadolibre.com>
Signed-off-by: Harvey Tuch <htuch@google.com>
htuch added a commit that referenced this issue Jul 24, 2020
…#12069)

To fix #11877, we need to handle safely the case where two
watches point at the same resource, and a WatchMap onConfigUpdate() causes one watch to
remove the other watch during its invoked onConfigUpdate().

While working on this, it made sense to fix #11674,
avoiding spurious ClusterLoadAssignment discovery requests in the regression integration test.

Risk level: Medium (this has xDS wire-level implications).
Testing: New unit tests for pause/resume, regression unit and integration tests for watch map
removal behaviors.

Fixes #11877 #11674

Signed-off-by: Harvey Tuch <htuch@google.com>
Co-authored-by: Sebastian Schepens <sebastian.schepens@mercadolibre.com>
KBaichoo pushed a commit to KBaichoo/envoy that referenced this issue Jul 30, 2020
…envoyproxy#12069)

To fix envoyproxy#11877, we need to handle safely the case where two
watches point at the same resource, and a WatchMap onConfigUpdate() causes one watch to
remove the other watch during its invoked onConfigUpdate().

While working on this, it made sense to fix envoyproxy#11674,
avoiding spurious ClusterLoadAssignment discovery requests in the regression integration test.

Risk level: Medium (this has xDS wire-level implications).
Testing: New unit tests for pause/resume, regression unit and integration tests for watch map
removal behaviors.

Fixes envoyproxy#11877 envoyproxy#11674

Signed-off-by: Harvey Tuch <htuch@google.com>
Co-authored-by: Sebastian Schepens <sebastian.schepens@mercadolibre.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
chaoqin-li1123 pushed a commit to chaoqin-li1123/envoy that referenced this issue Aug 7, 2020
…envoyproxy#12069)

To fix envoyproxy#11877, we need to handle safely the case where two
watches point at the same resource, and a WatchMap onConfigUpdate() causes one watch to
remove the other watch during its invoked onConfigUpdate().

While working on this, it made sense to fix envoyproxy#11674,
avoiding spurious ClusterLoadAssignment discovery requests in the regression integration test.

Risk level: Medium (this has xDS wire-level implications).
Testing: New unit tests for pause/resume, regression unit and integration tests for watch map
removal behaviors.

Fixes envoyproxy#11877 envoyproxy#11674

Signed-off-by: Harvey Tuch <htuch@google.com>
Co-authored-by: Sebastian Schepens <sebastian.schepens@mercadolibre.com>
Signed-off-by: chaoqinli <chaoqinli@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants