Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xds/ringhash: make reconnection logic work for a single subConn #5601

Merged
merged 4 commits into from
Aug 26, 2022

Conversation

easwars
Copy link
Contributor

@easwars easwars commented Aug 22, 2022

Summary of changes:

  • Always send a picker update (even if the ring hasn't changed) when the LB policy receives a subConn update.
    • Earlier, we were only sending a picker update when the effective state of the subConn changed. Current change helps with re-queueing RPCs in cases where:
      • there is a single subConn, and/or
      • subConn move from TRANSIENT_FAILURE to a non READY state.
  • Regenerate the picker when the subConn moves to READY. This is required when the aggregate state of the channel is changing from TRANSIENT_FAILURE to READY, and we need to overwrite the earlier error picker that we had pushed.
  • Ensure that the LB policy is attempting to reconnect (when the aggregate state of the channel is TRANSIENT_FAILURE), even if there is a single subConn.

Fixes #5594

RELEASE NOTES:

  • xds/ringhash: fix a bug which sometimes prevents the LB policy from retrying connection attempts

@easwars easwars requested a review from dfawley August 22, 2022 21:25
@easwars easwars modified the milestones: 1.50 Release, 1.49 Release Aug 22, 2022
@zasweq zasweq modified the milestones: 1.49 Release, 1.50 Release Aug 24, 2022
@dfawley dfawley changed the title ringhash: make reconnect logic work for a single subConn xds/ringhash: make reconnection logic work for a single subConn Aug 24, 2022
@dfawley
Copy link
Member

dfawley commented Aug 24, 2022

Always send a picker update

On the other hand, this will cause RPC churn if the effective state of the ring hash hasn't changed as a result of the subconn state change. I.e. if the transition is from TF to IDLE, we can kick the subconn into CONNECTING (presumably that's the right thing to do) but not produce a new picker, because sticky-TF means that nothing has changed. If it doesn't add too much complexity to the code and isn't too much work, it would be nice to suppress updates where we know they are unnecessary.

@@ -161,11 +163,13 @@ func (sc *subConn) queueConnect() {
defer sc.mu.Unlock()
sc.attemptingToConnect = true
if sc.state == connectivity.Idle {
sc.logger.Infof("Executing a queued connect for subConn in state: %v", sc.state)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logging the exact same text here could make it hard to distinguish which operation is occurring. Did we just queue the connect attempt, or was it queued when the subconn went IDLE? Maybe this distinction is unimportant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The connection attempt is queued (by calling queueConnect() method) from UpdateSubConnState() when we want to start connecting on the next available subConn.

But this queueConnect() method actually starts connecting only if the subConn is in IDLE. For all other states, it set the connectQueued bit to true, and when that subConn eventually transitions to IDLE, the connect attempt is made. See setState() which is called from UpdateSubConnState().

I think the distinction is not important here, whether the connection attempt happened right when it was queued or when it eventually moved to IDLE. What do you think?

xds/internal/balancer/ringhash/ringhash.go Show resolved Hide resolved
xds/internal/balancer/ringhash/ringhash.go Outdated Show resolved Hide resolved
xds/internal/balancer/ringhash/ringhash.go Outdated Show resolved Hide resolved
@dfawley dfawley assigned easwars and unassigned dfawley Aug 24, 2022
@easwars
Copy link
Contributor Author

easwars commented Aug 24, 2022

Always send a picker update

On the other hand, this will cause RPC churn if the effective state of the ring hash hasn't changed as a result of the subconn state change. I.e. if the transition is from TF to IDLE, we can kick the subconn into CONNECTING (presumably that's the right thing to do) but not produce a new picker, because sticky-TF means that nothing has changed. If it doesn't add too much complexity to the code and isn't too much work, it would be nice to suppress updates where we know they are unnecessary.

Undid these changes after an offline discussion.

Copy link
Contributor Author

@easwars easwars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review.

@@ -161,11 +163,13 @@ func (sc *subConn) queueConnect() {
defer sc.mu.Unlock()
sc.attemptingToConnect = true
if sc.state == connectivity.Idle {
sc.logger.Infof("Executing a queued connect for subConn in state: %v", sc.state)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The connection attempt is queued (by calling queueConnect() method) from UpdateSubConnState() when we want to start connecting on the next available subConn.

But this queueConnect() method actually starts connecting only if the subConn is in IDLE. For all other states, it set the connectQueued bit to true, and when that subConn eventually transitions to IDLE, the connect attempt is made. See setState() which is called from UpdateSubConnState().

I think the distinction is not important here, whether the connection attempt happened right when it was queued or when it eventually moved to IDLE. What do you think?

xds/internal/balancer/ringhash/ringhash.go Show resolved Hide resolved
xds/internal/balancer/ringhash/ringhash.go Outdated Show resolved Hide resolved
xds/internal/balancer/ringhash/ringhash.go Outdated Show resolved Hide resolved
@easwars easwars assigned dfawley and unassigned easwars Aug 24, 2022
Comment on lines 113 to 117
for ; ctx.Err() == nil; <-time.After(defaultTestShortTimeout) {
if cc.GetState() == connectivity.Idle {
break
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code golf:

for ; ctx.Err() == nil && cc.GetState != connectivity.Idle; <-time.After(...) {}

But why not use the connectivity state API? :

	for state := cc.GetState(); state != connectivity.Idle && cc.WaitForStateChange(ctx, state); state = cc.GetState() {}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@dfawley dfawley assigned easwars and unassigned dfawley Aug 26, 2022
@easwars easwars closed this Aug 26, 2022
@easwars easwars reopened this Aug 26, 2022
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

XDS - client connection failure and retries
3 participants