
Refactor connection with retries and backoff #78


vladimirvivien
Member

vladimirvivien commented Nov 30, 2018

This PR attempts to fix #76.
It retries connecting to the driver (within the same session) several times before giving up. The logic is as follows:

  • Within an apimachinery/wait.ExponentialBackoff condition, do up to 5 times:
    • Attempt to connect with DialContext:
      • Block until a connection is successful
      • Give up if --connection-timeout is reached
    • Evaluate any error from DialContext. If it is retryable:
      • Attempt to reconnect (based on error type)
    • If not, give up right away
    • If the number of attempts has been reached:
      • Give up, stop
    • Otherwise, try again

Return successful connection or error if one was generated
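
The steps above can be sketched as follows. This is a minimal, standard-library-only illustration and not the PR's code: it replaces apimachinery's wait.ExponentialBackoff with a hand-written loop and replaces the gRPC dial with a caller-supplied function. All names (retryConfig, connectWithRetry, failNTimes, isTemporary, and both error values) are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"time"
)

// retryConfig holds the knobs described in the list above. Hypothetical type.
type retryConfig struct {
	Steps   int           // maximum number of connection attempts
	Initial time.Duration // sleep before the second attempt
	Factor  float64       // multiplier applied to the sleep after each failure
	Timeout time.Duration // per-attempt limit, in the role of --connection-timeout
}

// errTemporary stands in for a retryable dial error in this sketch.
var errTemporary = errors.New("temporary dial failure")

// errAttemptsExhausted is returned when every attempt failed with a retryable error.
var errAttemptsExhausted = errors.New("connection attempts exhausted")

// connectWithRetry runs dial up to cfg.Steps times. A nil error ends the loop
// with success, a non-retryable error ends it immediately, and a retryable
// error triggers a growing sleep before the next attempt.
func connectWithRetry(cfg retryConfig, dial func(context.Context) error, retryable func(error) bool) error {
	delay := cfg.Initial
	for attempt := 1; attempt <= cfg.Steps; attempt++ {
		// Each attempt gets its own deadline.
		ctx, cancel := context.WithTimeout(context.Background(), cfg.Timeout)
		err := dial(ctx)
		cancel()
		if err == nil {
			return nil
		}
		if !retryable(err) {
			return err
		}
		if attempt < cfg.Steps {
			time.Sleep(delay)
			delay = time.Duration(float64(delay) * cfg.Factor)
		}
	}
	return errAttemptsExhausted
}

// failNTimes builds a fake dial function that fails n times with errTemporary
// and then succeeds; calls points at the number of invocations so far.
func failNTimes(n int) (dial func(context.Context) error, calls *int) {
	count := 0
	dial = func(context.Context) error {
		count++
		if count <= n {
			return errTemporary
		}
		return nil
	}
	return dial, &count
}

// isTemporary treats only errTemporary as retryable.
func isTemporary(err error) bool { return errors.Is(err, errTemporary) }
```

With `dial, _ := failNTimes(2)`, a call such as `connectWithRetry(retryConfig{Steps: 5, Initial: time.Second, Factor: 2, Timeout: 5 * time.Second}, dial, isTemporary)` returns nil on the third attempt. A real caller would pass a function that performs the actual dial and a predicate that inspects the error type.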

The --connection-timeout flag controls how long a single connection attempt lasts before it gives up and tries again. Because a failed attempt is retried, for 5 attempts in total, --connection-timeout should be set to a sensible value, something like 5 to 10 seconds. For instance, a 5-second timeout will produce the following total attempt time:

total connection attempt time ≈ 5 sec * 5 attempts * backoffFactor

If no connection is created within that time period, the code will stop.
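
The estimate above can be made concrete under explicit assumptions. The sketch below is not taken from the PR: it assumes every attempt runs for the full timeout, and that the sleep between attempts starts at a hypothetical initial value and is multiplied by the backoff factor after each failure. The function and parameter names are illustrative.

```go
package main

import "time"

// worstCaseConnectTime estimates an upper bound on the total time spent
// connecting: each of the steps attempts uses the full timeout, and the
// sleeps between attempts are initial, initial*factor, initial*factor^2, ...
// (assumed model, with no sleep after the final attempt).
func worstCaseConnectTime(timeout, initial time.Duration, factor float64, steps int) time.Duration {
	total := time.Duration(0)
	sleep := initial
	for attempt := 1; attempt <= steps; attempt++ {
		total += timeout
		if attempt < steps {
			total += sleep
			sleep = time.Duration(float64(sleep) * factor)
		}
	}
	return total
}
```

For a 5-second timeout, 5 attempts, an assumed 1-second initial sleep, and a factor of 2, `worstCaseConnectTime(5*time.Second, time.Second, 2, 5)` evaluates to 40 seconds: 25 seconds of dialing plus 1 + 2 + 4 + 8 seconds of sleeping.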

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 30, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vladimirvivien

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 30, 2018
@pohly
Contributor

pohly commented Nov 30, 2018

It's not obvious to me at all how this will handle the situation where the CSI driver hasn't created the csi.socket yet. Does that go to the exponential backoff, or will it be treated as a permanent error so that the code returns immediately?

This probably needs a unit test.

@vladimirvivien
Member Author

@pohly thanks for taking a look.

Yeah I just added/fixed the test. If that is not enough I can add more.

I looked into the gRPC code to understand how gRPC connections behave. When blocking is enabled and DialContext is used, the code waits up to the timeout for the connection to be established (that is, for the socket file to be present). The backoff strategy just retries several times until it gives up, which forces k8s to relaunch the container.
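
The blocking behavior described here can be approximated with the standard library alone. The sketch below is an emulation, not gRPC's implementation and not the PR's code: it polls a Unix socket path with net.Dialer.DialContext until the connection succeeds or the timeout expires. The function name dialBlocking and the polling interval parameter are assumptions made for illustration.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// dialBlocking keeps trying to connect to the Unix socket at path until a
// connection succeeds or timeout elapses, retrying every interval.
// interval must be positive, because time.NewTicker panics otherwise.
func dialBlocking(path string, timeout, interval time.Duration) (net.Conn, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	var d net.Dialer
	for {
		conn, err := d.DialContext(ctx, "unix", path)
		if err == nil {
			return conn, nil
		}
		select {
		case <-ctx.Done():
			// The deadline passed before the socket accepted a connection.
			return nil, fmt.Errorf("socket %q not ready after %v: %w", path, timeout, err)
		case <-ticker.C:
			// The socket file may not exist yet; try again.
		}
	}
}
```

A call such as `dialBlocking("/path/to/csi.sock", 5*time.Second, 100*time.Millisecond)` (placeholder path) returns a connection as soon as the driver starts listening, or an error once the 5 seconds are up. An outer retry loop would then decide whether to try again.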

@vladimirvivien
Member Author

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 30, 2018
@saad-ali
Member

Let's move this PR to https://github.com/kubernetes-csi/node-driver-registrar since this repo is deprecated.

@lpabon
Member

lpabon commented Jan 9, 2019

🛂 ⛔️
Hi, this repo has been split up into two repos:

This repo has been closed, and no new changes will be accepted. Please move your content to one of these repos.

Thank you,

@lpabon lpabon closed this Jan 9, 2019
@pohly
Contributor

pohly commented Jan 10, 2019

Here's a renewed effort by @darkowlzz to get the gRPC code enhanced: kubernetes-csi/csi-lib-utils#8

Development

Successfully merging this pull request may close these issues.

Driver registrar can sometimes take 1 minute to start