Node driver registrar says "Lost connection" and do nothing #139

jim3ma · 2021-03-26T07:24:13Z

Log:

I0323 21:20:55.915743       1 main.go:110] Version: unknown
I0323 21:20:55.915816       1 main.go:120] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0323 21:20:55.915833       1 connection.go:151] Connecting to unix:///csi/csi.sock
I0323 21:20:55.916332       1 main.go:127] Calling CSI driver to discover driver name
I0323 21:20:55.916348       1 connection.go:180] GRPC call: /csi.v1.Identity/GetPluginInfo
I0323 21:20:55.916354       1 connection.go:181] GRPC request: {}
I0323 21:20:55.917907       1 connection.go:183] GRPC response: {"name":"x","vendor_version":"0.3.0"}
I0323 21:20:55.918413       1 connection.go:184] GRPC error: <nil>
I0323 21:20:55.918421       1 main.go:137] CSI driver name: "x"
I0323 21:20:55.918477       1 node_register.go:58] Starting Registration Server at: /registration/x-reg.sock
I0323 21:20:55.918603       1 node_register.go:67] Registration Server started at: /registration/x-reg.sock
I0323 21:20:57.567609       1 main.go:77] Received GetInfo call: &InfoRequest{}
I0323 21:20:57.671681       1 main.go:87] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
E0323 21:21:15.690492       1 connection.go:129] Lost connection to unix:///csi/csi.sock.

After restart container, it works.

The text was updated successfully, but these errors were encountered:

a1e11275 Merge pull request #139 from pohly/kind-for-kubernetes-latest 1c0fb096 prow.sh: use KinD main for latest Kubernetes 1d77cfcb Merge pull request #138 from pohly/kind-update-0.10 bff2fb7e prow.sh: KinD 0.10.0 git-subtree-dir: release-tools git-subtree-split: a1e11275b5a4febd6ad21beeac730e22c579825b

fejta-bot · 2021-06-24T07:36:05Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

aramase · 2021-07-12T17:19:57Z

/remove-lifecycle stale

toshipp · 2021-10-07T07:41:29Z

To restart the container automatically, #152 would help you.

yatanasov-hs · 2021-10-25T07:03:55Z

I am experiencing the same behavior. Given all 3 containers (node-driver-registrar, secrets-store and liveness-probe) are running, if secrets-store gets restarted, node-driver-registrar loses connection to unix:///csi/csi.sock.

Node-driver-registrar logs:

I1012 11:51:22.299487 1 main.go:164] Version: v2.3.0
I1012 11:51:22.299515 1 main.go:165] Running node-driver-registrar in mode=registration
I1012 11:51:22.299913 1 main.go:189] Attempting to open a gRPC connection with: "/csi/csi.sock"
I1012 11:51:22.299936 1 connection.go:154] Connecting to unix:///csi/csi.sock
I1012 11:51:22.300436 1 main.go:196] Calling CSI driver to discover driver name
I1012 11:51:22.300451 1 connection.go:183] GRPC call: /csi.v1.Identity/GetPluginInfo
I1012 11:51:22.300456 1 connection.go:184] GRPC request: {}
I1012 11:51:22.303030 1 connection.go:186] GRPC response: {"name":"secrets-store.csi.k8s.io","vendor_version":"v0.3.0"}
I1012 11:51:22.303071 1 connection.go:187] GRPC error:
I1012 11:51:22.303078 1 main.go:206] CSI driver name: "secrets-store.csi.k8s.io"
I1012 11:51:22.303115 1 node_register.go:52] Starting Registration Server at: /registration/secrets-store.csi.k8s.io-reg.sock
I1012 11:51:22.303301 1 node_register.go:61] Registration Server started at: /registration/secrets-store.csi.k8s.io-reg.sock
I1012 11:51:22.303447 1 node_register.go:91] Skipping healthz server because HTTP endpoint is set to: ""
I1012 11:51:23.540046 1 main.go:100] Received GetInfo call: &InfoRequest{}
I1012 11:51:23.540214 1 main.go:107] "Kubelet registration probe created" path="/var/lib/kubelet/plugins/csi-secrets-store/registration"
I1012 11:51:23.577520 1 main.go:118] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
E1012 11:51:25.218944 1 connection.go:132] Lost connection to unix:///csi/csi.sock.

Secret store logs:

I1012 11:53:06.149183 1 exporter.go:33] metrics backend: prometheus
I1012 11:53:06.205017 1 main.go:169] "provider health check enabled" interval="1m0s"
I1012 11:53:06.205636 1 secrets-store.go:40] "Initializing Secrets Store CSI Driver" driver="secrets-store.csi.k8s.io" version="v0.3.0" buildTime="2021-08-30-18:38"
I1012 11:53:06.205852 1 server.go:117] "Listening for connections" address="//csi/csi.sock"
I1012 11:53:06.205923 1 main.go:174] starting manager
I1012 11:53:06.206014 1 reconciler.go:131] starting rotation reconciler with poll interval: 10s

Liveness probe logs:

I1012 11:51:22.555987 I1012 11:51:22.558382 I1012 11:51:22.558401 E1012 11:52:05.885988 W1012 11:52:12.886013 E1012 11:52:20.885852 W1012 11:52:22.885321 W1012 11:52:27.886239 W1012 11:52:32.885371 E1012 11:52:35.885665 W1012 11:52:37.886239 W1012 11:52:42.885652 W1012 11:52:42.885679 W1012 11:52:47.885302 E1012 11:52:50.885540 W1012 11:52:52.885269 W1012 11:52:52.886357 W1012 11:52:57.885542 W1012 11:52:57.885565 W1012 11:53:02.885251 W1012 11:53:02.885407 E1012 11:53:05.885994 1 main.go:149] calling CSI driver to discover driver name
1 main.go:155] CSI driver name: "secrets-store.csi.k8s.io"
1 main.go:183] ServeMux listening at "0.0.0.0:9808"
1 main.go:64] failed to establish connection to CSI driver: context deadline exceeded
1 connection.go:173] Still connecting to unix:///csi/csi.sock
1 main.go:64] failed to establish connection to CSI driver: context deadline exceeded
1 connection.go:173] Still connecting to unix:///csi/csi.sock
1 connection.go:173] Still connecting to unix:///csi/csi.sock
1 connection.go:173] Still connecting to unix:///csi/csi.sock
1 main.go:64] failed to establish connection to CSI driver: context deadline exceeded
1 connection.go:173] Still connecting to unix:///csi/csi.sock
1 connection.go:173] Still connecting to unix:///csi/csi.sock
1 connection.go:173] Still connecting to unix:///csi/csi.sock
1 connection.go:173] Still connecting to unix:///csi/csi.sock
1 main.go:64] failed to establish connection to CSI driver: context deadline exceeded
1 connection.go:173] Still connecting to unix:///csi/csi.sock
1 connection.go:173] Still connecting to unix:///csi/csi.sock
1 connection.go:173] Still connecting to unix:///csi/csi.sock
1 connection.go:173] Still connecting to unix:///csi/csi.sock
1 connection.go:173] Still connecting to unix:///csi/csi.sock
1 connection.go:173] Still connecting to unix:///csi/csi.sock
1 main.go:64] failed to establish connection to CSI driver: context deadline exceeded

To restart the container automatically, #152 would help you.

That does not help, already have this:

name: node-driver-registrar
image: "..."
args:
- --v=5
- --csi-address=/csi/csi.sock
- --kubelet-registration-path=/var/lib/kubelet/plugins/csi-secrets-store/csi.sock
livenessProbe:
exec:
command:
- /csi-node-driver-registrar
- --kubelet-registration-path=/var/lib/kubelet/plugins/csi-secrets-store/csi.sock
- --mode=kubelet-registration-probe
initialDelaySeconds: 3

mauriciopoppe · 2021-10-25T18:28:34Z

To restart the container automatically, #152 would help you.

This helps during the startup of node-driver-registrar if the kubelet plugin registration fails but these logs show that the kubelet plugin registration succeeded so the workaround in #152 can't help here:

I0323 21:20:57.567609       1 main.go:77] Received GetInfo call: &InfoRequest{}
I0323 21:20:57.671681       1 main.go:87] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}

@yatanasov-hs thanks for the logs, I see that liveness probe attempted to connect after node-driver-registrar logged lost connection, looking at the timestamps:

# node-driver-registrar
E1012 11:51:25.218944 1 connection.go:132] Lost connection to unix:///csi/csi.sock

# livenessprobe
E1012 11:52:05.885988 1 main.go:64] failed to establish connection to CSI driver: context deadline exceeded
W1012 11:52:12.886013 1 connection.go:173] Still connecting to unix:///csi/csi.sock

The error Lost connection to {{address}} comes from https://github.com/kubernetes-csi/csi-lib-utils/blob/a2ccb594bb74b61a0655d3431c6365a6a567c7af/connection/connection.go?#L132, which by default will attempt to reconnect, I think I can add a log after a successful dial to rule out the dial timeout.

Both node-driver-registrar or livenessprobe should be able to reestablish the connection to csi.sock if they lost it so still not sure why they couldn't.

chelobarreto · 2021-10-26T11:55:13Z

I'm having same problem after upgrade okd 4.7 to 4.8.

k8s-triage-robot · 2022-01-24T12:11:43Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2022-02-23T13:09:31Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · 2022-03-25T13:46:17Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue or PR with /reopen
Mark this issue or PR as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot · 2022-03-25T13:46:19Z

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied

After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied

After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue or PR with /reopen

Mark this issue or PR as fresh with /remove-lifecycle rotten

Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

fradeve · 2023-05-31T14:54:36Z

Hi, I happened to see the same behaviour on k8s 1.24 and quay.io/k8scsi/csi-node-driver-registrar:v1.2.0, any idea of what might be causing this?

mauriciopoppe · 2023-06-01T04:20:29Z

/reopen
/remove-lifecycle rotten
/lifecycle frozen
/kind bug

k8s-ci-robot · 2023-06-01T04:20:31Z

@mauriciopoppe: Reopened this issue.

In response to this:

/reopen
/remove-lifecycle rotten
/lifecycle frozen
/kind bug

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

BertelBB · 2023-10-26T13:15:49Z

I got the same issue on AKS cluster in version 1.27.3. The csi-azurefile-node pods started failing on one of the node pools (1 of 3 node pools). Still looking into what caused it but best guess atm is something failing to startup properly after node upgrade during automatic maintenance window.

Rohlik · 2023-11-15T09:46:04Z

I am also observing this on our AWS EKS 1.27.4. Exactly the same output as the comment above.

basaveswar-kureti · 2023-12-13T13:48:50Z

@gnufied Does this PR #322 fixes the current(#139) issue ?

gnufied · 2023-12-13T17:57:16Z

It should help with yes.

basaveswar-kureti · 2023-12-13T18:03:52Z

Thanks @gnufied for the confirmation.
We tested this internally on a bunch of our k8s clusters which is using csi-node-driver-registrar version of v2.9.2 (so anything >v2.9.0) has not seen this issue.

mauriciopoppe · 2023-12-13T18:42:30Z

Thanks @gnufied, I wanted to expand on the effect of #322 (released in https://github.com/kubernetes-csi/node-driver-registrar/releases/tag/v2.9.0) and this issue.

The possible scenario where we might see errors like #139 (comment) or #139 (comment):

E1012 11:52:05.885988 1 main.go:64] failed to establish connection to CSI driver: context deadline exceeded
W1012 11:52:12.886013 1 connection.go:173] Still connecting to unix:///csi/csi.sock
E1012 11:51:25.218944 1 connection.go:132] Lost connection to unix:///csi/csi.sock.

Is if the csi.sock owned by the CSI Driver node component suddenly dies which could happen because the CSI Driver container restarted. Let's assume that the CSI Driver node component was restarted, the usual behavior for node-driver-registrar is to show Lost Connection to ... and to attempt reconnection, however it looks like after the restart, the CSI Driver node component creates the socket and node-driver-registrar can see the socket file available but can't connect to it.

#322 is adding a throughout check on the /healthz endpoint so that in addition to checking the existence of the socket file node-driver-registrar makes a successful CSI call (NodeGetInfo) against the socket, if the call fails consistently then eventually the container will be restarted. Note that for this workaround the deployment manifests must enable a livenessProbe probing the /healthz endpoint, for more info about the /healthz setup please check this example in the README.

gnufied · 2024-01-10T19:26:49Z

@mauriciopoppe sorry for late response, but the way I have seen this issue to happen is:

CSI driver pod and registrar are running per-usual.
Something causes another registrar container to be created (In my case, this happened because someone was trying to debug the driver-pod and created parallel container)
When this happens, the new registrar creates a new registration socket file (which replaces old one, created in step#1) and since driver is still running, kubelet sees this new registration socket and everything works fine.
Now when this debug container is terminated, it just leaves the registration socket file. But nobody is listening on it and since registrar doesn't have a mechanism of retrying, kubelet sees driver as removed.

So #322 mainly fixes issue with stale registration sockets, assuming health check is enabled on registration socket.

Now, I know that this issue has bugs linked where csi-driver socket is stale, but in current iteration of node-registrar, we only connect to csi-driver socket once on startup. We are in-fact closing the connection after retrieving driver-name and if for some reason node registrar can't connect to csi driver socket, it will simply exit on startup and restart itself. But as such - #322 does not fixes stale csi driver sockets issues.

pohly mentioned this issue Mar 31, 2021

master: update release-tools #138

Merged

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 24, 2021

aramase mentioned this issue Jul 12, 2021

secrets-store dropping connection to registrar, liveness-probe after 30-45sec of deployment, only works correctly after container is restarted kubernetes-sigs/secrets-store-csi-driver#620

Closed

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 12, 2021

toshipp mentioned this issue Oct 7, 2021

Restart csi-registrar if registration is failed. topolvm/topolvm#374

Merged

ramineni mentioned this issue Oct 26, 2021

[cinder-csi-plugin] node plugin pods restart with health check failing kubernetes/cloud-provider-openstack#1674

Closed

toshipp mentioned this issue Nov 4, 2021

Support Kubernetes v1.22 topolvm/topolvm#394

Merged

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 24, 2022

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 23, 2022

k8s-ci-robot closed this as completed Mar 25, 2022

toshipp mentioned this issue Aug 15, 2022

CSI registrar fails to get Node info topolvm/topolvm#548

Closed

k8s-ci-robot reopened this Jun 1, 2023

k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Jun 1, 2023

zxh326 mentioned this issue Apr 3, 2024

Upgrade default csi sidecar image juicedata/charts#97

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Node driver registrar says "Lost connection" and do nothing #139

Node driver registrar says "Lost connection" and do nothing #139

jim3ma commented Mar 26, 2021 •

edited

Loading

fejta-bot commented Jun 24, 2021

aramase commented Jul 12, 2021

toshipp commented Oct 7, 2021

yatanasov-hs commented Oct 25, 2021

mauriciopoppe commented Oct 25, 2021

chelobarreto commented Oct 26, 2021

k8s-triage-robot commented Jan 24, 2022

k8s-triage-robot commented Feb 23, 2022

k8s-triage-robot commented Mar 25, 2022

k8s-ci-robot commented Mar 25, 2022

fradeve commented May 31, 2023

mauriciopoppe commented Jun 1, 2023

k8s-ci-robot commented Jun 1, 2023

BertelBB commented Oct 26, 2023

Rohlik commented Nov 15, 2023 •

edited

Loading

basaveswar-kureti commented Dec 13, 2023

gnufied commented Dec 13, 2023

basaveswar-kureti commented Dec 13, 2023

mauriciopoppe commented Dec 13, 2023 •

edited

Loading

gnufied commented Jan 10, 2024 •

edited

Loading

Node driver registrar says "Lost connection" and do nothing #139

Node driver registrar says "Lost connection" and do nothing #139

Comments

jim3ma commented Mar 26, 2021 • edited Loading

fejta-bot commented Jun 24, 2021

aramase commented Jul 12, 2021

toshipp commented Oct 7, 2021

yatanasov-hs commented Oct 25, 2021

mauriciopoppe commented Oct 25, 2021

chelobarreto commented Oct 26, 2021

k8s-triage-robot commented Jan 24, 2022

k8s-triage-robot commented Feb 23, 2022

k8s-triage-robot commented Mar 25, 2022

k8s-ci-robot commented Mar 25, 2022

fradeve commented May 31, 2023

mauriciopoppe commented Jun 1, 2023

k8s-ci-robot commented Jun 1, 2023

BertelBB commented Oct 26, 2023

Rohlik commented Nov 15, 2023 • edited Loading

basaveswar-kureti commented Dec 13, 2023

gnufied commented Dec 13, 2023

basaveswar-kureti commented Dec 13, 2023

mauriciopoppe commented Dec 13, 2023 • edited Loading

gnufied commented Jan 10, 2024 • edited Loading

jim3ma commented Mar 26, 2021 •

edited

Loading

Rohlik commented Nov 15, 2023 •

edited

Loading

mauriciopoppe commented Dec 13, 2023 •

edited

Loading

gnufied commented Jan 10, 2024 •

edited

Loading