session.Done() returns even when all etcd servers in a cluster are online #8181

sangleganesh · 2017-06-27T18:42:43Z

We followed the code in #6699 and are using concurrency.Session to detect loss of connectivity to all etcd servers.

The way we do it is to start a watch and in the watch function start a Session.
The assumption is that if all etcd servers in a cluster go down Session.Done() returns indicating loss of connection to etcd and we exit the process.

In the following code, sometime (not always) as soon as the process restarts (about 3-8 seconds), session.Done() returns an error despite the fact the all servers in the etcd cluster are alive.
How can I log more information about it - the reason Done was returned ?
Can it be called if etcd client switches connection to some other etcd server in cluster ?

Note that we moved the etcd client code to 3.2.0 release so that it could do load balancing (and automatically move the connections without disrupting the watches) to other etcd servers when an etcd server goes down.

Any help is greatly appreciated.

The etcd server version is: etcdv3.03
client library version: 3.2.0

func watchFn(key string) error {

session, err := concurrency.NewSession(
        kvClient,
        concurrency.WithTTL(120))
    if err != nil {
        logrus.Errorf("Failed to establish session for etcd client watch: %v", err)
        return err
    }


 ctx, watchCancel := context.WithCancel(context.Background())
    watchRet := make(chan error)
    watchChan := et.kvClient.Watch(ctx, key, opts...)
    go func() {
        for wresp := range watchChan {
            if wresp.Created == true {
                continue
            }
            if wresp.Canceled == true {
                // Watch is canceled. Notify the watcher
                logrus.Errorf("Watch on key %v cancelled. Error: %v", key,
                    wresp.Err())
                // return error
            } else {
                for _, ev := range wresp.Events {
                // handle watch update
                }
            }
        }
    }()

     select {
    case <-session.Done(): // closed by etcd
        // Close the context
        watchCancel()
        // Indicate the caller that watch has been canceled
        logrus.Errorf("Watch closing session")
    case err := <-watchRet: // error in watcher
        logrus.Errorf("Watch for %v stopped: %v", key, err)
    }

   return nil
}

Opening an issue as suggested in this forum:
https://groups.google.com/forum/?hl=en#!topic/etcd-dev/0qXtLqRfLTk

The text was updated successfully, but these errors were encountered:

heyitsanthony · 2017-07-13T20:45:41Z

How can I log more information about it - the reason Done was returned ?

The channel for Done() closes if the client does not receive a keepalive response from the etcd server within the TTL limit.

Can it be called if etcd client switches connection to some other etcd server in cluster ?

If it switches to a member that is partitioned from the rest of the cluster then it could possibly time out waiting for keepalives.

The etcd server version is: etcdv3.03

3.0.3? There have been some lease fixes to 3.0.x since then:
395bd23 (possibly hitting this one)
2e92779
07e421d
842145e

sangleganesh · 2017-07-14T01:16:08Z

thanks @heyitsanthony

The Done() was called within 3-8 seconds - (and the TTL was 120 (120 seconds)).
The etcd cluster was healthy when session was closed
Will retry with latest versions of etcd server.

gyuho · 2017-12-22T02:35:52Z

@sangleganesh Any updates?

sangleganesh · 2017-12-22T02:44:45Z

we just worked around the issue. closing the bug.

xiang90 · 2017-12-22T02:49:29Z

@sangleganesh I am not happy with workaround. If this is an issue, we should get it fixed.

gyuho · 2017-12-22T02:50:18Z

@sangleganesh Have you tried the latest etcd?

dvonthenen · 2018-02-15T02:05:29Z

@sangleganesh @gyuho @xiang90 I have verified the issue with the reported server version v3.0.3 and client library version v3.2.0.

I have also retested using the server and client version v3.3.1 and the issue is reproducible in that version as well. Falls into the case <-ctx.Done(): // user cancel where the error is context deadline exceeded. Looking into this a little more.

xiang90 · 2018-02-15T03:46:59Z

@dvonthenen

great. thanks for looking into this.

tydra-wang · 2021-06-21T03:28:25Z

maybe the same problem.

I have a three-node etcd cluster. session.Done() in client often returns when I kill one node of etcd.

sakateka · 2021-07-27T18:39:58Z

I have investigated the given issue.
This is due to the classic data race in a distributed system.

The problem is reproducible even in the latest 3.5.0 release, for reproduction it is enough to create a client and try to get a sufficiently large number of sessions (to get a lease), the problem may appear even when getting the first lease
These steps are automated here

At some point, a situation will occur in the cluster in which the master will send a request to commit changes to the lease repository , but he himself has not yet applied these changes.
Before applying the changes on the master, a request comes from the client with a request to get the TTL of this lease (but the master has not yet added it to himself), which will eventually lead to an error - the lease was not found.

In my view, there are two solutions to this problem.
1.good - do not end LeaseGrant call until changes are applied to the master
2. bad, which will only hide the problem - when creating a session, make the first request for renewal of the lease after half of the lease lifetime.

serathius · 2022-06-13T11:28:57Z

I think this was addressed by @ahrtr. Can you confirm?

ahrtr · 2022-06-13T11:34:14Z

This issue should have been resolved in 3.5.3 and the main branch. Please see 13932 .

Please feel free to reopen this issue if you still can reproduce this on etcd 3.5.3+.

heyitsanthony self-assigned this Jul 18, 2017

xiang90 added this to the v3.4.0 milestone Nov 26, 2017

xiang90 unassigned heyitsanthony Nov 26, 2017

sangleganesh closed this as completed Dec 22, 2017

xiang90 reopened this Dec 22, 2017

dvonthenen mentioned this issue Feb 15, 2018

2018Q1 coreos/etcd contributions thecodeteam/roadmap#225

Open

9 tasks

gyuho added area/clientv3 stage/investigating labels Feb 25, 2018

ccressent mentioned this issue Jun 13, 2019

Cancelled contexts in watchers cause high CPU usage sensu/sensu-go#3012

Closed

3 tasks

gyuho modified the milestones: etcd-v3.4, etcd-v3.5 Aug 5, 2019

serathius removed this from the etcd-v3.5 milestone Dec 15, 2021

serathius added stage/tracked and removed stage/investigating labels Jun 13, 2022

ahrtr closed this as completed Jun 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

session.Done() returns even when all etcd servers in a cluster are online #8181

session.Done() returns even when all etcd servers in a cluster are online #8181

sangleganesh commented Jun 27, 2017

heyitsanthony commented Jul 13, 2017

sangleganesh commented Jul 14, 2017

gyuho commented Dec 22, 2017

sangleganesh commented Dec 22, 2017

xiang90 commented Dec 22, 2017

gyuho commented Dec 22, 2017

dvonthenen commented Feb 15, 2018 •

edited

Loading

xiang90 commented Feb 15, 2018

tydra-wang commented Jun 21, 2021

sakateka commented Jul 27, 2021

serathius commented Jun 13, 2022

ahrtr commented Jun 13, 2022

session.Done() returns even when all etcd servers in a cluster are online #8181

session.Done() returns even when all etcd servers in a cluster are online #8181

Comments

sangleganesh commented Jun 27, 2017

heyitsanthony commented Jul 13, 2017

sangleganesh commented Jul 14, 2017

gyuho commented Dec 22, 2017

sangleganesh commented Dec 22, 2017

xiang90 commented Dec 22, 2017

gyuho commented Dec 22, 2017

dvonthenen commented Feb 15, 2018 • edited Loading

xiang90 commented Feb 15, 2018

tydra-wang commented Jun 21, 2021

sakateka commented Jul 27, 2021

serathius commented Jun 13, 2022

ahrtr commented Jun 13, 2022

dvonthenen commented Feb 15, 2018 •

edited

Loading