
Slow zookeeper server can cause a crash in client library #40

Closed
bnham opened this issue Nov 18, 2014 · 1 comment

bnham commented Nov 18, 2014

We recently ran into an issue in which every server in our Go cluster crashed at about the same time because of a log.Fatal call in this library. The general sequence of events was:

  1. The go-zk library connects to a slow ZK server.
  2. Authentication with the server is slow but succeeds. The go-zk library then sends a "set watches" request and spawns a goroutine that calls log.Fatal if that request fails.
  3. The go-zk library enters its receive-loop goroutine, which hits a read timeout (it sets a read deadline on every read). That timeout error propagates to the "set watches" response from (2), and the call to log.Fatal() kills the entire process.

I'm not sure what the correct behavior here is, but it seems like this library shouldn't call log.Fatal just because a read was slow. Ideally it would back off and retry on the new connection later, along the lines of the sketch below.
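
To make the "retry" idea concrete, here is a very rough sketch of the kind of bounded retry I have in mind instead of the fatal exit (the helper name, attempt count, and back-off are all made up, and imports of log and time are omitted; this is not go-zk code):

// retryWithBackoff is a hypothetical helper: it retries fn a few times with
// exponential back-off and returns the last error, rather than killing the
// whole process on the first failure.
func retryWithBackoff(fn func() error, attempts int) error {
    var err error
    for i := 0; i < attempts; i++ {
        if err = fn(); err == nil {
            return nil
        }
        log.Printf("request failed (attempt %d/%d): %v", i+1, attempts, err)
        time.Sleep(time.Second << uint(i)) // back off: 1s, 2s, 4s, ...
    }
    return err
}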

For more details, here's an analysis of the log lines that we saw in stderr on the affected servers:

2014/11/07 06:51:45 Failed to connect to zoo09:2181: dial tcp x.x.x.x:2181: i/o timeout
2014/11/07 06:51:55 read tcp y.y.y.y:2181: i/o timeout
2014/11/07 06:51:55 zk: connection closed

The first log line ("Failed to connect...") corresponds to a call to c.connect() in the (*Conn).loop() method:

func (c *Conn) connect() {
    // ...
    for {
        zkConn, err := c.dialer("tcp", c.servers[c.serverIndex], c.connectTimeout)
        if err == nil {
            c.conn = zkConn
            c.setState(StateConnected)
            return
        }

        log.Printf("Failed to connect to %s: %+v", c.servers[c.serverIndex], err)
        // ...
    }
}

So it looks like the library tried to connect to zoo09 and failed, printing the "Failed to connect" error. It then slept for a while and successfully connected to another ZK server (which happened to be slow for whatever reason), exiting this loop.

The second log line corresponds to a read error in recvLoop():

func (c *Conn) recvLoop(conn net.Conn) error {
    buf := make([]byte, bufferSize)
    for {
        // package length
        conn.SetReadDeadline(time.Now().Add(c.recvTimeout))
        _, err := io.ReadFull(conn, buf[:4])
        if err != nil {
            return err
        }
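
(As an aside, the "i/o timeout" here is just the standard net.Conn deadline behavior: once the deadline set by SetReadDeadline passes, any blocked read fails with that error. A standalone illustration of this, not go-zk code, with error handling elided:)

package main

import (
    "io"
    "log"
    "net"
    "time"
)

func main() {
    // A local listener that accepts a connection but never writes anything,
    // standing in for a slow server.
    ln, _ := net.Listen("tcp", "127.0.0.1:0")
    go func() {
        c, _ := ln.Accept()
        time.Sleep(5 * time.Second) // never sends the 4-byte length prefix
        c.Close()
    }()

    conn, _ := net.Dial("tcp", ln.Addr().String())
    conn.SetReadDeadline(time.Now().Add(100 * time.Millisecond))

    buf := make([]byte, 4)
    _, err := io.ReadFull(conn, buf)
    log.Println(err) // prints "read tcp ...: i/o timeout"
}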

When loop() receives the error returned by recvLoop(), it prints it and then returns the same error to any outstanding requests:

        // Yeesh
        if err != io.EOF && err != ErrSessionExpired && !strings.Contains(err.Error(), "use of closed network connection") {
            log.Println(err)
        }

        select {
        case <-c.shouldQuit:
            c.flushRequests(ErrClosing)
            return
        default:
        }

The problem is that there is an outstanding request in flight, sent concurrently by sendSetWatches(), which authenticate() calls after successfully connecting to a server. (Note that the read in authenticate() does not time out like the later read in recvLoop(), because go-zk does not set a read deadline on it.) Here is the relevant snippet:

func (c *Conn) sendSetWatches() {
    // ...
    go func() {
        res := &setWatchesResponse{}
        _, err := c.request(opSetWatches, req, res, nil)
        if err != nil {
            log.Fatal(err)
        }
    }()
}

Since all outstanding requests were handed that timeout error, the log.Fatal was triggered, which caused our server processes to die.
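
For completeness, the fan-out happens in flushRequests(), which (roughly, from my reading of conn.go; field and type names here are approximate) hands the connection-level error to every request still waiting on a reply, including the set-watches request from step (2):

// Paraphrased sketch of flushRequests: every pending request receives the
// connection-level error as its result, so the goroutine spawned in
// sendSetWatches() sees the read timeout and calls log.Fatal.
func (c *Conn) flushRequests(err error) {
    c.requestsLock.Lock()
    for _, req := range c.requests {
        req.recvChan <- response{zxid: -1, err: err}
    }
    c.requests = make(map[int32]*request)
    c.requestsLock.Unlock()
}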

samuel closed this as completed in 8499ec2 on Nov 19, 2014

samuel (Owner) commented Nov 19, 2014

Thanks for the detailed report. I added a test for slow servers and changed that log.Fatal to a log.Printf. I think that should be fine because recvLoop handles the error by closing the connection anyway, causing another server to be picked.

There are probably other issues around slow servers that the library isn't handling very well, but I agree the Fatal really should not be there. Sorry it caused such a widespread problem.
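
Concretely, the goroutine in sendSetWatches now logs the failure and moves on instead of exiting (paraphrased; see the commit above for the exact change, and the log message wording here is approximate):

go func() {
    res := &setWatchesResponse{}
    _, err := c.request(opSetWatches, req, res, nil)
    if err != nil {
        // Previously this was log.Fatal(err), so a slow or flaky server could
        // take the whole process down. Now the error is only logged; recvLoop
        // will close the connection and the client will pick another server.
        log.Printf("Failed to set previous watches: %+v", err)
    }
}()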
