You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
{{ message }}
This repository has been archived by the owner on Jul 21, 2021. It is now read-only.
We recently ran into an issue in which every single server in our go cluster crashed at about the same time due to the presence of a log.Fatal in this library. The general sequence of operations was:
The go-zk library connected to a slow ZK server.
Authentication with the server is slow (but works). The go-zk library sends a "set watch" request, and spawns a goroutine that calls log.Fatal if the "set watch" request fails.
The go-zk library enters a receive loop goroutine, which encounters a read timeout (since it sets a read deadline on its read). This read timeout propagates to the "set watch" response from (2) and causes the entire process to die due to the call to log.Fatal().
I'm not sure what the correct behavior here is, but it seems like this library shouldn't be calling log.Fatal just because a read was slow. Ideally the library would just back off and retry again on the new connection sometime later.
For more details, here's an analysis of the log lines that we saw in stderr on the affected servers:
This first log line ("Failed to connect...") corresponds to a call to c.connect() in the (*Conn).loop() method:
func (c *Conn) connect() {
// ...
for {
zkConn, err := c.dialer("tcp", c.servers[c.serverIndex], c.connectTimeout)
if err == nil {
c.conn = zkConn
c.setState(StateConnected)
return
}
log.Printf("Failed to connect to %s: %+v", c.servers[c.serverIndex], err)
// ...
}
}
So it looks like the library tried to connect to zoo09, and it failed printing the "Failed to connect" error. Then we slept for a while, and successfully connected to another zk server (which happens to be slow for whatever reason), exiting this loop.
The second log line corresponds to a read error in recvLoop():
When loop() receives this error, it prints it out, then returns the same error to any outstanding requests:
// Yeesh
if err != io.EOF && err != ErrSessionExpired && !strings.Contains(err.Error(), "use of closed network connection") {
log.Println(err)
}
select {
case <-c.shouldQuit:
c.flushRequests(ErrClosing)
return
default:
}
The problem is, there is an outstanding request being sent concurrently in sendSetWatches(), which is called by authenticate() upon successfully connecting to a server. (Note that the read that occurs in authenticate() does not time out like the later read in recvLoop() because go-zk does not set a read deadline for the read that occurs in authenticate().) Here is the relevant snippet of code:
func (c *Conn) sendSetWatches() {
// ...
go func() {
res := &setWatchesResponse{}
_, err := c.request(opSetWatches, req, res, nil)
if err != nil {
log.Fatal(err)
}
}()
}
Since all outstanding requests returned a timeout error, the log.Fatal is triggered, which caused the server to die.
The text was updated successfully, but these errors were encountered:
Thanks for the detailed report. I added a test for slow servers and changed that log.Fatal to a log.Printf. I think that should be fine because the recvLoop handles the error by closing the connection anyway causing another server to be picked.
There's probably other issues around slow servers that the library isn't handling very well, but I agree that the Fatal really should not be there. Sorry it caused such a wide problem.
Sign up for freeto subscribe to this conversation on GitHub.
Already have an account?
Sign in.
We recently ran into an issue in which every single server in our go cluster crashed at about the same time due to the presence of a
log.Fatal
in this library. The general sequence of operations was:I'm not sure what the correct behavior here is, but it seems like this library shouldn't be calling log.Fatal just because a read was slow. Ideally the library would just back off and retry again on the new connection sometime later.
For more details, here's an analysis of the log lines that we saw in stderr on the affected servers:
This first log line ("Failed to connect...") corresponds to a call to c.connect() in the
(*Conn).loop()
method:So it looks like the library tried to connect to zoo09, and it failed printing the "Failed to connect" error. Then we slept for a while, and successfully connected to another zk server (which happens to be slow for whatever reason), exiting this loop.
The second log line corresponds to a read error in
recvLoop()
:When
loop()
receives this error, it prints it out, then returns the same error to any outstanding requests:The problem is, there is an outstanding request being sent concurrently in
sendSetWatches()
, which is called byauthenticate()
upon successfully connecting to a server. (Note that the read that occurs inauthenticate()
does not time out like the later read inrecvLoop()
because go-zk does not set a read deadline for the read that occurs inauthenticate()
.) Here is the relevant snippet of code:Since all outstanding requests returned a timeout error, the log.Fatal is triggered, which caused the server to die.
The text was updated successfully, but these errors were encountered: