BLPop server timeout versus sync/async connection timeout

When you use BLPop with a nonzero server timeout, there is no check against the connection sync/async timeout.
I think the library should check against this and throw an exception if you try to use a non zero timeout that exceeds the sync/async timeout.

The reason for this is it caused me hours of debugging before I realized something detrimental that is happening as a side effect.

Let's consider an easy mistake to make, the same mistake I did.  The BLPop notes that you need a longer sync/async timeout than server timeout, but the sync/async timeout is in milliseconds. Meanwhile there is no indication that BLPop server side timeout is actually a double in seconds. So while I had my sync/async timeout at 120000 milliseconds, I set my BLPop to 60000 thinking it was in milliseconds as well.  Here's where things get interesting.

I was already expecting redis timeouts as a possibility, so I just handled them gracefully in the same way that a server timeout would happen, and let the connection try again for another event.  However, something very curious happens when a sync/async timeout occurs. The multiplexer behind all this appears to treat the connection as no longer valid, but still keeps a reference to it, presumably because the BLPop is still waiting. Then it seems to "reconnect" and make a new set of sockets for normal and pub/sub connections under the hood, this increases the redis client connection count.

The net effect of this is that for a connection that was long lived in my case, every 2 minutes it will hit the connection timeout (unintentionally in my case because the server timeout was much higher than the connection timeout) and it starts a new set of sockets for the connection. Meanwhile the old sockets remain open on redis based on `client list` details.

In my case, over a 12 hour period I racked up 5000 redis connection that I could not account for, and it seems it's due to reconnections of existing connections after timeouts during blocking operations.  I think what happened is despite the connect async/sync timeout being hit, the connection is still alive and the BLPop is still "stuck" until the nearly 17 hour timeout from 60000 seconds. Eventually they may have finally got out of that BLPop and closed the connection, but ultimately this all happened because of a bad assumption about seconds versus milliseconds and the library could check for this condition to ensure you don't have server timeouts that exceed the connection timeouts causing extra sockets to hang out when it reconnects.  Simply put it doesn't make sense to have a blocking timeout that exceeds the connection timeouts, which also puts support for zero/infinite timeout into question when piggybacking off the SE redis library.

Or, perhaps there is a way in BLPop to detect that the multiplexer has started new sockets and the old ones are waiting to terminate because the BLPop is holding it open, then just unblock and return null like a server timeout would, because the connection is trying to terminate and BLPop is holding it open.

I am not sure about use cases of zero for BLPop server timeout, but I suspect this would cause that situation even more grief with default async/sync timeout settings and causing more background connections that last until the BLPop is released some other way, which could really rack up a lot of connections quickly and over very long periods if more and more connections keep getting made trying to pop the same key.

For what it's worth, I also believe that again in my case, I completely lost events because FIFO ensured the oldest connections got the event and then I believe they left the BLPop with an exception after receiving the event, so there was no return value to process then they closed and the next in line got the next event, but if connections racked up faster than events came in then it never catches up to the current connection after that and new connections just keep adding up stuck in the long wait for an event.

tl;dr

Make sure blocking operations that pass a non zero server timeout cannot pass a timeout that exceeds the connection sync/async timeout.  BLPop should check against connection sync timeout, and BLPopAsync should check against connection async timeout.  Allowing blocking timeouts that exceed connection timeouts opens a hole for potentially losing data and incurring excessive redis connection hanging.

Easy reproduction by setting connection timeout to 60 seconds, and then do BLPop with a server timeout of 60000 and watch the connections add up.  Data will disappear if you wait a few minutes, then PUSH an event externally to the key, and watch the event won't be received by your old connection BLPop code, and your newest connection won't see it either.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

BLPop server timeout versus sync/async connection timeout #421

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

BLPop server timeout versus sync/async connection timeout #421

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions