Skip to content

fix(libev): guard handle_read/handle_write against close() race condition#889

Open
vponomaryov wants to merge 1 commit into
scylladb:masterfrom
vponomaryov:fix-race-issue-614
Open

fix(libev): guard handle_read/handle_write against close() race condition#889
vponomaryov wants to merge 1 commit into
scylladb:masterfrom
vponomaryov:fix-race-issue-614

Conversation

@vponomaryov
Copy link
Copy Markdown

@vponomaryov vponomaryov commented May 19, 2026

When close() is called from one thread, it sets is_closed=True and closes the socket immediately. However, libev watchers are stopped asynchronously in _loop_will_run(), so handle_read()/handle_write() can still fire on the now-closed fd, causing EBADF errors that surface as ConnectionShutdown('Bad file descriptor') and prevent reconnection.

So, fix it applying following changes:

  • Early-return guards at the top of handle_read() and handle_write() that check is_closed/is_defunct before touching the socket
  • Secondary is_closed/is_defunct checks in error handlers to catch the race when close() happens between watcher dispatch and syscall
  • Peer disconnect detection (EBADF, ECONNRESET, ENOTCONN, etc.) that calls close() cleanly instead of defunct()
  • last_error preservation in close() when connected_event is unset, preventing factory() from returning dead connections

Fixes: #614

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass test.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have provided docstrings for the public items that I want to introduce.
  • I have adjusted the documentation in ./docs/source/.
  • I added appropriate Fixes: annotations to PR description.

@vponomaryov
Copy link
Copy Markdown
Author

@vponomaryov
Copy link
Copy Markdown
Author

vponomaryov commented May 22, 2026

Verified the test fix here:

Test run without this fix (for comparison) is here:

Comment from the "zeus" comparing the fixed test run and issue affected:

@zeus posted:

Conclusion: The target run did NOT hit the SCT-83 bug. The fix works.

Comparison

Aspect Reference run (SCT-83 bug) 8a5b983a Target run (fix branch) 0d335d55
Status FAILED PASSED
Branch origin/master origin/fix-pydrv-race-issue-614
Errors 6 ERROR events (3 SoftTimeoutEvent + 3 FailedResultEvent) 0 errors
Test longevity_test.LongevityTest.test_custom_time Same
Scylla version 2026.3.0~dev Same
Backend/Region OCI / us-ashburn-1 Same
Nodes 6 6

Healthcheck duration comparison

Reference run (with bug) — progressive degradation pattern:

  • Healthchecks №1-7: 9-10s (normal)
  • Healthcheck №8: 46s (degradation begins)
  • Healthcheck №9: 1m38s
  • Healthcheck №10: 3m20s (198s, exceeds 180s soft timeout → ERROR)
  • Healthcheck №11: 3m50s (228s, exceeds timeout → ERROR)
  • Healthcheck №12: 2m8s
  • Healthcheck №13: 1m38s
  • Healthcheck №14: 1m7s
  • Healthcheck №15: 2m33s
  • Healthcheck №16: 46s
  • Healthcheck №17: 22m33s (1351s, massively exceeds timeout → ERROR)

Target run (with fix) — consistently healthy:

  • All 20 healthchecks: 9-11s (every single one normal, no degradation)

Analysis

The SCT-83 / python-driver#614 bug manifests as a progressive increase in healthcheck duration over the course of the test, caused by a race condition in the python driver. In the reference run, healthcheck times escalated from ~10s to minutes, eventually hitting 22m33s — well beyond the 180s soft timeout.

The target run, running on branch fix-pydrv-race-issue-614, shows zero degradation — all 20 healthchecks completed in 9-11 seconds throughout the entire 4+ hour test. No SoftTimeoutEvents, no FailedResultEvents, no errors at all.

The fix completely eliminates the healthcheck timeout escalation pattern characteristic of SCT-83.

@fruch , @soyacz ^
This has proven to fix the https://scylladb.atlassian.net/browse/SCT-83 bug which is identical to the GH issue - #614

@vponomaryov
Copy link
Copy Markdown
Author

vponomaryov commented May 22, 2026

Note that the libenv tests have passed successfully - related to the PR.
The failed integration tests are the asyncio-based - not related to this PR.

@scylladb/python-driver-maint please merge and release it.

Comment on lines +367 to +371
elif self.is_closed or self.is_defunct:
# Socket was closed by another thread between the
# watcher firing and us calling send(). This is the
# race described in scylladb/python-driver#614.
return
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why this is necessary. Without this, we will go into else branch, call self.defunct(err) which will do nothing, and then return.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, defunct() does have its own if self.is_defunct or self.is_closed: return guard, so no crash or incorrect state transition would occur.

But, the guard is still valuable because of the following reasons:

  • Without the guard we will get missleading log message if the error is EBADF (which is in _PEER_DISCONNECT_ERRNOS), the code falls into the _PEER_DISCONNECT_ERRNOS branch first (it comes before else), logging "Connection %s closed by peer during write".
    But the socket wasn't closed by the peer - we closed it via close().

  • The _PEER_DISCONNECT_ERRNOS branch calls self.close(), which is a no-op when already closed but still acquires self.lock - a minor unnecessary overhead.

  • The guard makes the race condition handling self-documenting.
    Without it, a reader must trace through multiple branches to confirm the behavior is safe.

So, I find it rather useful than redundant.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks for the explanation, I did not pay much attention to this path. Now I see a different potential issue, mentioned by Avi ( #614 (comment) )

If we are receiving EBADF, then we performed some action (in this case READ / WRITE) on a closed file descriptor. As far as I know, file descriptors can be reused. So, if another connection was opened between this one being closed and watchers exiting, it makes it possible for us to accidentaly read / write to a totally unrelated connection (or even some file opened by a user app). Am I wrong?

If this is true, then this patch hides the immediate symptom of the problem, but also hides a much more dangerous problem that could cause data corruption.

I don't know the libev connection impl that well, but maybe the proper approach is to, in close, first stop the watchers somehow and make sure that libev won't touch the fd anymore, and only then close the fd?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Situation where we accidentally read/write to a different connection is, probably, architecturally valid for C/C++ programs using raw fds, but doesn't apply in case of Python.
The Python socket abstraction prevents it.

Python's socket.socket object tracks state internally - after .close(), any operation on that object raises EBADF regardless of whether the OS reused the fd number for something else.
The self._socket reference always points to the old (closed) Python object, never to the new socket that happened to get the same fd.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

libev get a raw fd, not PYthon socket, but the watchers themselves use the Python socket so I guess this issue really can't happen.
The watchers also seem to be correctly removed from the loop in close (connection_destroyed), so they won't keep triggering forever. I think your change should be fine.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An improvement for the "defunct" function sounds as non-blocker for this PR - may be done separately.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, but you still need to route error handling to defunct

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dkropachev this issue is lingering for more 6 months now, can we get it merged ? and later you can fix any small nitpicks ?

If not, I'll merge and release it myself.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dkropachev
The _PEER_DISCONNECT_ERRNOS logic sits between the guard and the defunct().

If we remove lines 367-371 and let errors route to defunct, here's what actually happens for the race case (EBADF after close():

elif self.is_closed or self.is_defunct: # REMOVED
    return
elif getattr(err, 'errno', err.args[0]) in _PEER_DISCONNECT_ERRNOS: # EBADF hits HERE
    log.debug("Connection %s closed by peer during write: %s", ...)
    self.close()
else:
    self.defunct(err) # never reached for EBADF

EBADF is in _PEER_DISCONNECT_ERRNOS, so it won't reach defunct() - it hits the peer-disconnect branch first, logs a misleading closed by peer message, and calls close() (a no-op that still acquires self.lock.

Calling the self.defunct(err) explicitly instead of the return would also be pointless because the defunct() checks if self.is_defunct or self.is_closed: return at the top and does nothing.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In short words ^: we should keep it as-is.

Comment on lines +421 to +425
elif self.is_closed or self.is_defunct:
# Socket was closed by another thread between the watcher
# firing and us calling recv(). This is the race described
# in scylladb/python-driver#614.
return
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this, if the error is in _PEER_DISCONNECT_ERRNOS (EBADF), the code would execute following:

self.last_error = ConnectionShutdown(
    "Connection to %s was closed by peer" % self.endpoint)
self.close()

It will overwrite the last_error with an incorrect closed by peer message when in reality close() was already called by us.
Since last_error is used by factory() to detect why a connection died, this overwrite could mask the real error.

Comment on lines +234 to +253
class LibevPeerDisconnectErrnos(unittest.TestCase):
"""Verify _PEER_DISCONNECT_ERRNOS contains expected values."""

def setUp(self):
_skip_if_unavailable(self)

def test_contains_ebadf(self):
self.assertIn(errno.EBADF, _PEER_DISCONNECT_ERRNOS)

def test_contains_econnreset(self):
self.assertIn(errno.ECONNRESET, _PEER_DISCONNECT_ERRNOS)

def test_contains_econnaborted(self):
self.assertIn(errno.ECONNABORTED, _PEER_DISCONNECT_ERRNOS)

def test_contains_enotconn(self):
self.assertIn(errno.ENOTCONN, _PEER_DISCONNECT_ERRNOS)

def test_contains_eshutdown(self):
self.assertIn(errno.ESHUTDOWN, _PEER_DISCONNECT_ERRNOS)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see no point in having such tests. It just cements the current code, not testing the behavior.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind removing it

@vponomaryov vponomaryov requested a review from Lorak-mmk May 22, 2026 11:00
@vponomaryov
Copy link
Copy Markdown
Author

The effect of the python driver issue was slow health checks in SCT.
Example from a test run with the issue:
sct-run-without-fix

And here how much time it takes with the fix:
sct-run-with-fix

So, numbers tell for itself.

@Lorak-mmk
Copy link
Copy Markdown

@sylwiaszunejko could you take a look at asyncio failures? Why did they start to happen, do we have an issue open?

@sylwiaszunejko
Copy link
Copy Markdown
Collaborator

sylwiaszunejko commented May 22, 2026

@sylwiaszunejko could you take a look at asyncio failures? Why did they start to happen, do we have an issue open?

I thought it was due to switch to scylla version 2026.1 and #330, but I see it is supposed to be fixed by #884

@vponomaryov
Copy link
Copy Markdown
Author

vponomaryov commented May 22, 2026

@Lorak-mmk from the PR completeness point of view, should I remove that unit test class?

@sylwiaszunejko could you take a look at asyncio failures? Why did they start to happen, do we have an issue open?

I thought it was due to switch to scylla version 2026.1 and #330, but I see it is supposed to be fixed by #884

So, I need to rebase the PR?
Even if we know that this PR doesn't affect the failed tests?

@Lorak-mmk
Copy link
Copy Markdown

@Lorak-mmk from the PR completeness point of view, should I remove that unit test class?

Tbh I don't care that much.

So, I need to rebase the PR?

If you could then it would be great. I know you didn't touch asyncio, but let's just be sure. Change in one backend affecting another backend would not be the weirdest thing we saw in this driver.

…tion

When close() is called from one thread, it sets is_closed=True and
closes the socket immediately. However, libev watchers are stopped
asynchronously in _loop_will_run(), so handle_read()/handle_write()
can still fire on the now-closed fd, causing EBADF errors that
surface as ConnectionShutdown('Bad file descriptor') and prevent
reconnection.

So, fix it applying following changes:
- Early-return guards at the top of handle_read() and handle_write()
  that check is_closed/is_defunct before touching the socket
- Secondary is_closed/is_defunct checks in error handlers to catch
  the race when close() happens between watcher dispatch and syscall
- Peer disconnect detection (EBADF, ECONNRESET, ENOTCONN, etc.)
  that calls close() cleanly instead of defunct()
- last_error preservation in close() when connected_event is unset,
  preventing factory() from returning dead connections

Fixes: scylladb#614
@vponomaryov vponomaryov force-pushed the fix-race-issue-614 branch from 5dbc94b to ad60335 Compare May 22, 2026 13:36
@vponomaryov
Copy link
Copy Markdown
Author

Rebased.

@Lorak-mmk
Copy link
Copy Markdown

@dkropachev Please take a look at this.

Comment on lines +367 to +371
elif self.is_closed or self.is_defunct:
# Socket was closed by another thread between the
# watcher firing and us calling send(). This is the
# race described in scylladb/python-driver#614.
return
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, but you still need to route error handling to defunct

@dkropachev
Copy link
Copy Markdown
Collaborator

@sylwiaszunejko could you take a look at asyncio failures? Why did they start to happen, do we have an issue open?

One of the tests was enabled that uses TLS with asyncio, the ultimate cure - #884

@fruch
Copy link
Copy Markdown

fruch commented May 25, 2026

@sylwiaszunejko could you take a look at asyncio failures? Why did they start to happen, do we have an issue open?

One of the tests was enabled that uses TLS with asyncio, the ultimate cure - #884

how asyncio is related to this fix ?

@vponomaryov vponomaryov requested a review from dkropachev May 25, 2026 10:50
@vponomaryov
Copy link
Copy Markdown
Author

@dkropachev see my answer in the respective comment thread for your comment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Driver reported "[Errno 9] Bad file descriptor"

5 participants