Skip to content

websocket-client socket mode client doesn't handle reconnects properly #1379

@leifwalsh

Description

@leifwalsh

(Filling out the following details about bugs will help us solve your issue sooner.)

Reproducible in:

pip freeze | grep slack
python --version
sw_vers && uname -v # or `ver`

slack-sdk==3.21.3
Python 3.11.3
Debian 10

The Slack SDK version

slack-sdk==3.21.3

Python runtime version

Python 3.11.3

OS info

Debian 10

Steps to reproduce:

(Share the commands to run, source code, and project settings (e.g., setup.py))

  1. Use slack_sdk.socket_mode.websocket_client.SocketModeClient
  2. Let the app run for long enough
  3. Receive a websocket-level disconnect

Expected result:

The client should properly reconnect

Actual result:

The client repeatedly tries to reconnect without re-issuing a wss url, causing all connections to fail in a loop forever

The logs are roughly the same as described in slackapi/bolt-python#470. It repeatedly gets WebSocketConnectionClosedException. Specifically, websocket logs Connection to remote host was lost. - goodbye and slack_sdk.socket_mode.websocket_client logs on_error invoked (error: WebSocketConnectionClosedException, message: Connection to remote host was lost.)

I believe the problem is that SocketModeClient.is_connected() just checks whether self.current_session is not None (https://github.com/slackapi/python-slack-sdk/blob/v3.21.3/slack_sdk/socket_mode/websocket_client/__init__.py#L139). If you compare that with the builtin client, it also checks whether connection.is_active() (https://github.com/slackapi/python-slack-sdk/blob/v3.21.3/slack_sdk/socket_mode/builtin/client.py#L152) which checks if self.sock is not None (https://github.com/slackapi/python-slack-sdk/blob/v3.21.3/slack_sdk/socket_mode/builtin/connection.py#L198). I think what happens is that in a certain type of disconnect scenario, BaseSocketModeClient.connect_to_new_endpoint() doesn't call issue_new_wss_url() when it should, because it thinks we're still connected (https://github.com/slackapi/python-slack-sdk/blob/v3.21.3/slack_sdk/socket_mode/client.py#L75).

I'm not sure how reliable that is in theory, there might be race conditions still. However, empirically, it seems that websocket-client's connection class does a reasonable job of setting its own .sock attribute to None when handling disconnects. I subclassed slack_sdk.socket_mode.websocket_client.SocketModeClient and overrode is_connected() with a version that also checks if self.current_session.sock is falsey, and ran that in prod for a couple weeks, and it seems like when this happens it does correctly reconnect. Running the app in a HA configuration (at least two instances running) means that, as long as they both don't disconnect simultaneously, I think we retain at least one active connection consistently and don't drop messages.

I think the fix is probably simple: just check that relying on checking websocket._app.WebSocketApp.sock is something you're comfortable doing, and then add that condition to slack_sdk.socket_mode.websocket_client.SocketModeClient.is_connected().

Metadata

Metadata

Assignees

Labels

Version: 3xbugM-T: A confirmed bug report. Issues are confirmed when the reproduction steps are documentedsocket-mode

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions