
Presence state transitions incorrectly when using a sync worker(Redo) #16771

realtyem commented Jan 1, 2024

Since these incorrect presence transitions propagate over federation and are synced down to clients, they contribute to unnecessary network traffic.

Essentially, the USER_SYNC command is sent when a user/device begins a sync session and then again once it has stopped. If the presence writer has not heard from the sync worker for 5 minutes, it assumes that worker has died/disappeared and begins changing the user/device to 'offline', which then flips back to 'online' shortly afterwards because the session isn't actually over yet.

This may happen once and then not reappear for some time, or it may recur several times over many minutes. It is a manifestation of the presence writer using a 5 minute timeout to track when a sync worker last communicated with it, while not hearing from that sync worker again until the same or another user disconnects or comes online.
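For illustration, here is a minimal sketch of that expiry tracking under the behaviour described above. The names are hypothetical stand-ins, not the actual Synapse code: the presence writer remembers the last time it heard USER_SYNC from each sync worker, and once 5 minutes pass without hearing from a worker, all of that worker's clients become eligible for timeout checks.

```python
import time

# The 5 minute expiry the issue describes, as a hypothetical constant name.
SYNC_WORKER_EXPIRY_MS = 5 * 60 * 1000

# instance name -> last time (ms) the presence writer saw a USER_SYNC from it
_last_user_sync_seen: dict = {}

def on_user_sync(instance_name: str) -> None:
    """Record that a sync worker sent USER_SYNC (session start or stop)."""
    _last_user_sync_seen[instance_name] = int(time.time() * 1000)

def expired_sync_workers(now_ms: int) -> list:
    """Workers not heard from for 5+ minutes: all of their clients become
    eligible for the offline timeout checks, even if they are still syncing."""
    return [
        name
        for name, last_seen in _last_user_sync_seen.items()
        if now_ms - last_seen > SYNC_WORKER_EXPIRY_MS
    ]
```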

Consider a scenario on a single-user homeserver:

  1. A local user comes online and begins a normal session (using the /sync endpoint with a timeout of 30 seconds).
  2. After 5 minutes, a sync response takes a full 30 seconds (most likely because there were no sync updates to return).
  3. The client has not completed another sync in that 30 second window, so the timeout machinery starts marking the client as offline.
    • On the sync handler:
      1. The client is added to a list that is checked every 10 seconds to see whether more than 10 seconds have passed since the client was added.
      2. From the sync handler's point of view the client remains online for the rest of this step, but the presence writer's side continues below.
    • On the presence writer (remember: the sync handler has not sent a USER_SYNC for over 5 minutes, so every client from that sync handler is eligible for timeout checking):
      1. The timeout check decides this client isn't syncing anymore: the last sync timestamp is more than 30 seconds old (even by a millisecond), so it removes this user/device from the list and passes a now-empty iterable to _combine_device_states, which defaults to an offline state.
      2. This is then passed into _update_states() where it is persisted and passed over replication/federation.
  4. The next sync starts up, changes the state back to online (from the offline state that was replicated to the sync handler) and sends that to the presence writer normally through set_state(), not USER_SYNC. The presence writer therefore still thinks the sync handler is gone and repeats the timeout handling whenever a sync takes a full 30 seconds. Item 3 above repeats until the client actually goes offline for at least 30 real seconds.

Adding additional local users only delays the start of this scenario: it begins 5 minutes after the last user started syncing.
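To make step 3 concrete, here is a small sketch of the timeout path. The helper names are hypothetical and the states are simplified to strings, so this is not the real _combine_device_states, but it mirrors the behaviour described above: once the sync worker is considered expired, a device whose last sync finished more than 30 seconds ago is dropped, and combining an empty set of device states falls back to offline.

```python
SYNC_ONLINE_TIMEOUT_MS = 30 * 1000

def still_syncing_devices(device_last_sync_ms: dict, now_ms: int) -> dict:
    """Keep only devices whose last sync finished within the 30 second window."""
    return {
        device_id: ts
        for device_id, ts in device_last_sync_ms.items()
        if now_ms - ts <= SYNC_ONLINE_TIMEOUT_MS
    }

def combine_device_states(states: list) -> str:
    """An empty iterable of device states collapses to offline."""
    priority = {"online": 2, "unavailable": 1, "offline": 0}
    return max(states, key=lambda s: priority[s], default="offline")

# Example: the only device last synced 30.001 seconds ago, so it is dropped
# and the combined state persisted by the presence writer becomes "offline".
now_ms = 1_000_000
remaining = still_syncing_devices({"DEVICE1": now_ms - 30_001}, now_ms)
states = ["online" for _ in remaining]  # no devices survived, so this is empty
assert combine_device_states(states) == "offline"
```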

Suggested fix options:

  1. Change the USER_SYNC documentation/comments and code to use a keep-alive style system for renewing/updating the 5 minute timeout for the sync handler. This avoids unnecessarily processing all clients attached to a given sync handler.
    • Note: As a downside, this will slightly increase redis traffic. My example code for this uses a keep-alive interval of 3 minutes and 45 seconds (a rough sketch of this approach follows the list).
  2. Change the set_state() function on the PresenceHandler to renew/update the 5 minute timeout for the sync handler. This likewise avoids unnecessarily processing all clients attached to a given sync handler.
    • Note: This will require a reverse-lookup dict to cross-reference (user_id, device_id) to instance_id. A potential race condition will have to be dealt with: the redis call for USER_SYNC takes place after the HTTP replication call to set_state(). It may be enough to skip the update if the reverse lookup has no entry yet (for the first sync of a session), as the 5 minute window still applies.
  3. Just rip out the tracking of whether a sync handler has expired altogether. I believe this would be fine for presence: if a sync worker dies, the reverse proxy/load balancer should divert traffic to another sync worker (and presence handling is the last thing we should be concerned about anyway). Is a sync worker dying unexpectedly still something we are concerned about?
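For option 1, a rough sketch of what the keep-alive could look like on the sync handler side. The function and its two callbacks are hypothetical names, not a tested patch against Synapse:

```python
import asyncio

# 3 minutes 45 seconds, comfortably under the presence writer's 5 minute expiry.
KEEP_ALIVE_INTERVAL_S = 3 * 60 + 45

async def user_sync_keep_alive(send_user_sync, has_active_syncers) -> None:
    """Periodically re-announce this sync worker while clients are syncing.

    `send_user_sync` stands in for whatever re-sends the replication command,
    and `has_active_syncers` for a check of the handler's in-flight sessions.
    """
    while True:
        await asyncio.sleep(KEEP_ALIVE_INTERVAL_S)
        if has_active_syncers():
            # Renews the presence writer's 5 minute timer for this worker, so
            # its clients are not pushed through the offline timeout path.
            await send_user_sync()
```

The interval is deliberately below the 5 minute expiry so that a single missed tick still leaves a margin before the presence writer considers the worker gone.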

[Attachment: flicker]
