Since this propagates over federation and is synced to clients, it contributes to network traffic spam.
Essentially, the USER_SYNC command is sent when a user/device begins a session and again after it has stopped. The process handling presence writing assumes, after 5 minutes of silence, that the worker handling the sync session has died/disappeared/etc. and begins changing the user/device to 'offline', which then flips back to 'online' shortly afterwards because the session isn't actually over yet.
This may happen once and then not reappear for some time, or may recur several times over many minutes. It is a manifestation of the presence writer using a 5 minute timeout when tracking whether a sync worker has communicated with it, and then not hearing from the sync worker again until another (or the same) user disconnects or comes online.
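To make the timing concrete, here is a minimal sketch of the presence writer's worker-expiry tracking as described above. The class and attribute names are hypothetical (this is not the actual Synapse code); the key assumption is that only a USER_SYNC message refreshes a worker's timestamp, so a worker whose clients are all quietly syncing still "expires" after 5 minutes.

```python
EXTERNAL_PROCESS_EXPIRY_MS = 5 * 60 * 1000  # the presence writer's 5 minute worker timeout


class PresenceWriterSketch:
    """Hypothetical sketch of how the presence writer tracks sync workers."""

    def __init__(self) -> None:
        # sync worker instance id -> last time it sent USER_SYNC, in ms
        self.last_user_sync_ms: dict[str, int] = {}

    def on_user_sync(self, instance_id: str, now_ms: int) -> None:
        # USER_SYNC (session start or stop) is the ONLY thing that
        # refreshes the timer in this model.
        self.last_user_sync_ms[instance_id] = now_ms

    def expired_workers(self, now_ms: int) -> list[str]:
        # Workers silent for more than 5 minutes are assumed dead; every
        # client attached to them becomes eligible for timeout checking,
        # even if those clients are still happily syncing.
        return [
            instance
            for instance, last in self.last_user_sync_ms.items()
            if now_ms - last > EXTERNAL_PROCESS_EXPIRY_MS
        ]
```

Under this model a single-user homeserver hits the expiry exactly 5 minutes after the session's opening USER_SYNC, which is the window the scenario below walks through.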
Consider a scenario for a single user homeserver:
A local user comes online and begins a normal session (using the /sync endpoint with a timeout of 30 seconds).
After 5 minutes, a sync response takes the full 30 seconds (most likely because there were no sync updates to return).
The client has not completed another sync in that 30-second window, so the timeout machinery starts marking the client as offline.
On the sync handler:
The client's data is added to a list that is checked every 10 seconds; any entry that has been on the list for more than 10 seconds is treated as having stopped syncing.
The client actually remains online throughout this step; the scenario continues below.
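A rough sketch of that 10-second list, assuming the behaviour described above (the class and names are hypothetical, not the real sync-handler code):

```python
CHECK_INTERVAL_MS = 10_000  # hypothetical name for the 10 second check interval


class SyncTimeoutList:
    """Sketch of the sync handler's list of devices that may have stopped syncing."""

    def __init__(self) -> None:
        # (user_id, device_id) -> time the entry was added, in ms
        self._pending: dict[tuple[str, str], int] = {}

    def add(self, user_id: str, device_id: str, now_ms: int) -> None:
        # Called when a device hasn't completed another sync in time.
        self._pending.setdefault((user_id, device_id), now_ms)

    def check(self, now_ms: int) -> list[tuple[str, str]]:
        # Runs every 10 seconds: anything sitting on the list for more
        # than 10 seconds is reported as having stopped syncing.
        stopped = [
            key for key, added in self._pending.items()
            if now_ms - added > CHECK_INTERVAL_MS
        ]
        for key in stopped:
            del self._pending[key]
        return stopped
```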
On the presence writer (remember: the sync handler has not sent a USER_SYNC for over 5 minutes, so any client from that sync handler is eligible for timeout checking):
Timeout checking sees that this client isn't syncing anymore, notices that the last sync timestamp is more than 30 seconds old (even by a millisecond), removes this user/device from the list, and passes a now-empty iterable to _combine_device_states, which defaults to an offline state.
This is then passed into _update_states() where it is persisted and passed over replication/federation.
The next sync starts up, changes the state to online (from the offline state that was replicated to the sync handler) and sends that to the presence writer via set_state(), not USER_SYNC, so the presence writer still thinks the sync handler is gone and repeats the timeout handling, depending on whether this sync takes a full 30 seconds. The timeout handling above repeats until the client actually goes offline for at least 30 real seconds.
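The empty-iterable-defaults-to-offline step can be illustrated with a small sketch. The state ordering and function signature here are assumptions drawn from the description above, not the actual _combine_device_states implementation:

```python
from typing import Iterable

# Rough order of "presence strength"; names and ordering are assumptions.
_STATE_ORDER = {"offline": 0, "unavailable": 1, "online": 2}


def combine_device_states(device_states: Iterable[str]) -> str:
    # Pick the "most present" state across a user's devices. Once the
    # timeout check has removed every device, the iterable is empty and
    # the default wins: the user is reported as offline.
    return max(device_states, key=_STATE_ORDER.__getitem__, default="offline")
```

So the offline flip is not a deliberate decision about the user; it is simply what falls out when the device list has been emptied.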
Adding more local users only delays the start of this scenario: it begins 5 minutes after the last user started syncing.
Suggested fix options:
Change the USER_SYNC documentation/comments and code to use a keep-alive style system for renewing/updating the 5 minute timeout for the sync handler. This avoids unnecessarily processing all clients attached to a given sync handler.
Note: as a downside, this will slightly increase Redis traffic. My example code for this uses a keep-alive interval of 3 minutes and 45 seconds.
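A minimal sketch of fix option 1's decision logic, using the 3 minute 45 second interval mentioned above (the function name and parameters are hypothetical; this only illustrates the timing, not where the keep-alive would be wired in):

```python
KEEP_ALIVE_INTERVAL_MS = (3 * 60 + 45) * 1000  # 3m45s, safely inside the 5 min expiry


def should_send_keep_alive(last_user_sync_ms: int, now_ms: int,
                           has_active_syncs: bool) -> bool:
    # Re-send a USER_SYNC-style keep-alive over replication while any
    # client is still syncing, so the presence writer renews its 5 minute
    # timer for this worker. Idle workers send nothing.
    return has_active_syncs and now_ms - last_user_sync_ms >= KEEP_ALIVE_INTERVAL_MS
```

Choosing an interval comfortably below 5 minutes leaves slack for replication latency while keeping the extra Redis traffic to at most one message per worker every few minutes.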
Change the set_state() function for the PresenceHandler to renew/update the 5 minute timeout for the sync handler. This avoids unnecessarily processing all clients attached to a given sync handler.
Note: this will require a reverse-lookup dict to cross-reference (user_id, device_id) to instance_id. A potential race condition must be handled: the Redis call for USER_SYNC takes place after the HTTP replication call to set_state(). It may be enough to skip the update when the reverse-lookup entry doesn't exist yet (on the first sync of a session), since the full 5 minute window still applies.
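A sketch of fix option 2's reverse lookup, including the race handling described in the note above. All names are hypothetical, and the module-level dicts stand in for state that would live on the presence writer:

```python
# (user_id, device_id) -> owning sync worker's instance id
device_to_instance: dict[tuple[str, str], str] = {}
# sync worker instance id -> last activity, in ms
instance_last_seen_ms: dict[str, int] = {}


def on_user_sync(instance_id: str, user_id: str, device_id: str, now_ms: int) -> None:
    # USER_SYNC both refreshes the worker's timer and populates the
    # reverse lookup used by set_state() below.
    device_to_instance[(user_id, device_id)] = instance_id
    instance_last_seen_ms[instance_id] = now_ms


def on_set_state(user_id: str, device_id: str, now_ms: int) -> None:
    instance = device_to_instance.get((user_id, device_id))
    if instance is None:
        # The race described above: set_state() arrived before USER_SYNC.
        # Skipping is safe because the fresh session still has the full
        # 5 minute window before any timeout checking starts.
        return
    instance_last_seen_ms[instance] = now_ms
```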
Just rip out the tracking of whether a sync handler has expired altogether. I believe this would be fine for presence: if a sync worker dies, the reverse proxy/load balancer should divert to another sync worker (and presence handling is the last thing we should be concerned about anyway). Is a sync worker 'dying' unexpectedly still something we are concerned about?