SSS: slow incremental sync when an event backlog builds up (i.e. if you're offline for a while) #3223

Closed
ara4n opened this issue Sep 3, 2024 · 3 comments

@ara4n
Member

ara4n commented Sep 3, 2024

Steps to reproduce

  1. Go offline for a few hours/days/weeks in a busy account
  2. Launch EX
  3. Observe that the incremental sync takes tens of seconds, proportional to the time spent offline.

Outcome

What did you expect?

Incremental sync should take O(1) time, not O(N) in the time spent offline.

Specifically, the server should reset the SSS connection after 30m offline (or after 2000 events stack up in the backlog) to force the client to do a paginated initial sync when it next launches rather than a slow unpaginated incr sync.

In future, we should probably paginate the incr sync instead so it syncs rapidly (to avoid overloading the server with lots of unnecessary full initial syncs after every 30m of idleness), but that's a separate MSC.
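The reset rule proposed here can be sketched roughly as follows. This is only an illustration, not Synapse's actual logic: `ConnectionState`, `should_reset_connection`, and both constant names are hypothetical, and the only values taken from this issue are the 30m and 2000-event thresholds.

```python
from dataclasses import dataclass

# Thresholds from the proposal in this issue; the names are hypothetical.
MAX_IDLE_SECONDS = 30 * 60
MAX_BACKLOG_EVENTS = 2000


@dataclass
class ConnectionState:
    """Hypothetical per-connection bookkeeping kept by the server."""
    last_request_ts: float  # time of the last sync request, in seconds
    backlog_events: int     # events queued for the client since that request


def should_reset_connection(state: ConnectionState, now: float) -> bool:
    """Return True if the client should be forced back to a paginated initial sync."""
    idle_for = now - state.last_request_ts
    # Either condition alone is enough to trigger the reset.
    return idle_for >= MAX_IDLE_SECONDS or state.backlog_events >= MAX_BACKLOG_EVENTS
```

A server would presumably call this when a request arrives on an existing connection and, if it returns True, reject the position token so that the client starts over.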

What happened instead?

Slow incr sync. (In theory: I haven't actually had a chance to check this in practice, so this is a theoretical vulnerability.)

Your phone model

No response

Operating system version

No response

Application version

697

Homeserver

No response

Will you send logs?

No

ara4n added the T-Defect label Sep 3, 2024
@erikjohnston
Member

erikjohnston commented Sep 3, 2024

From a backend point of view, we should add logic to reset the connection if it looks like there are "a lot" of updates to send in response to a request. It's sub-optimal to have to do this, as we end up sending down all the old rooms all over again, wasting server and client resources and bandwidth. Still, it's a good stop-gap.

From a client/SDK point of view, I think it'd be good to reduce the range back down to [0,19] after some time of inactivity (but not reset the connection, like we do when we see a connection error/timeout). This will then allow the server to (hopefully) respond quickly to the first request, and the client can fetch the rest of the updates as the list grows again. I don't think it matters too much if we reduce to [0,19] relatively quickly, so I'd suggest a 30m timer as a good first try.
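A minimal sketch of that client-side idea, assuming a hypothetical `SlidingSyncList` record and helper (this is not the matrix-rust-sdk API): after 30m without a sync the requested window shrinks to [0,19], while the position token is left alone so the connection itself is not reset.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

INACTIVITY_SECONDS = 30 * 60  # the 30m timer suggested above
INITIAL_RANGE = (0, 19)       # the small window suggested above


@dataclass
class SlidingSyncList:
    """Hypothetical client-side view of one sliding sync list."""
    window: Tuple[int, int]  # room index range requested from the server
    pos: Optional[str]       # connection position token
    last_sync_ts: float      # time of the last successful sync, in seconds


def range_for_next_request(lst: SlidingSyncList, now: float) -> Tuple[int, int]:
    """Shrink the window after inactivity; `pos` is deliberately left untouched."""
    if now - lst.last_sync_ts >= INACTIVITY_SECONDS:
        lst.window = INITIAL_RANGE
    return lst.window
```

Keeping `pos` is what distinguishes this from the error/timeout path: the server can still answer incrementally, just for a much smaller window on the first request.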

@erikjohnston
Member

It sounds like, since the SDK doesn't persist the pos tokens, this case is unlikely to be hit in practice, though we should still do something here.

@erikjohnston
Member

Reduce range issue: matrix-org/matrix-rust-sdk#3935
Persist position across restart issue: matrix-org/matrix-rust-sdk#3936

Reset connection server side: element-hq/synapse#17653
