Sync Optimizations #19

Merged: 6 commits into main from protocol-cleanup, Aug 13, 2024

rkistner
Contributor

@rkistner rkistner commented Aug 12, 2024

  1. Remove handling of {target: ...} in MOVE operations. This has been removed from the protocol, and has no active
    implementations. See protocol docs for MOVE.
  2. Never persist MOVE operations - we only care about the checksum.
  3. Do not persist REMOVE operations if any of these hold:
    1. This is an initial sync for the bucket (including adding new buckets). However, these do still need to supersede previous PUT operations.
    2. No previous operation was superseded (a REMOVE for a PUT operation we don't have).
    3. More generally, if the only superseded operations were not applied locally yet, meaning there is nothing to remove locally.
  4. Fix a crash due to wrapping checksums (integer overflow) in debug builds. Release builds did not have this issue.
  5. When receiving a new operation for a row, instead of marking the previous operation as superseded, delete it.
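
For item 4, a minimal sketch of the difference, assuming checksums are 32-bit unsigned values that are summed per bucket (the helper names here are hypothetical, not taken from this PR). In Rust, plain `+` on integers panics on overflow in a default debug build and wraps in a default release build, which matches the behaviour described; `wrapping_add` wraps in both.

```rust
/// Hypothetical helper: fold one operation checksum into a running total.
/// `wrapping_add` wraps modulo 2^32 in every build profile, whereas
/// `total + op_checksum` would panic on overflow in a debug build.
fn add_checksum(total: u32, op_checksum: u32) -> u32 {
    total.wrapping_add(op_checksum)
}

/// Sum a sequence of operation checksums using wrapping arithmetic.
/// The 32-bit width is an assumption made for this illustration.
fn bucket_checksum(op_checksums: &[u32]) -> u32 {
    op_checksums.iter().copied().fold(0u32, add_checksum)
}
```

With this, `add_checksum(u32::MAX, 1)` yields `0` instead of aborting, regardless of build profile.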

Combined, these optimizations could help to significantly speed up initial sync of compacted buckets, where a large percentage of operations are MOVE or REMOVE operations.

This will not have a significant performance impact if:

  1. The bucket is not compacted at all (meaning no MOVE operations).
  2. The bucket is fully defragmented (meaning only PUT operations).
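
The persistence rules from items 2 and 3 of the change list can be sketched as a single predicate. This is only an illustration under stated assumptions: `OpType`, `OpContext` and `should_persist` are hypothetical names, PUT operations are assumed to always be persisted, and the checksum is assumed to be accumulated separately for every operation.

```rust
/// Operation types named in the description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OpType {
    Put,
    Move,
    Remove,
}

/// Hypothetical summary of what the client knows when an operation arrives.
#[derive(Debug, Clone, Copy)]
struct OpContext {
    /// True while this bucket is being synced for the first time.
    initial_sync: bool,
    /// Number of superseded operations for this row that were already applied locally.
    superseded_applied: usize,
}

/// Returns true if the operation should be written to the local oplog.
fn should_persist(op: OpType, ctx: OpContext) -> bool {
    match op {
        // Assumption: row data is always stored.
        OpType::Put => true,
        // MOVE only contributes to the checksum.
        OpType::Move => false,
        OpType::Remove => {
            // Covers both "nothing was superseded" and
            // "the superseded operations were never applied locally".
            let nothing_to_remove_locally = ctx.superseded_applied == 0;
            !(ctx.initial_sync || nothing_to_remove_locally)
        }
    }
}
```

For a bucket containing only PUT operations the predicate returns true for every operation, so nothing is skipped and this part of the change makes no difference.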

Benchmark 1 - many MOVE and REMOVE ops

Test case:

  • Local powersync-service
  • 122,422 total operations (30MB downloaded), with:
    • 2,422 PUT operations (13MB of data)
    • 60,000 MOVE operations
    • 60,000 REMOVE operations

Dart native, Linux desktop

Before: 5.7s
After: 2.7s

Diagnostics app (web sdk)

Disabled dynamic schema generation.

Before: 74s for saving the data, 490s (!) for compacting
After: 7.8s

The big speedup is likely from not filling ps_oplog with many REMOVE operations. Will need further investigation to determine why it was so slow, since we may still get this kind of performance for other cases with a large number of operations.

Benchmark 2 - many PUT ops for small number of rows

Test case:

  • Local powersync-service
  • 60k total operations (21MB downloaded), over 20 rows

Dart native, Linux desktop

Before: 2.34s, 800KB db file, 4.6MB WAL
After: 2.07s, 4KB db file, 2.6MB WAL

This gives roughly a 10% speed improvement, along with significantly reduced storage usage.
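
The smaller db file is consistent with change 5 (deleting the previous operation for a row instead of marking it as superseded). A toy model of that behaviour, assuming an in-memory map keyed by bucket and row id in place of the real SQLite table (all names here are hypothetical):

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for the oplog, keyed by (bucket, row id).
#[derive(Default)]
struct Oplog {
    latest: HashMap<(String, String), u64>,
}

impl Oplog {
    /// Store a PUT. Any earlier operation for the same row is dropped
    /// rather than kept and flagged as superseded.
    /// Returns the op id that was replaced, if any.
    fn put(&mut self, bucket: &str, row_id: &str, op_id: u64) -> Option<u64> {
        self.latest
            .insert((bucket.to_string(), row_id.to_string()), op_id)
    }

    /// Number of operations currently retained.
    fn len(&self) -> usize {
        self.latest.len()
    }
}

/// Replays `total_ops` PUT operations spread round-robin over `rows` rows.
/// `rows` must be non-zero.
fn replay(total_ops: u64, rows: u64) -> Oplog {
    let mut oplog = Oplog::default();
    for op_id in 1..=total_ops {
        let row_id = format!("row-{}", op_id % rows);
        oplog.put("bucket-a", &row_id, op_id);
    }
    oplog
}
```

In this model, `replay(60_000, 20)` retains 20 entries, one per row, rather than 60,000.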

Diagnostics app (web sdk)

Before: 77s, 2.2MB storage
After: 37s, 4.5MB storage

It's not clear why the storage increased in this case.

Future Work

Remove superseded column

We can completely remove the superseded column - it is now always 0. This will require some semi-tricky migrations, so we're not doing it just yet (there may be existing data where superseded = 1).

Optimize compacting

Now that superseded operations are immediately deleted, we can also optimize the compact operations (clear_remove_ops). By combining this with the SET last_applied_op = last_op part, we can significantly reduce the number of rows we need to scan for REMOVE operations after incremental updates. This can give us continuous auto-compacting, instead of the current "compact once every 1000 operations".

Normalize bucket names

Bucket names are primarily used when saving and superseding operations. We already store each synced bucket in ps_buckets. We can use those ids in ps_oplog, instead of the full bucket names. This could reduce storage size and increase performance.
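
A rough sketch of the idea, using an in-memory interner in place of the ids that would come from ps_buckets. The type and method names are hypothetical and this is not a proposed implementation, only an illustration of replacing repeated name strings with small integer ids.

```rust
use std::collections::HashMap;

/// Hypothetical interner mapping bucket names to small integer ids.
#[derive(Default)]
struct BucketIds {
    ids: HashMap<String, u32>,
    names: Vec<String>,
}

impl BucketIds {
    /// Returns the existing id for `name`, or assigns the next free id.
    fn id_for(&mut self, name: &str) -> u32 {
        if let Some(&id) = self.ids.get(name) {
            return id;
        }
        let id = self.names.len() as u32;
        self.names.push(name.to_string());
        self.ids.insert(name.to_string(), id);
        id
    }

    /// Looks the bucket name back up from its id.
    fn name_of(&self, id: u32) -> Option<&str> {
        self.names.get(id as usize).map(String::as_str)
    }
}
```

Each oplog entry would then carry a 4-byte id rather than a copy of the bucket name, and comparisons when saving or superseding operations become integer comparisons.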

@rkistner rkistner marked this pull request as ready for review August 13, 2024 08:51
Contributor

@stevensJourney stevensJourney left a comment

These are some amazing performance improvements.

@rkistner rkistner merged commit 1504864 into main Aug 13, 2024
11 checks passed
@rkistner rkistner deleted the protocol-cleanup branch August 13, 2024 13:56
@rkistner
Contributor Author

Will get a new release with these changes out next week.

Note that for the web SDK, the changes in powersync-ja/powersync-js#266 fix the biggest performance issues, but these changes will give further improvements for buckets with many more operations than actual rows.

@rkistner rkistner self-assigned this Aug 14, 2024
This was referenced Aug 20, 2024