This repository was archived by the owner on May 17, 2024. It is now read-only.

performance: divide raw keyspace into segments, avoid full index scans #32

Closed
wants to merge 6 commits into from

Conversation


@sirupsen sirupsen commented May 19, 2022

@erezsh putting this up for review a little early, because I think this just needs one more pass where I want to incorporate your feedback!

It's more complex than I would like :/ especially the bookkeeping for the count/checksum/start_key/end_key properties.

This is a fairly large change. I explained the basic idea to you already: splitting min(id)...max(id) into equal-sized segments, which I think can work well. Overall, we must never allow any full-index or table scans in the code-base: they won't scale beyond tens of millions of rows (and perform poorly even at that size). I've tried to guard against any being re-introduced.

A significant change from when I showed it to you is that I no longer "probe" for checkpoints. I like not probing far better, because I don't actually need the checkpoints to exist. I suspect the simpler implementation will be more reliable, but maybe there are some cases I haven't considered?
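The core idea can be sketched in a few lines (hypothetical helper, not the PR's actual code): split the raw keyspace min(id)...max(id) into equal-sized segments, so every query is bounded by a key range and nothing ever scans the full index.

```python
def split_keyspace(start_key: int, end_key: int, factor: int):
    """Split the half-open range [start_key, end_key) into roughly
    `factor` equal segments. Each (start, end) pair bounds one query,
    so no single query ever touches the whole index."""
    size = end_key - start_key
    step = max(size // factor, 1)
    boundaries = list(range(start_key, end_key, step))
    # Pair each boundary with the next; the final segment is clamped
    # to end_key so the segments exactly cover the keyspace.
    return list(zip(boundaries, boundaries[1:] + [end_key]))
```

Note the segments are equal in *key range*, not in row count; for a dense keyspace of 25m ids with a factor of 32, each segment covers 781,250 keys.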

In a further iteration, we can allow setting a --bisect-column, which defaults to the --key.

With --bisection-threshold 10000 --bisection-factor 32 on rating <-> rating_del1 with 25m rows in Postgres:

  • Before: 74s
  • After: 16s (4.5x faster)
  • After w/ your concurrency PR Threading Support #30: TBD 🚀 🚀 🚀 5s or so?

(I will rebase commits prior to merging; keeping them raw because there's some old code that could be useful depending on review 👀 )

@sirupsen sirupsen requested a review from erezsh May 19, 2022 19:01
@@ -160,7 +229,7 @@ class TableDiffer:
"""

bisection_factor: int = 32 # Into how many segments to bisect per iteration
-    bisection_threshold: int = 1024**2 # When should we stop bisecting and compare locally (in row count)
+    bisection_threshold: int = 10000 # When should we stop bisecting and compare locally (in row count)

I think 1M is way too high when we are running this against databases far away from the machine. 10k seems more sensible. We can run some benchmarks in the future, but I think this is a safer default

# We only check beyond level > 0, because otherwise we might scan the
# entire index with COUNT(*). For large tables with billions of rows, we
# need to split the COUNT(*) by the `bisection_factor`.
if level > 0 or keyspace_size < self.bisection_threshold:

This should probably get cleaned up into its own function now
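One way that extraction could look, as a free function for illustration (name and signature are my assumptions, not the project's API):

```python
def should_stop_bisecting(level: int, keyspace_size: int, bisection_threshold: int) -> bool:
    """Hypothetical extraction of the inline condition above.

    Always true past level 0; at level 0, only when the raw keyspace is
    already below the threshold -- avoiding a COUNT(*) over the entire
    index, which is exactly the full scan this change eliminates."""
    return level > 0 or keyspace_size < bisection_threshold
```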

assert b.count == 5
differ.set_initial_start_key_and_end_key(a, b)
a.compute_checksum_and_count()
b.compute_checksum_and_count()

In tests this stuff is really ugly... I can create a helper function for it

@@ -61,8 +61,12 @@ def test_basic(self):
differ = TableDiffer(10, 100)

Need to add some more edge-cases here!

if t1.checksum != t2.checksum:
# Apply recursively
-    yield from self._diff_tables(t1, t2, level + 1)
+    yield from self._diff_tables(t1, t2, max(int(bisection_factor/2), 2), level + 1)

@sirupsen sirupsen May 19, 2022


I think it makes more sense to divide it this way, so that the bisection factor you set determines how the first level is divided. That means each query processes roughly the same number of rows. It's how I would've expected it to behave by default, but others may have different expectations.

Since we're always at least halving the space, it's still logarithmic.

I don't feel super strongly before we see real results, but I think this might be a little more performant in the average case where network round-trips are large, and especially on cloud warehouses with high latency. If there are a lot of changes, it seems to perform roughly the same on update001 locally.
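The halving schedule can be illustrated in a few lines (sketch only):

```python
def factor_schedule(initial_factor: int, levels: int) -> list:
    """Bisection factor at each recursion level: the configured factor
    at level 0, then at least halved per level but never below 2, so
    the keyspace still shrinks geometrically and depth stays logarithmic."""
    factors = [initial_factor]
    for _ in range(levels - 1):
        factors.append(max(factors[-1] // 2, 2))
    return factors
```

With `--bisection-factor 32` the schedule is 32, 16, 8, 4, 2, 2, ...: the first level does the heavy division, and deeper levels issue fewer queries each.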


# Compare each pair of corresponding segments between table1 and table2
for i, (t1, t2) in enumerate(safezip(segmented1, segmented2)):
-    logger.info(". " * level + f"Diffing segment {i+1}/{len(segmented1)} of size {t1.count} and {t2.count}")
+    logger.info(". " * level + f"Diffing segment {i+1}/{len(segmented1)} keys={t1.start_key}..{t1.end_key-1}")

We have to remove all the counts because they're too expensive; that's why we only log the keyspace here. I want to log it before computing the checksum in case that's slow, but we could move it after.

# Get the count in the same index pass. Much cheaper than doing it
# separately.
select = self._make_select(columns=[Count(), Checksum(self._relevant_columns)])
result = self.database.query(select, Tuple)

Probably needs some error handling
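A sketch of what minimal error handling around that query could look like (the wrapping, exception type, and empty-segment normalization are my assumptions, not the project's actual API):

```python
def checksum_and_count(database, select):
    """Run the combined Checksum+Count query and normalize the result.

    Sketch only: real error types depend on the driver in use."""
    try:
        count, checksum = database.query(select, tuple)
    except Exception as exc:
        raise RuntimeError(f"checksum/count query failed: {select!r}") from exc
    if not count:
        # Empty segment: COUNT(*) is 0 and the checksum aggregate is
        # NULL; normalize so callers can compare segments directly.
        return 0, None
    return count, checksum
```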

@sirupsen sirupsen force-pushed the no-full-index-scans branch from 5397e6a to eed5377 Compare May 19, 2022 22:13
@sirupsen sirupsen changed the title performance: remove all full index scans, divide keyspace instead performance: divide key-space into bisection_factor pieces, avoiding all full index scans May 19, 2022
@sirupsen sirupsen changed the title performance: divide key-space into bisection_factor pieces, avoiding all full index scans performance: divide keyspace into pieces, avoiding all full index scans May 19, 2022
@sirupsen sirupsen changed the title performance: divide keyspace into pieces, avoiding all full index scans performance: divide raw keyspace into segments, avoiding all full index scans May 20, 2022
@sirupsen sirupsen changed the title performance: divide raw keyspace into segments, avoiding all full index scans performance: divide raw keyspace into segments, avoid full index scans May 20, 2022

@erezsh erezsh left a comment


Partial review.

@@ -80,6 +80,12 @@ def compile(self, parent_c: Compiler):
if self.where:
select += " WHERE " + " AND ".join(map(c.compile, self.where))

if self.where_or:

What's this?


Ah yeah, this is a relic from the first iteration, where I still found checkpoints with this method. Will remove.

@sirupsen

Feedback from Erez on Slack:

  • Create benchmarks with large gaps to ensure performance here is good compared to master (I think it will be!)
  • More thorough tests, e.g. when bisection_threshold < gap and other edge-cases


sirupsen commented May 25, 2022

@erezsh here are fairly comprehensive benchmarks that I think demonstrate a whole bunch of scenarios pretty fairly. I can run more if you like, but I think this is good evidence that we can maintain just this one algorithm, which is a nice simplification!

| Table A | Table B | Bisection Threshold | Factor | Rows | Current MySQL | Current Postgres | New MySQL | New Postgres |
|---|---|---|---|---|---|---|---|---|
| rating | rating | 10_000 | 16 | 1M | 1.8s | 0.7s | 1.5s | 0.9s |
| rating | rating_del1 | 10_000 | 16 | 1M | 6s | 2.4s | 1.7s | 1.1s |
| rating | rating_del1 | 100_000 | 16 | 1M | 5.6s | 2.75s | 2s | 1.4s |
| rating | rating_update001p | 10_000 | 16 | 1M | 30s | 10.8s | 6.6s | 5.1s |
| rating | rating_update001p | 100_000 | 16 | 1M | 14.7s | 11.2s | 10.3s | 7s |
| rating | rating_update1p | 10_000 | 16 | 1M | 41.7s | 20.3s | 14s | 10.6s |
| rating_gap1 | rating_gap1_update0001p | 1000 | 16 | 1M | 132.4s | 19.9s | 72s | 10.9s |
| rating_gap1 | rating_gap1_update0001p | 10_000 | 16 | 1M | 58.6s | 11.1s | 46.5s | 8.5s |
| rating_gap2 | rating_gap2_update0001p | 1000 | 16 | 1M | 86s | 23.4s | 56s | 13.3s |
| rating_gap2 | rating_gap2_update0001p | 10_000 | 16 | 1M | 43s | 12s | 38.2s | 8.9s |

To be fair between the two branches, both were tested with a single thread. I can't wait to see the speedup of this branch with multiple threads!

As you can see, the gains are bigger the fewer differences there are.

It surprises me how much faster Postgres is with gaps. I think MySQL's B-Tree implementation, where the primary key determines the on-disk order, is more sensitive to gaps or something?

FYI, I just ran commands like these:

```
poetry run python -m data_diff postgres://postgres:Password1@127.0.0.1/postgres rating_gap1 postgres://postgres:Password1@127.0.0.1/postgres rating_gap1_update0001p -c timestamp --bisection-threshold 1000 --bisection-factor 16 --update-column timestamp -v -j 1
```


erezsh commented May 26, 2022

I managed to reproduce the behavior that was troubling me. Here are the benchmarks when running on 25m records (Postgres):

| Table A | Table B | branch | time |
|---|---|---|---|
| gap3 | gap3_update | master | 137.70 seconds |
| gap3 | gap3_update | no-scans | 425.50 seconds |

Just 3x slower isn't THAT bad, but this effect would become much worse with a bigint id, or a smaller threshold.
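A back-of-the-envelope sketch of why the gap hurts (my arithmetic, not from the thread): raw-keyspace bisection recurses until a segment's key range drops below the threshold, so depth is driven by max(id) - min(id), not by the row count. With the factor halving per level as in this PR:

```python
def levels_to_threshold(keyspace_size: int, factor: int, threshold: int) -> int:
    """Bisection levels needed before a segment's raw key range drops
    below the local-compare threshold (factor halves per level, min 2)."""
    levels = 0
    size, f = keyspace_size, factor
    while size >= threshold:
        size //= f
        f = max(f // 2, 2)
        levels += 1
    return levels

# Dense ids: 25m rows, keyspace ~25m keys wide.
dense = levels_to_threshold(25_000_000, 16, 10_000)   # 6 levels
# One row moved near 2**31: keyspace ~2**31 keys wide for the same rows.
gapped = levels_to_threshold(2**31, 16, 10_000)       # 12 levels
```

Doubling the recursion depth (and the queries that come with it) is consistent with the ~3x slowdown measured above, and a bigint key near 2**63 would push the depth further still.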

Here's the script to generate the "gap3" tables:

```
const table rating_gap3 = rating

// create_indices(rating_gap1)
// create_indices(rating_gap2)
create_indices(rating_gap3)
commit()

table rating_gap3 {
    userid: int
    movieid: int
    rating: float
    timestamp: int
}

rating_gap3[id == 1000] update {id: 2147483548}
run_sql("INSERT INTO rating_gap3(id, userid, movieid, rating, timestamp) VALUES (2047483548, 1, 1, 5.0, 27)")
commit()

const table rating_gap3_update0001p = rating_gap3
rating_gap3_update0001p[random() < 0.000001] update {timestamp: timestamp + 1}
commit()
```

@erezsh erezsh closed this May 27, 2022