performance: divide raw keyspace into segments, avoid full index scans #32
Conversation
```diff
@@ -160,7 +229,7 @@ class TableDiffer:
     """

     bisection_factor: int = 32  # Into how many segments to bisect per iteration
-    bisection_threshold: int = 1024**2  # When should we stop bisecting and compare locally (in row count)
+    bisection_threshold: int = 10000  # When should we stop bisecting and compare locally (in row count)
```
I think 1M is way too high when we are running this against databases far away from the machine. 10k seems more sensible. We can run some benchmarks in the future, but I think this is a safer default
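For a rough sense of scale (back-of-the-envelope arithmetic, assuming the factor stays at the default 32 on a 25m-row table like the Postgres benchmark below): the first level of bisection gives segments of roughly 780k rows, the second roughly 24k, and the third under 1k. With a 1M threshold we would stop after the first level and download ~780k rows per segment to compare locally; with 10k we keep issuing cheap checksum queries until segments are down to a few hundred rows.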
```python
# We only check beyond level > 0, because otherwise we might scan the
# entire index with COUNT(*). For large tables with billions of rows, we
# need to split the COUNT(*) by the `bisection_factor`.
if level > 0 or keyspace_size < self.bisection_threshold:
```
This should probably get cleaned up into its own function now
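Something like this, perhaps (just a sketch; the helper name and signature are placeholders, not code from this branch):

```python
def _threshold_check(self, level: int, keyspace_size: int) -> bool:
    """Placeholder helper wrapping the condition above.

    We only apply the check beyond level 0, because at level 0 it could mean
    scanning the entire index with COUNT(*); for tables with billions of rows
    the COUNT(*) needs to be split by `bisection_factor` first.
    """
    return level > 0 or keyspace_size < self.bisection_threshold
```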
```python
assert b.count == 5
differ.set_initial_start_key_and_end_key(a, b)
a.compute_checksum_and_count()
b.compute_checksum_and_count()
```
In tests this stuff is really ugly... I can create a helper function for it
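Something like this could tidy it up (a sketch; the helper name is made up, the calls are just the ones from the snippet above):

```python
def prepare_segments(differ, a, b):
    """Test helper (hypothetical): align the key range of both segments and
    compute their checksum and count in one place."""
    differ.set_initial_start_key_and_end_key(a, b)
    a.compute_checksum_and_count()
    b.compute_checksum_and_count()
    return a, b
```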
```diff
@@ -61,8 +61,12 @@ def test_basic(self):
     differ = TableDiffer(10, 100)
```
Need to add some more edge-cases here!
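For instance, something along these lines (purely illustrative; the fixtures and the `diff_tables` entry point are assumptions, not tests from this PR):

```python
def test_empty_tables(self):
    # Both sides empty: expect no differences and no expensive scans.
    differ = TableDiffer(10, 100)
    self.assertEqual(list(differ.diff_tables(self.empty_a, self.empty_b)), [])

def test_single_update(self):
    # Keyspace below the bisection threshold: should fall straight through to a local compare.
    differ = TableDiffer(10, 100)
    diff = list(differ.diff_tables(self.table_a, self.table_a_one_update))
    # Assuming an updated row is reported as one removed ('-') and one added ('+') entry:
    self.assertEqual(len(diff), 2)
```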
data_diff/diff_tables.py (Outdated)
```diff
 if t1.checksum != t2.checksum:
     # Apply recursively
-    yield from self._diff_tables(t1, t2, level + 1)
+    yield from self._diff_tables(t1, t2, max(int(bisection_factor/2), 2), level + 1)
```
I think it makes more sense to divide this so that you set the bisection factor based on how you want to divide the first level. This means you'll always process roughly the same amount of rows in each query. This is how I would've expected it to behave by default, but others may have other expectations
Since we're always at least halving the space, it's still logarithmic.
I don't feel super strongly before we show this in real results, but I think this might be a little more performant in the average case where you have large network roundtrips, and especially on Cloud Warehouses with high latency. If there's a lot of changes, it seems to perform roughly the same on update001 locally.
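To make the "still logarithmic" point concrete, here's a tiny back-of-the-envelope model (not code from this branch) of how per-segment size shrinks when the factor is halved at each level, with a floor of 2:

```python
def segment_sizes(total_rows, bisection_factor, bisection_threshold):
    """Rough worst-case model: assume every level has to recurse, and the
    bisection factor halves at each level with a floor of 2."""
    size, factor, level = total_rows, bisection_factor, 0
    while size >= bisection_threshold:
        size //= factor
        level += 1
        print(f"level {level}: factor={factor}, ~{size} rows per segment")
        factor = max(factor // 2, 2)

segment_sizes(25_000_000, 32, 10_000)
# level 1: factor=32, ~781250 rows per segment
# level 2: factor=16, ~48828 rows per segment
# level 3: factor=8, ~6103 rows per segment   (below the 10k threshold -> compare locally)
```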
data_diff/diff_tables.py (Outdated)
```diff
 # Compare each pair of corresponding segments between table1 and table2
 for i, (t1, t2) in enumerate(safezip(segmented1, segmented2)):
-    logger.info(". " * level + f"Diffing segment {i+1}/{len(segmented1)} of size {t1.count} and {t2.count}")
+    logger.info(". " * level + f"Diffing segment {i+1}/{len(segmented1)} keys={t1.start_key}..{t1.end_key-1}")
```
We have to remove all the counts because they're too expensive; that's why we're just writing the keyspace here. I want to log it before computing the checksum in case that's slow, but we could move it after.
```python
# Get the count in the same index pass. Much cheaper than doing it
# separately.
select = self._make_select(columns=[Count(), Checksum(self._relevant_columns)])
result = self.database.query(select, Tuple)
```
Probably needs some error handling
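Something in this direction, perhaps (just a sketch; the exception handling and the attribute assignment are assumptions, not code from this branch):

```python
try:
    result = self.database.query(select, Tuple)
except Exception as e:
    # Hypothetical handling: the combined COUNT/checksum query can time out or drop
    # the connection on remote warehouses, so report which key range failed before re-raising.
    logger.error(f"count/checksum query failed for keys {self.start_key}..{self.end_key}: {e}")
    raise
self.count, self.checksum = result  # assumed to match the column order of the SELECT above
```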
Commits 5397e6a to eed5377
Partial review.
```diff
@@ -80,6 +80,12 @@ def compile(self, parent_c: Compiler):
     if self.where:
         select += " WHERE " + " AND ".join(map(c.compile, self.where))

+    if self.where_or:
```
What's this?
Ah yeah, this is a relic left over from the first iteration, where I still found checkpoints with this method; will remove.
Feedback from Erez on Slack:
@erezsh here are fairly comprehensive benchmarks that I think pretty fairly demonstrate a whole bunch of scenarios. I can run more if you like, but I think this is good evidence that we can just maintain this one algorithm, which is a nice simplification!
To be fair between the two branches, both were tested with a single thread. I can't wait to see the speedup of this branch with multiple threads! As you can see, the gains are bigger the fewer differences there are. Surprising to me how much faster Postgres is in the presence of gaps; I think MySQL's B-Tree implementation, with the primary key being the on-disk order, is more sensitive to the gaps or something..? FYI, I just ran commands like these:
I managed to reproduce the behavior that was troubling me. Here are the benchmarks, when running on 25m records (postgres):
Just 3x slower isn't THAT bad, but this effect would become much worse with a bigint id, or a smaller threshold. Here's the script to generate the "gap3" tables:

```
const table rating_gap3 = rating
// create_indices(rating_gap1)
// create_indices(rating_gap2)
create_indices(rating_gap3)
commit()

table rating_gap3 {
    userid: int
    movieid: int
    rating: float
    timestamp: int
}

rating_gap3[id == 1000] update {id: 2147483548}
run_sql("INSERT INTO rating_gap3(id, userid, movieid, rating, timestamp) VALUES (2047483548, 1, 1, 5.0, 27)")
commit()

const table rating_gap3_update0001p = rating_gap3
rating_gap3_update0001p[random() < 0.000001] update {timestamp: timestamp + 1}
commit()
```
@erezsh putting this up for review a little early, because I think this just needs one more pass where I want to incorporate your feedback!
It's more complex than I would like :/ especially around setting these properties (count, checksum, start_key, end_key, etc.).
This is a fairly large change. I explained to you the basic idea already of splitting `min(id)...max(id)` into equal-sized segments, which I think can work well. Overall, we need to never allow any full-index or table scans in the code base; they will not scale beyond tens of millions of rows (and they perform poorly there too). I've tried to guard against any being re-introduced.

A significant change from when I showed it to you is that I no longer "probe" for checkpoints. I like not probing far better, because I don't actually need the checkpoints to exist. I suspect the simpler implementation will be more reliable, but maybe there are some cases I haven't considered?
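Roughly, the segmentation boils down to something like this (a simplified sketch, not the actual implementation in this PR; only MIN/MAX of the key are assumed to have been queried up front):

```python
def split_key_range(min_key: int, max_key: int, segments: int):
    """Split [min_key, max_key] into `segments` contiguous, roughly equal-sized key ranges,
    without ever scanning the index to count rows."""
    size = (max_key - min_key + 1) // segments
    checkpoints = [min_key + i * size for i in range(1, segments)]
    bounds = [min_key, *checkpoints, max_key + 1]
    return list(zip(bounds[:-1], bounds[1:]))  # half-open [start, end) ranges

# e.g. split_key_range(1, 100, 4) -> [(1, 26), (26, 51), (51, 76), (76, 101)]
```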
In a further iteration, we can allow setting a `--bisect-column`, which defaults to the `--key`.

With `--bisection-threshold 10000 --bisection-factor 32` on `rating <-> rating_del1` with 25m rows in Postgres:

(I will rebase commits prior to merging; keeping them raw because there's some old code that could be useful depending on review 👀)