performance: divide raw keyspace into segments, avoid full index scans #32
Conversation
```diff
@@ -160,7 +229,7 @@ class TableDiffer:
     """

     bisection_factor: int = 32  # Into how many segments to bisect per iteration
-    bisection_threshold: int = 1024**2  # When should we stop bisecting and compare locally (in row count)
+    bisection_threshold: int = 10000  # When should we stop bisecting and compare locally (in row count)
```
I think 1M is way too high when we are running this against databases far away from the machine. 10k seems more sensible. We can run some benchmarks in the future, but I think this is a safer default
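For a rough sense of scale (back-of-the-envelope arithmetic, assuming the factor stays at the default 32 on a 25m-row table like the Postgres benchmark below): the first level of bisection gives segments of roughly 780k rows, the second roughly 24k, and the third under 1k. With a 1M threshold we would stop after the first level and download ~780k rows per segment to compare locally; with 10k we keep issuing cheap checksum queries until segments are down to a few hundred rows.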
```python
# We only check beyond level > 0, because otherwise we might scan the
# entire index with COUNT(*). For large tables with billions of rows, we
# need to split the COUNT(*) by the `bisection_factor`.
if level > 0 or keyspace_size < self.bisection_threshold:
```
This should probably get cleaned up into its own function now
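Something like this, perhaps (just a sketch; the helper name and signature are placeholders, not code from this branch):

```python
def _threshold_check(self, level: int, keyspace_size: int) -> bool:
    """Placeholder helper wrapping the condition above.

    We only apply the check beyond level 0, because at level 0 it could mean
    scanning the entire index with COUNT(*); for tables with billions of rows
    the COUNT(*) needs to be split by `bisection_factor` first.
    """
    return level > 0 or keyspace_size < self.bisection_threshold
```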
```python
assert b.count == 5
differ.set_initial_start_key_and_end_key(a, b)
a.compute_checksum_and_count()
b.compute_checksum_and_count()
```
In tests this stuff is really ugly... I can create a helper function for it
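Something like this could tidy it up (a sketch; the helper name is made up, the calls are just the ones from the snippet above):

```python
def prepare_segments(differ, a, b):
    """Test helper (hypothetical): align the key range of both segments and
    compute their checksum and count in one place."""
    differ.set_initial_start_key_and_end_key(a, b)
    a.compute_checksum_and_count()
    b.compute_checksum_and_count()
    return a, b
```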
```diff
@@ -61,8 +61,12 @@ def test_basic(self):
     differ = TableDiffer(10, 100)
```
Need to add some more edge-cases here!
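For instance, something along these lines (purely illustrative; the fixtures and the `diff_tables` entry point are assumptions, not tests from this PR):

```python
def test_empty_tables(self):
    # Both sides empty: expect no differences and no expensive scans.
    differ = TableDiffer(10, 100)
    self.assertEqual(list(differ.diff_tables(self.empty_a, self.empty_b)), [])

def test_single_update(self):
    # Keyspace below the bisection threshold: should fall straight through to a local compare.
    differ = TableDiffer(10, 100)
    diff = list(differ.diff_tables(self.table_a, self.table_a_one_update))
    # Assuming an updated row is reported as one removed ('-') and one added ('+') entry:
    self.assertEqual(len(diff), 2)
```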
data_diff/diff_tables.py (Outdated)
```diff
 if t1.checksum != t2.checksum:
     # Apply recursively
-    yield from self._diff_tables(t1, t2, level + 1)
+    yield from self._diff_tables(t1, t2, max(int(bisection_factor/2), 2), level + 1)
```
I think it makes more sense to divide this so that you set the bisection factor based on how you want to divide the first level. This means you'll always process roughly the same amount of rows in each query. This is how I would've expected it to behave by default, but others may have other expectations
Since we're always at least halving the space, it's still logarithmic.
I don't feel super strongly before we show this in real results, but I think this might be a little more performant in the average case where you have large network roundtrips, and especially on Cloud Warehouses with high latency. If there's a lot of changes, it seems to perform roughly the same on update001 locally.
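To make the "still logarithmic" point concrete, here's a tiny back-of-the-envelope model (not code from this branch) of how per-segment size shrinks when the factor is halved at each level, with a floor of 2:

```python
def segment_sizes(total_rows, bisection_factor, bisection_threshold):
    """Rough worst-case model: assume every level has to recurse, and the
    bisection factor halves at each level with a floor of 2."""
    size, factor, level = total_rows, bisection_factor, 0
    while size >= bisection_threshold:
        size //= factor
        level += 1
        print(f"level {level}: factor={factor}, ~{size} rows per segment")
        factor = max(factor // 2, 2)

segment_sizes(25_000_000, 32, 10_000)
# level 1: factor=32, ~781250 rows per segment
# level 2: factor=16, ~48828 rows per segment
# level 3: factor=8, ~6103 rows per segment   (below the 10k threshold -> compare locally)
```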
data_diff/diff_tables.py (Outdated)
```diff
 # Compare each pair of corresponding segments between table1 and table2
 for i, (t1, t2) in enumerate(safezip(segmented1, segmented2)):
-    logger.info(". " * level + f"Diffing segment {i+1}/{len(segmented1)} of size {t1.count} and {t2.count}")
+    logger.info(". " * level + f"Diffing segment {i+1}/{len(segmented1)} keys={t1.start_key}..{t1.end_key-1}")
```
We have to remove all the counts because they're too expensive; that's why we're just writing the keyspace here. I want to log it before computing the checksum in case that's slow, but we could move it after.
```python
# Get the count in the same index pass. Much cheaper than doing it
# separately.
select = self._make_select(columns=[Count(), Checksum(self._relevant_columns)])
result = self.database.query(select, Tuple)
```
Probably needs some error handling
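Something in this direction, perhaps (just a sketch; the exception handling and the attribute assignment are assumptions, not code from this branch):

```python
try:
    result = self.database.query(select, Tuple)
except Exception as e:
    # Hypothetical handling: the combined COUNT/checksum query can time out or drop
    # the connection on remote warehouses, so report which key range failed before re-raising.
    logger.error(f"count/checksum query failed for keys {self.start_key}..{self.end_key}: {e}")
    raise
self.count, self.checksum = result  # assumed to match the column order of the SELECT above
```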
Commits 5397e6a to eed5377
Partial review.
```diff
@@ -80,6 +80,12 @@ def compile(self, parent_c: Compiler):
     if self.where:
         select += " WHERE " + " AND ".join(map(c.compile, self.where))

+    if self.where_or:
```
What's this?
Ah yeah, this is a relic left over from the first iteration, where I still found checkpoints with this method; will remove.
Feedback from Erez on Slack:
@erezsh here are fairly comprehensive benchmarks that I think pretty fairly demonstrate a whole bunch of scenarios. I can run more if you like, but I think this is good evidence that we can just maintain this one algorithm, which is a nice simplification!
To be fair between the two branches, both were tested with a single thread. I can't wait to see the speedup of this branch with multiple threads! As you can see, the gains are bigger the fewer differences there are. Surprising to me how much faster Postgres is in the presence of gaps; I think MySQL's B-Tree implementation, with the primary key being the on-disk order, is more sensitive to the gaps or something..? FYI, I just ran commands like these:
I managed to reproduce the behavior that was troubling me. Here are the benchmarks, when running on 25m records (postgres):
Just 3x slower isn't THAT bad, but this effect would become much worse with a bigint id, or a smaller threshold. Here's the script to generate the "gap3" tables:

```
const table rating_gap3 = rating
// create_indices(rating_gap1)
// create_indices(rating_gap2)
create_indices(rating_gap3)
commit()

table rating_gap3 {
    userid: int
    movieid: int
    rating: float
    timestamp: int
}

rating_gap3[id == 1000] update {id: 2147483548}
run_sql("INSERT INTO rating_gap3(id, userid, movieid, rating, timestamp) VALUES (2047483548, 1, 1, 5.0, 27)")
commit()

const table rating_gap3_update0001p = rating_gap3
rating_gap3_update0001p[random() < 0.000001] update {timestamp: timestamp + 1}
commit()
```
@erezsh putting this up for review a little early, because I think this just needs one more pass where I want to incorporate your feedback!
It's more complex than I would like :/ especially around setting these properties (count, checksum, start_key, end_key, etc.).
This is a fairly large change. I explained to you the basic idea already of splitting `min(id)...max(id)` into equal-sized segments, which I think can work well. Overall, we need to never allow any full-index or table scans in the code base; they will not scale beyond tens of millions of rows (and they perform poorly there too). I've tried to guard against any being re-introduced.

A significant change from when I showed it to you is that I no longer "probe" for checkpoints. I like not probing far better, because I don't actually need the checkpoints to exist. I suspect the simpler implementation will be more reliable, but maybe there are some cases I haven't considered?
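Roughly, the segmentation boils down to something like this (a simplified sketch, not the actual implementation in this PR; only MIN/MAX of the key are assumed to have been queried up front):

```python
def split_key_range(min_key: int, max_key: int, segments: int):
    """Split [min_key, max_key] into `segments` contiguous, roughly equal-sized key ranges,
    without ever scanning the index to count rows."""
    size = (max_key - min_key + 1) // segments
    checkpoints = [min_key + i * size for i in range(1, segments)]
    bounds = [min_key, *checkpoints, max_key + 1]
    return list(zip(bounds[:-1], bounds[1:]))  # half-open [start, end) ranges

# e.g. split_key_range(1, 100, 4) -> [(1, 26), (26, 51), (51, 76), (76, 101)]
```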
In a further iteration, we can allow setting a `--bisect-column`, which defaults to the `--key`.

With `--bisection-threshold 10000 --bisection-factor 32` on `rating <-> rating_del1` with 25m rows in Postgres:

(I will rebase commits prior to merging; keeping them raw because there's some old code that could be useful depending on review 👀)