release-25.3: execbuilder: add "average lookup ratio" parallelization heuristic for lookup joins #150016

yuzefovich · 2025-07-11T21:11:15Z

Backport 3/3 commits from #148186 on behalf of @yuzefovich.

sql: use correct context in recursive CTE iterations

Previously, the execbuilder.Builder that we use in recursive CTE
iterations referenced the same context that the Builder captured on the
main query path. This is problematic since that context might have
a tracing span that's already been finished. This commit fixes this
issue by explicitly passing the context argument into the iteration
function. This was only exposed because we added some logging in the
following commit but has been present for a while now.

logictest: extend a few EXPLAIN tests a bit

This commit adjusts a few existing EXPLAIN tests to use VERBOSE as well as
adds another RBR setup with CASCADE FKs. These will be used to highlight
the parallelization change in the multi-key lookup joins in the
following commit.

execbuilder: add "average lookup ratio" parallelization heuristic for lookup joins

The DistSender API forces its users to make a choice between cross-range
parallelism (which is needed for performance) and setting memory limits
(which is needed for stability). Streamer was introduced to address this
limitation, but it comes with some requirements, one of which is that it
needs to have access to LeafTxns. However, mutation statements must run
in the RootTxn, so we never use the Streamer for those and fall back to
using the DistSender API directly (via txnKVFetcher). There, we
currently have the following heuristic:

if we know that each input row results in at most one lookup row, then
we consider such a lookup to be "safe" parallelization, so we disable
usage of memory limits on the BatchHeader. This is the case for index
joins (when we always expect to get exactly one looked up row) as well
as lookup joins that have "equality columns are key" property (when we
expect at most one looked up row).
otherwise, if we have a multi-key lookup join, we use the default
fetcher memory limits (TargetBytes of 10MiB), which disables cross-range
parallelism.

Most commonly this will affect mutation statements and will have a more
pronouanced effect on the multi-region tables, so this commit extends the
heuristic for when we consider it to be "safe" for parallelization.

Namely, we now calculate the average lookup ratio based on the lookup
equality columns and the available table / column statistics, and if the
ratio doesn't exceed the allowed limit, then we'll enable the
parallelism. To a certain degree, this heuristic resembles the "equality
columns are key" heuristic that we already utilize, but instead of
a guaranteed maximum on the lookup ratio we use the estimated average.

What we're trying to prevent with the existing and the new heuristics is
the case when we construct such a KV batch that the KV response will
overwhelm (read "will OOM") the node issuing the KV batch. In the
existing heuristic we say that "if lookup ratio is guaranteed to not
exceed one, then it should be safe".

I believe that the new heuristic should be safe in practice for most
deployments due to the following reasons:

we already have an implicit limiting behavior in the join reader due
to its execution model (it first buffers some number of rows, up to
2MiB in size when not using the streamer, deduplicates the lookup spans,
and performs the lookup of all those spans in a single KV batch).
Empirical testing shows that we expect to have at most 25k lookups in
that single KV batch.
this will have impact only when the streamer is not used, which most
commonly will mean we're executing a mutation, and in our docs we
advocate for not performing large mutations. (I'm stretching things
a bit here since even if we modify small amount of data, to compute that
we might read a lot, which could be destabilizing if we disable KV
limits. Yet a similar argument could be made that our current
"equality columns are key" heuristic is not safe - it's possible to
construct a scenario where we look up large amounts of data.)

In order to prevent this new heuristic from exploding in some edge
cases, two guardrails are added:

in order to handle a scenario where the lookup ratio is not evenly
distributed (i.e. different input rows can result in vastly different
number of looked up rows), we'll disable the heuristic if the max lookup
ratio exceeds the allowed limit.
in order to handle a scenario where looked rows are very large, we'll
disable the heuristic if the estimated average lookup row size exceeds
the allowed limit. (Note that we don't have this kind of protection in
the existing heuristics.)

I plan to do some more empirical runs to fine-tune the default values of
the newly added session variables, but the current defaults are:

parallelize_multi_key_lookup_joins_avg_lookup_ratio = 10
parallelize_multi_key_lookup_joins_max_lookup_ratio = 10000
parallelize_multi_key_lookup_joins_avg_lookup_row_size = 100 KiB.

In order to de-risk rollout of this feature, we will initially apply the
new heuristic only to mutations of multi-region tables. New session
variable parallelize_multi_key_lookup_joins_only_on_mr_mutations can
be set to false to apply the heuristic to all statements, regardless
of the table being multi-region.

Fixes: #134351.
Epic: CRDB-44104

Release note (performance improvement): Mutation statements (UPDATEs and
DELETEs) that perform lookup joins into multi-region tables (perhaps as
part of a CASCADE) are now more likely to parallelize the lookups across
ranges which improves their performance.

Release justification: performance improvement for innovation release with limited initial rollout.

… lookup joins The DistSender API forces its users to make a choice between cross-range parallelism (which is needed for performance) and setting memory limits (which is needed for stability). Streamer was introduced to address this limitation, but it comes with some requirements, one of which is that it needs to have access to LeafTxns. However, mutation statements must run in the RootTxn, so we never use the Streamer for those and fall back to using the DistSender API directly (via `txnKVFetcher`). There, we currently have the following heuristic: - if we know that each input row results in at most one lookup row, then we consider such a lookup to be "safe" parallelization, so we disable usage of memory limits on the BatchHeader. This is the case for index joins (when we always expect to get exactly one looked up row) as well as lookup joins that have "equality columns are key" property (when we expect at most one looked up row). - otherwise, if we have a multi-key lookup join, we use the default fetcher memory limits (TargetBytes of 10MiB), which disables cross-range parallelism. Most commonly this will affect mutation statements and will have a more pronouanced effect on the multi-region tables, so this commit extends the heuristic for when we consider it to be "safe" for parallelization. Namely, we now calculate the average lookup ratio based on the lookup equality columns and the available table / column statistics, and if the ratio doesn't exceed the allowed limit, then we'll enable the parallelism. To a certain degree, this heuristic resembles the "equality columns are key" heuristic that we already utilize, but instead of a guaranteed maximum on the lookup ratio we use the estimated average. What we're trying to prevent with the existing and the new heuristics is the case when we construct such a KV batch that the KV response will overwhelm (read "will OOM") the node issuing the KV batch. In the existing heuristic we say that "if lookup ratio is guaranteed to not exceed one, then it should be safe". I believe that the new heuristic should be safe in practice for most deployments due to the following reasons: - we already have an implicit limiting behavior in the join reader due to its execution model (it first buffers some number of rows, up to 2MiB in size when not using the streamer, deduplicates the lookup spans, and performs the lookup of all those spans in a single KV batch). Empirical testing shows that we expect to have at most 25k lookups in that single KV batch. - this will have impact only when the streamer is not used, which most commonly will mean we're executing a mutation, and in our docs we advocate for not performing large mutations. (I'm stretching things a bit here since even if we modify small amount of data, to compute that we might read a lot, which could be destabilizing if we disable KV limits. Yet a similar argument could be made that our current "equality columns are key" heuristic is not safe - it's possible to construct a scenario where we look up large amounts of data.) In order to prevent this new heuristic from exploding in some edge cases, two guardrails are added: - in order to handle a scenario where the lookup ratio is not evenly distributed (i.e. different input rows can result in vastly different number of looked up rows), we'll disable the heuristic if the max lookup ratio exceeds the allowed limit. - in order to handle a scenario where looked rows are very large, we'll disable the heuristic if the estimated average lookup row size exceeds the allowed limit. (Note that we don't have this kind of protection in the existing heuristics.) I plan to do some more empirical runs to fine-tune the default values of the newly added session variables, but the current defaults are: - `parallelize_multi_key_lookup_joins_avg_lookup_ratio = 10` - `parallelize_multi_key_lookup_joins_max_lookup_ratio = 10000` - `parallelize_multi_key_lookup_joins_avg_lookup_row_size = 100 KiB`. In order to de-risk rollout of this feature, we will initially apply the new heuristic only to mutations of multi-region tables. New session variable `parallelize_multi_key_lookup_joins_only_on_mr_mutations` can be set to `false` to apply the heuristic to all statements, regardless of the table being multi-region. Release note (performance improvement): Mutation statements (UPDATEs and DELETEs) that perform lookup joins into multi-region tables (perhaps as part of a CASCADE) are now more likely to parallelize the lookups across ranges which improves their performance.

Previously, the `execbuilder.Builder` that we use in recursive CTE iterations referenced the same context that the Builder captured on the main query path. This is problematic since that context might have a tracing span that's already been finished. This commit fixes this issue by explicitly passing the context argument into the iteration function. This was only exposed because we added some logging in the following commit but has been present for a while now. Release note: None

This commit adjusts a few existing EXPLAIN tests to use VERBOSE as well as adds another RBR setup with CASCADE FKs. These will be used to highlight the parallelization change in the multi-key lookup joins in the following commit. Release note: None

blathers-crl · 2025-07-11T21:11:19Z

Thanks for opening a backport.

Before merging, please confirm that the change does not break backwards compatibility and otherwise complies with the backport policy. Include a brief release justification in the PR description explaining why the backport is appropriate. All backports must be reviewed by the TL for the owning area. While the stricter LTS policy does not yet apply, please exercise judgment and consider gating non-critical changes behind a disabled-by-default feature flag when appropriate.

cockroach-teamcity · 2025-07-11T21:11:32Z

This change is

michae2

LGTM

yuzefovich added 3 commits June 28, 2025 02:20

yuzefovich requested a review from a team as a code owner July 11, 2025 21:11

yuzefovich requested review from DrewKimball and removed request for a team July 11, 2025 21:11

blathers-crl bot added blathers-backport This is a backport that Blathers created automatically. O-robot Originated from a bot. labels Jul 11, 2025

blathers-crl bot requested a review from michae2 July 11, 2025 21:11

yuzefovich added the blathers-backport This is a backport that Blathers created automatically. label Jul 11, 2025

blathers-crl bot added the backport Label PR's that are backports to older release branches label Jul 11, 2025

michae2 approved these changes Jul 11, 2025

View reviewed changes

yuzefovich merged commit de177de into cockroachdb:release-25.3 Jul 13, 2025
24 of 25 checks passed

celeste-cockroachdb bot added the target-release-25.3.0 label Jul 13, 2025

yuzefovich deleted the blathers/backport-release-25.3-148186 branch July 13, 2025 23:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

release-25.3: execbuilder: add "average lookup ratio" parallelization heuristic for lookup joins #150016

release-25.3: execbuilder: add "average lookup ratio" parallelization heuristic for lookup joins #150016

Uh oh!

yuzefovich commented Jul 11, 2025 •

edited

Loading

Uh oh!

blathers-crl bot commented Jul 11, 2025

Uh oh!

cockroach-teamcity commented Jul 11, 2025

Uh oh!

michae2 left a comment

Uh oh!

Uh oh!

Uh oh!

release-25.3: execbuilder: add "average lookup ratio" parallelization heuristic for lookup joins #150016

release-25.3: execbuilder: add "average lookup ratio" parallelization heuristic for lookup joins #150016

Uh oh!

Conversation

yuzefovich commented Jul 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

blathers-crl bot commented Jul 11, 2025

Uh oh!

cockroach-teamcity commented Jul 11, 2025

Uh oh!

michae2 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

yuzefovich commented Jul 11, 2025 •

edited

Loading