enable dynamic filtering for file hash partitioned data #20217
Draft
gene-bordegaray wants to merge 2 commits into apache:main from
Which issue does this PR close?
Rationale for this change
Dynamic filters were disabled on partitioned hash joins when preserve_file_partitions was true, due to #20176. Refer to issue #20195 for more detail.
What changes are included in this PR?
The approach takes the following steps:
- RepartitionExec with Hash partitioning in their plan tree.
- RepartitionExec on both sides, we cannot guarantee that hash(key) % N routing will correctly map rows to partitions. In this case, use a "global OR" approach: combine all per-partition filters with OR into a single expression. This is safe for any partitioning scheme and still gives effective row-group pruning.
- RepartitionExec(Hash), the data is truly hash-distributed and the original CASE routing (hash(key) % N) is used for optimal per-partition filtering.
Note: The global OR approach is not tied to the preserve_file_partitions config flag. It applies any time there's no RepartitionExec between the DataSourceExec and the join.
Alternative Approach - Partition Index Routing
An alternative approach would be to introduce a PartitionIndex physical expression that routes probe-side rows to the correct partition's filter based on a partition mapping (partition column value -> partition index). This would look like:
CASE partition_index(partition_col) WHEN 0 THEN filter_0 WHEN 1 THEN filter_1 ... END
Why I didn't choose it:
- PruningPredicate uses the OR branches independently. This means something like filter_0 OR filter_1 OR filter_2 achieves the same row-group pruning as the partition-indexed CASE expression. They both evaluate each filter against the row-group statistics, meaning no perf benefit.
Why we may want to consider it:
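To make the comparison between hash-based CASE routing and the global OR filter concrete, here is a minimal, self-contained sketch. It is not DataFusion code: `Bounds`, `toy_hash`, `case_routing`, `global_or`, and `keep_row_group` are hypothetical names, a per-partition dynamic filter is assumed to be a simple min/max range, and the hash function is a toy stand-in for whatever the engine actually uses.

```rust
/// Hypothetical stand-in for one build-side partition's dynamic filter:
/// an inclusive min/max range over the join key.
#[derive(Clone, Copy, Debug)]
struct Bounds {
    min: i64,
    max: i64,
}

impl Bounds {
    /// Row-level check: does this key fall inside the range?
    fn contains(&self, key: i64) -> bool {
        self.min <= key && key <= self.max
    }

    /// Statistics-level check: could a row group whose key column spans
    /// [rg_min, rg_max] contain any key inside this range?
    fn may_overlap(&self, rg_min: i64, rg_max: i64) -> bool {
        self.min <= rg_max && rg_min <= self.max
    }
}

/// Toy hash (an assumption for illustration only).
fn toy_hash(key: i64) -> u64 {
    (key as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15)
}

/// CASE routing: pick exactly one filter by hash(key) % N.
/// Only correct when the data really is hash-distributed.
/// Assumes `filters` is non-empty.
fn case_routing(filters: &[Bounds], key: i64) -> bool {
    let idx = (toy_hash(key) % filters.len() as u64) as usize;
    filters[idx].contains(key)
}

/// Global OR: a row passes if any partition's filter accepts it.
/// Does not depend on how rows were assigned to partitions.
fn global_or(filters: &[Bounds], key: i64) -> bool {
    filters.iter().any(|f| f.contains(key))
}

/// Row-group pruning for the OR form: keep the row group if any branch
/// may match its statistics. Each branch is checked independently.
fn keep_row_group(filters: &[Bounds], rg_min: i64, rg_max: i64) -> bool {
    filters.iter().any(|f| f.may_overlap(rg_min, rg_max))
}
```

In this sketch, take two file partitions split by key range, with filters [0, 99] and [100, 199]. For key 1, `case_routing` selects the second filter (the toy hash of an odd key is odd) and rejects a row that does have a build-side match, while `global_or` accepts it. At the statistics level, `keep_row_group` checks every branch independently, which is the behavior the description attributes to pruning with the OR form.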
Are these changes tested?
Are there any user-facing changes?
No