[SPARK-35703][SQL] Relax constraint for bucket join and remove HashClusteredDistribution
### What changes were proposed in this pull request?
This PR proposes the following:
1. Introducing a new trait `ShuffleSpec`, which is used in `EnsureRequirements` when a node has more than one child. It serves two purposes: 1) comparing all children to check whether they are compatible w.r.t. partitioning and distribution, and 2) creating a new partitioning to re-shuffle the other side in case they are not compatible.
2. Remove `HashClusteredDistribution` and replace its usages with `ClusteredDistribution`.
Under the new mechanism, when `EnsureRequirements` checks whether shuffles are required for a plan node with more than one child, it does the following:
1. check each child of the node and see if it can satisfy the corresponding required distribution.
2. check if all children of the node are compatible with each other w.r.t their partitioning and distribution
3. if step 2 fails, choose the best shuffle spec (in terms of shuffle parallelism) and use it to repartition the other children, so that they all have compatible partitioning
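The three steps above can be sketched as follows. This is a hypothetical, heavily simplified model (not the actual `EnsureRequirements` code): each child is summarized by whether it satisfies its required distribution and by its `numPartitions`, which stands in for a full shuffle spec.

```scala
// Returns, for each child, whether it must be re-shuffled.
// `numParts(i)` and `satisfies(i)` describe the i-th child.
def childrenToShuffle(numParts: Seq[Int], satisfies: Seq[Boolean]): Seq[Boolean] = {
  // Steps 1 and 2: no shuffle is needed only if every child satisfies its
  // required distribution AND all children agree on partitioning
  // (modeled here as having the same number of partitions).
  if (satisfies.forall(identity) && numParts.distinct.size == 1) {
    numParts.map(_ => false)
  } else {
    // Step 3: pick the spec with the highest parallelism as the target and
    // re-shuffle every child that does not already match it.
    val best = numParts.max
    numParts.zip(satisfies).map { case (n, ok) => !ok || n != best }
  }
}
```

For instance, `childrenToShuffle(Seq(8, 4), Seq(true, true))` re-shuffles only the second child, since the first already has the higher parallelism.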
### Why are the changes needed?
Spark currently only allows bucket join when the set of cluster keys from the output partitioning _exactly matches_ the set of join keys from the required distribution. For instance, in the following:
```sql
SELECT * FROM A JOIN B ON A.c1 = B.c1 AND A.c2 = B.c2
```
bucket join will only be triggered if both `A` and `B` are bucketed on columns `c1` and `c2`, in which case Spark will avoid shuffling both sides of the join.
The above requirement, however, is too strict: the shuffle can also be avoided if both `A` and `B` are bucketed on just column `c1` (or just `c2`). That is, if all rows that have the same value in column `c1` are clustered into the same partition, then all rows that have the same values in columns `c1` and `c2` are also clustered into the same partition.
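The implication above can be checked with a tiny illustration (all names here are made up for the example): partition rows by `c1` alone and verify that rows agreeing on `(c1, c2)` always land in the same partition.

```scala
case class Row(c1: Int, c2: Int)

val rows = Seq(Row(1, 10), Row(1, 10), Row(1, 20), Row(2, 10))
val numPartitions = 4

// Hash-partition on c1 only, as if the table were bucketed on c1.
def part(r: Row): Int = Math.floorMod(r.c1.hashCode, numPartitions)

// Every group of rows sharing (c1, c2) falls into a single partition,
// because the partition is a function of c1 alone.
val colocated = rows.groupBy(r => (r.c1, r.c2)).values
  .forall(group => group.map(part).distinct.size == 1)
```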
In order to allow this, we'll need to change the logic of deciding whether the two sides of a join operator are "co-partitioned". Currently, this is done by checking each side's output partitioning against its required distribution separately, using the `Partitioning.satisfies` method. Since `HashClusteredDistribution` requires a `HashPartitioning` to exactly match its cluster keys, the check can be done in isolation, without looking at the other side's output partitioning and required distribution.
However, the approach is no longer valid if we are going to relax the above constraint, as we need to compare the output partitioning and required distribution **on both sides**. For instance, in the above example, if `A` is bucketed on `c1` while `B` is bucketed on `c2`, we may need to do the following check:
1. identify where `A.c1` and `B.c2` are used in the join keys (e.g., positions 0 and 1 respectively)
2. check if the positions derived from both sides exactly match each other (this becomes more complicated if a key appears in multiple positions within the join keys.)
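The position check above can be sketched as follows (a hypothetical helper, not the actual Spark implementation): for each bucket key, collect every position where it appears among the join keys, then require the position sets from both sides to match pairwise.

```scala
// For each bucket key, the set of positions where it occurs in the join keys.
// Using a Set handles a key that appears in multiple join-key positions.
def keyPositions(joinKeys: Seq[String], bucketKeys: Seq[String]): Seq[Set[Int]] =
  bucketKeys.map(k => joinKeys.zipWithIndex.collect { case (`k`, i) => i }.toSet)

val leftJoinKeys  = Seq("c1", "c2")   // A.c1, A.c2
val rightJoinKeys = Seq("c1", "c2")   // B.c1, B.c2

// A bucketed on c1, B bucketed on c2: positions {0} vs {1} -> not co-partitioned.
val mismatch = keyPositions(leftJoinKeys, Seq("c1")) == keyPositions(rightJoinKeys, Seq("c2"))

// Both bucketed on c1: positions {0} on each side -> co-partitioned.
val matched = keyPositions(leftJoinKeys, Seq("c1")) == keyPositions(rightJoinKeys, Seq("c1"))
```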
In order to achieve the above, this PR proposes the following:
```scala
trait ShuffleSpec {
  // Used as a cost indicator when shuffling children.
  def numPartitions: Int

  // Checks whether this spec is compatible with `other`.
  def isCompatibleWith(other: ShuffleSpec): Boolean

  // Creates a new partitioning for the given `distribution` in case
  // `isCompatibleWith` returned false.
  def createPartitioning(distribution: Distribution): Partitioning
}
```
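To make the API concrete, here is a toy hash-based spec against minimal stand-in types. The real `Distribution` and `Partitioning` live in Catalyst's physical-plan package; everything beyond the trait's three members is hypothetical, and the compatibility rule is deliberately coarse.

```scala
// Minimal stand-ins so the sketch is self-contained.
case class Distribution(clustering: Seq[String])
case class Partitioning(expressions: Seq[String], numPartitions: Int)

trait ShuffleSpec {
  def numPartitions: Int
  def isCompatibleWith(other: ShuffleSpec): Boolean
  def createPartitioning(distribution: Distribution): Partitioning
}

// Toy spec: compatible when both sides hash the same number of keys into
// the same number of partitions (the real check is far more involved).
case class HashShuffleSpec(keys: Seq[String], numPartitions: Int) extends ShuffleSpec {
  def isCompatibleWith(other: ShuffleSpec): Boolean = other match {
    case HashShuffleSpec(otherKeys, n) =>
      n == numPartitions && otherKeys.length == keys.length
    case _ => false
  }

  // Re-shuffle the other side by its clustering keys, reusing this
  // spec's parallelism so both sides end up aligned.
  def createPartitioning(distribution: Distribution): Partitioning =
    Partitioning(distribution.clustering, numPartitions)
}
```

With this shape, `EnsureRequirements` can first probe `isCompatibleWith` across children, and only fall back to `createPartitioning` for the sides that need a re-shuffle.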
A similar API is also required if we are going to support DSv2 `DataSourcePartitioning` as output partitioning in the bucket join scenario, or support custom hash functions such as `HiveHash` for bucketing. With the former, even if both `A` and `B` are partitioned on columns `c1` and `c2` in the above example, they could be partitioned via different transform expressions, e.g., `A` on `(bucket(32, c1), day(c2))` while `B` on `(bucket(32, c1), hour(c2))`. This means we'll need to compare the partitionings from both sides of the join, which makes the current approach with `Partitioning.satisfies` insufficient. The same `isCompatibleWith` API can potentially be reused for this purpose.
### Does this PR introduce _any_ user-facing change?
Yes, now bucket join will be enabled for more cases as mentioned above.
### How was this patch tested?
1. Added a new test suite `ShuffleSpecSuite`
2. Added additional tests in `EnsureRequirementsSuite`.
Closes #32875 from sunchao/SPARK-35703.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>