Commit 0a990a4

Storage-Partitioned Joins and KeyGroupedPartitioning
1 parent 6409885 commit 0a990a4

5 files changed: +36 -4 lines changed

+15
@@ -0,0 +1,15 @@
+# KeyGroupedPartitioning
+
+`KeyGroupedPartitioning` is a [Partitioning](Partitioning.md) where rows are split across partitions based on the [partition transform expressions](#keys).
+
+`KeyGroupedPartitioning` is a key part of [Storage-Partitioned Joins](../storage-partitioned-joins/index.md).
+
+!!! note
+    Not used in any of the [built-in Spark SQL connectors](../connectors/index.md) yet.
+
+## Creating Instance
+
+`KeyGroupedPartitioning` takes the following to be created:
+
+* <span id="keys"> Partition transform [expression](../expressions/Expression.md)s
+* <span id="numPartitions"> Number of partitions
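
To make "rows are split across partitions based on the partition transform expressions" concrete, here is a minimal plain-Java sketch. It is a toy model, not Spark code: `KeyGroupedSketch`, `groupByKey`, and the modulo transform are hypothetical names chosen for illustration. The only idea taken from the page above is that a row's partition is decided by the value of a key transform.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.function.Function;

// Toy model only: this is NOT the Spark KeyGroupedPartitioning class.
class KeyGroupedSketch {

    // Places every row into the partition identified by its transformed key,
    // so rows that share a key value always end up in the same partition.
    static <R, K extends Comparable<K>> SortedMap<K, List<R>> groupByKey(
            List<R> rows, Function<R, K> transform) {
        SortedMap<K, List<R>> partitions = new TreeMap<>();
        for (R row : rows) {
            partitions.computeIfAbsent(transform.apply(row), k -> new ArrayList<>()).add(row);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Hypothetical transform: group integer ids by their remainder modulo 3.
        List<Integer> ids = List.of(1, 2, 3, 4, 5, 6);
        SortedMap<Integer, List<Integer>> partitions = groupByKey(ids, id -> id % 3);
        System.out.println(partitions); // {0=[3, 6], 1=[1, 4], 2=[2, 5]}
    }
}
```

In this model the number of partitions is simply the number of distinct key values, which loosely mirrors the two constructor arguments listed above (the keys and the number of partitions).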

docs/connector/Partitioning.md

+4 -4
@@ -4,14 +4,14 @@ title: Partitioning
 
 # Partitioning
 
-`Partitioning` is an [abstraction](#contract) of [output data partitioning requirements](#implementations) (_data distribution_) of a Spark SQL connector.
+`Partitioning` is an [abstraction](#contract) of [output data partitioning requirements](#implementations) (_data distribution_) of a [Spark SQL connector](index.md).
 
 !!! note
     This `Partitioning` interface for Spark SQL developers mimics the internal Catalyst [Partitioning](../physical-operators/Partitioning.md) that is converted into with the help of [DataSourcePartitioning](../physical-operators/Partitioning.md#DataSourcePartitioning).
 
 ## Contract
 
-### <span id="numPartitions"> Number of Partitions
+### Number of Partitions { #numPartitions }
 
 ```java
 int numPartitions()
@@ -21,7 +21,7 @@ Used when:
 
 * [DataSourcePartitioning](../physical-operators/Partitioning.md#DataSourcePartitioning) is requested for the [number of partitions](../physical-operators/Partitioning.md#numPartitions)
 
-### <span id="satisfy"> Satisfying Distribution
+### Satisfying Distribution { #satisfy }
 
 ```java
 boolean satisfy(
@@ -34,5 +34,5 @@ Used when:
 
 ## Implementations
 
-* `KeyGroupedPartitioning`
+* [KeyGroupedPartitioning](KeyGroupedPartitioning.md)
 * `UnknownPartitioning`

docs/storage-partitioned-joins/.pages

+4
@@ -0,0 +1,4 @@
+title: Storage-Partitioned Joins
+nav:
+- index.md
+- ...
+12
@@ -0,0 +1,12 @@
+# Storage-Partitioned Joins
+
+**Storage-Partitioned Joins** (_SPJ_) are a new type of [join](../joins.md) in Spark SQL that use the existing storage layout for a partitioned join to avoid expensive shuffles (similarly to [Bucketing](../bucketing/index.md)).
+
+!!! note
+    Storage-Partitioned Joins feature was added in Apache Spark 3.3.0 ([\[SPARK-37375\] Umbrella: Storage Partitioned Join (SPJ)]({{ spark.jira }}/SPARK-37375)).
+
+Storage-Partitioned Join is meant mainly, if not exclusively, for [Spark SQL connectors](../connector/index.md) (_v2 data sources_).
+
+Storage-Partitioned Join was proposed in this [SPIP](https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE).
+
+Storage-Partitioned Join uses [KeyGroupedPartitioning](../connector/KeyGroupedPartitioning.md) to determine partitions.
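
The claim that reusing the storage layout avoids shuffles can be illustrated with a small plain-Java sketch. It is a toy model under stated assumptions, not Spark's join implementation: `PartitionWiseJoinSketch`, `join`, and the sample data are hypothetical. It assumes both inputs are already grouped by the same integer key, so each partition on one side only needs to meet the partition with the same key on the other side.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: this is NOT how Spark implements Storage-Partitioned Joins.
class PartitionWiseJoinSketch {

    // Inner-joins two inputs that are already grouped by the same key.
    // Each left partition is matched only with the right partition that has
    // the same key, so no rows have to be redistributed (shuffled) first.
    static List<String> join(Map<Integer, List<String>> left,
                             Map<Integer, List<String>> right) {
        List<String> joined = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> partition : left.entrySet()) {
            List<String> matches = right.get(partition.getKey());
            if (matches == null) {
                continue; // no partition with this key on the right side
            }
            for (String l : partition.getValue()) {
                for (String r : matches) {
                    joined.add(partition.getKey() + ":" + l + "-" + r);
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Hypothetical data: both sides are keyed by a customer id.
        Map<Integer, List<String>> orders = new TreeMap<>();
        orders.put(1, List.of("o1"));
        orders.put(2, List.of("o2", "o3"));
        Map<Integer, List<String>> customers = new TreeMap<>();
        customers.put(2, List.of("alice"));
        customers.put(3, List.of("bob"));
        System.out.println(join(orders, customers)); // [2:o2-alice, 2:o3-alice]
    }
}
```

The sketch never moves a row between partitions, which is the property the page attributes to Storage-Partitioned Joins. A join whose inputs are not grouped by a common key would first have to regroup at least one side.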

mkdocs.yml

+1
@@ -163,6 +163,7 @@ nav:
 - ... | bloom-filter-join/**.md
 - ... | bucketing/**.md
 - ... | cache-serialization/**.md
+- ... | storage-partitioned-joins/**.md
 - Catalog Plugin API:
 - connector/catalog/index.md
 - CatalogExtension: connector/catalog/CatalogExtension.md
