Storage-Partitioned Joins and KeyGroupedPartitioning

jaceklaskowski · jaceklaskowski · commit 0a990a43e625 · 2024-05-26T11:03:41.000+02:00
diff --git a/docs/connector/KeyGroupedPartitioning.md b/docs/connector/KeyGroupedPartitioning.md
@@ -0,0 +1,15 @@
+# KeyGroupedPartitioning
+
+`KeyGroupedPartitioning` is a [Partitioning](Partitioning.md) where rows are split across partitions based on the [partition transform expressions](#keys).
+
+`KeyGroupedPartitioning` is a key part of [Storage-Partitioned Joins](../storage-partitioned-joins/index.md).
+
+!!! note
+    Not used in any of the [built-in Spark SQL connectors](../connectors/index.md) yet.
+
+## Creating Instance
+
+`KeyGroupedPartitioning` takes the following to be created:
+
+* <span id="keys"> Partition transform [expression](../expressions/Expression.md)s
+* <span id="numPartitions"> Number of partitions
diff --git a/docs/connector/Partitioning.md b/docs/connector/Partitioning.md
@@ -4,14 +4,14 @@ title: Partitioning
 
 # Partitioning
 
-`Partitioning` is an [abstraction](#contract) of [output data partitioning requirements](#implementations) (_data distribution_) of a Spark SQL connector.
+`Partitioning` is an [abstraction](#contract) of [output data partitioning requirements](#implementations) (_data distribution_) of a [Spark SQL connector](index.md).
 
 !!! note
     This `Partitioning` interface for Spark SQL developers mimics the internal Catalyst [Partitioning](../physical-operators/Partitioning.md) that it is converted into with the help of [DataSourcePartitioning](../physical-operators/Partitioning.md#DataSourcePartitioning).
 
 ## Contract
 
-### <span id="numPartitions"> Number of Partitions
+### Number of Partitions { #numPartitions }
 
 ```java
 int numPartitions()
@@ -21,7 +21,7 @@ Used when:
 
 * [DataSourcePartitioning](../physical-operators/Partitioning.md#DataSourcePartitioning) is requested for the [number of partitions](../physical-operators/Partitioning.md#numPartitions)
 
-### <span id="satisfy"> Satisfying Distribution
+### Satisfying Distribution { #satisfy }
 
 ```java
 boolean satisfy(
@@ -34,5 +34,5 @@ Used when:
 
 ## Implementations
 
-* `KeyGroupedPartitioning`
+* [KeyGroupedPartitioning](KeyGroupedPartitioning.md)
 * `UnknownPartitioning`
diff --git a/docs/storage-partitioned-joins/.pages b/docs/storage-partitioned-joins/.pages
@@ -0,0 +1,4 @@
+title: Storage-Partitioned Joins
+nav:
+    - index.md
+    - ...
diff --git a/docs/storage-partitioned-joins/index.md b/docs/storage-partitioned-joins/index.md
@@ -0,0 +1,12 @@
+# Storage-Partitioned Joins
+
+**Storage-Partitioned Joins** (_SPJ_) are a new type of [join](../joins.md) in Spark SQL that uses the existing storage layout of the joined tables to perform a partitioned join and avoid expensive shuffles (similar to [Bucketing](../bucketing/index.md)).
+
+!!! note
+    The Storage-Partitioned Joins feature was added in Apache Spark 3.3.0 ([\[SPARK-37375\] Umbrella: Storage Partitioned Join (SPJ)]({{ spark.jira }}/SPARK-37375)).
+
+Storage-Partitioned Join is meant mainly, if not exclusively, for [Spark SQL connectors](../connector/index.md) (_v2 data sources_).
+
+Storage-Partitioned Join was proposed in this [SPIP](https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE).
+
+Storage-Partitioned Join uses [KeyGroupedPartitioning](../connector/KeyGroupedPartitioning.md) to determine partitions.
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -163,6 +163,7 @@ nav:
     - ... | bloom-filter-join/**.md
     - ... | bucketing/**.md
     - ... | cache-serialization/**.md
+    - ... | storage-partitioned-joins/**.md
     - Catalog Plugin API:
       - connector/catalog/index.md
       - CatalogExtension: connector/catalog/CatalogExtension.md
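
The idea described in the pages this commit adds can be modelled outside Spark with a small Java sketch: rows are grouped by the value of a partition transform, and two inputs grouped by the same transform are then joined group by group, with no redistribution step in between. This is only a conceptual illustration. All names in it (`KeyGroupedJoinSketch`, `bucket`, `groupByKey`, `join`) and the sample data are hypothetical, and it does not use or reproduce Spark's `KeyGroupedPartitioning` API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class KeyGroupedJoinSketch {

    // Hypothetical partition transform: maps an id to one of numBuckets groups.
    static int bucket(int numBuckets, int id) {
        return Math.floorMod(id, numBuckets);
    }

    // Groups row ids by their partition key (the value of the transform).
    static Map<Integer, List<Integer>> groupByKey(List<Integer> ids, int numBuckets) {
        Map<Integer, List<Integer>> groups = new TreeMap<>();
        for (int id : ids) {
            groups.computeIfAbsent(bucket(numBuckets, id), k -> new ArrayList<>()).add(id);
        }
        return groups;
    }

    // Joins two key-grouped inputs on id. Rows are only compared with rows
    // that share the same partition key, so no data has to be moved
    // between groups.
    static List<Integer> join(Map<Integer, List<Integer>> left,
                              Map<Integer, List<Integer>> right) {
        List<Integer> matches = new ArrayList<>();
        for (Map.Entry<Integer, List<Integer>> group : left.entrySet()) {
            List<Integer> rightRows = right.getOrDefault(group.getKey(), List.of());
            for (int id : group.getValue()) {
                if (rightRows.contains(id)) {
                    matches.add(id);
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Both inputs use the same transform (4 buckets on id).
        Map<Integer, List<Integer>> orders = groupByKey(List.of(1, 2, 5, 8), 4);
        Map<Integer, List<Integer>> customers = groupByKey(List.of(2, 3, 5, 9), 4);
        System.out.println(orders);                  // {0=[8], 1=[1, 5], 2=[2]}
        System.out.println(join(orders, customers)); // [5, 2]
    }
}
```

The sketch only works because both inputs are grouped by the same transform with the same number of buckets. With different transforms, matching ids could land in different groups and the group-by-group join would miss them.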