[SPARK-26297][SQL] improve the doc of Distribution/Partitioning #23249

cloud-fan · 2018-12-06T15:55:36Z

What changes were proposed in this pull request?

Some docs of `Distribution`/`Partitioning` are stale and misleading. This PR fixes them:

1. `Distribution` never has an intra-partition requirement.
2. `OrderedDistribution` does not require tuples that share the same value to be colocated in the same partition.
3. `RangePartitioning` can provide a weaker guarantee for a prefix of its `ordering` expressions.

How was this patch tested?

comment-only PR.

cloud-fan · 2018-12-06T15:56:04Z

cc @marmbrus @maryannxue @hvanhovell @gatorsmile @viirya

cloud-fan · 2018-12-06T15:59:00Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala

+ * partition. They can also across partitions, but these partitions must be contiguous. For example,
+ * if value `v` is the biggest values in partition 3, it can also be in partition 4 as the smallest
+ * value. If all the values in partition 4 are `v`, it can also be in partition 5 as the smallest
+ * value.
 */
 case class OrderedDistribution(ordering: Seq[SortOrder]) extends Distribution {


This is only used by sort, and sort doesn't require rows with the same value to be colocated in the same partition.

Actually, we already use this knowledge to optimize `RangePartitioning.satisfies`.
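To make the requirement discussed above concrete, here is a minimal sketch. It is not Spark code: `satisfiesOrdered` is a hypothetical helper that models partitions as plain Scala sequences and checks only the inter-partition condition (every row of a later partition is larger than or equal to every row of the earlier one).

```scala
object OrderedDistributionSketch {
  // Hypothetical helper, not part of Spark: returns true when, for every pair of
  // adjacent non-empty partitions, each row of the later partition is larger than
  // or equal to each row of the earlier one.
  def satisfiesOrdered[T](partitions: Seq[Seq[T]])(implicit ord: Ordering[T]): Boolean = {
    // Empty partitions hold no rows, so they cannot violate the requirement.
    val nonEmpty = partitions.filter(_.nonEmpty)
    nonEmpty.zip(nonEmpty.drop(1)).forall { case (prev, next) =>
      ord.lteq(prev.max, next.min)
    }
  }
}
```

Under this sketch, `Seq(Seq(1, 2, 3), Seq(3, 3), Seq(3, 4))` passes the check even though the value 3 spans three contiguous partitions, while `Seq(Seq(1, 3), Seq(2, 4))` does not.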

maryannxue · 2018-12-06T17:22:28Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala

+ * Distribution here refers to inter-node partitioning of data:
+ *   The distribution describes how tuples are partitioned across physical machines in a cluster.
+ *   Knowing this property allows some operators (e.g., Aggregate) to perform partition local
+ *   operations instead of global ones.
 */


Do we also need to mention that there's another related but orthogonal physical property, i.e., the intra-partition ordering, and maybe give an example here of how operators take advantage of these two physical properties together?

I intentionally removed everything about intra-partition, as we never leverage it and no partitioning provides this property. Did I miss something?

Yes, I understand that partitioning has nothing to do with intra-partition ordering at all, and it was wrong to include intra-partition ordering as part of the distribution properties. But I was thinking that mentioning ordering as a side note would probably help people better understand how some operators work. Or maybe this is not the best place to put it.

For ordering, I think people can look at `OrderedDistribution`?

SparkQA · 2018-12-06T19:34:55Z

Test build #99775 has finished for PR 23249 at commit 24ea28a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-12-06T20:36:26Z

Test build #99779 has finished for PR 23249 at commit 3df1e44.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maryannxue · 2018-12-07T04:38:18Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala

@@ -243,10 +248,19 @@ case class HashPartitioning(expressions: Seq[Expression], numPartitions: Int)
 * Represents a partitioning where rows are split across partitions based on some total ordering of
 * the expressions specified in `ordering`.  When data is partitioned in this manner the following


nit: add "," after "this manner".

SparkQA · 2018-12-07T08:05:01Z

Test build #99814 has finished for PR 23249 at commit 130bc95.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-12-07T08:05:01Z

Test build #99812 has finished for PR 23249 at commit 04be19e.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-12-07T09:14:36Z

retest this please

viirya · 2018-12-07T10:09:28Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala

- * partitions.
+ * [[Expression Expressions]]. Its requirement is defined as the following:
+ *   - Given any 2 adjacent partitions, all the rows of the second partition must be larger than or
+ *     equal to any row in the first partition, according to the `ordering` expressions.


Why do we need this equality here? Can we just require that all the rows in the second partition be larger than any row in the first partition? Do we need or use such equality?

Note that only sort requires `OrderedDistribution`, and global sort doesn't care if there are equal rows across partitions.

Here is a definition of the requirement. When designing protocols, it's important to make the requirement as weak as possible, and make guarantees as strong as possible.

Global sort (actually the `RangePartitioner`) currently guarantees that all rows in partition p + 1 are larger than the rows in partition p. I don't think we should relax this; besides collect limit, there aren't any use cases I can think of that could work with this relaxed requirement.

Let us keep the semantics unchanged at this moment. If needed, in the future, we can either introduce a new distribution type or change the existing types.

@hvanhovell We need this relaxed requirement; otherwise we have to remove the optimization here.

I did not change the semantics; I just corrected the comment to reflect what the current semantics are.

Yes, @cloud-fan is right that the "or equal to" part is necessary for `RangePartitioning(a, b, c)` to satisfy `OrderedDistribution(a, b)`.
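The point about "or equal to" can be checked on a small data set. The sketch below is illustrative only and is not Spark code: rows are hypothetical `(a, b, c)` integer triples, every partition is assumed to be non-empty, and the helper names are made up for this example.

```scala
object PrefixGuaranteeSketch {
  type Row = (Int, Int, Int) // hypothetical columns (a, b, c)

  // Checks every pair of adjacent partitions: `cmp(maxOfPrev, minOfNext)` must hold.
  // Assumes every partition is non-empty.
  private def adjacentOk[K: Ordering](parts: Seq[Seq[K]])(cmp: (K, K) => Boolean): Boolean =
    parts.zip(parts.drop(1)).forall { case (prev, next) => cmp(prev.max, next.min) }

  // Projects each row onto the prefix (a, b).
  private def prefix(parts: Seq[Seq[Row]]): Seq[Seq[(Int, Int)]] =
    parts.map(_.map { case (a, b, _) => (a, b) })

  // Strict guarantee on the full key (a, b, c).
  def strictOnAbc(parts: Seq[Seq[Row]]): Boolean =
    adjacentOk(parts)((x, y) => Ordering[Row].lt(x, y))

  // Strict comparison on the prefix (a, b); this can fail even when strictOnAbc holds.
  def strictOnAb(parts: Seq[Seq[Row]]): Boolean =
    adjacentOk(prefix(parts))((x, y) => Ordering[(Int, Int)].lt(x, y))

  // "Larger than or equal to" comparison on the prefix (a, b).
  def nonStrictOnAb(parts: Seq[Seq[Row]]): Boolean =
    adjacentOk(prefix(parts))((x, y) => Ordering[(Int, Int)].lteq(x, y))
}
```

For `Seq(Seq((1, 1, 1), (1, 2, 1)), Seq((1, 2, 2), (2, 0, 0)))`, the partitions are strictly ordered on `(a, b, c)`, but the prefix `(1, 2)` appears on both sides of the boundary, so only the "or equal to" form holds on `(a, b)`.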

SparkQA · 2018-12-07T13:02:16Z

Test build #99821 has finished for PR 23249 at commit 130bc95.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-12-10T08:51:16Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala

+          // If `ordering` is a prefix of `requiredOrdering`:
+          //   - Let's say `ordering` is [a, b] and `requiredOrdering` is [a, b, c]. If a row is
+          //     larger than another row w.r.t. [a, b], it's also larger w.r.t. [a, b, c]. So
+          //     `RangePartitioning(a, b)` satisfy `OrderedDistribution(a, b, c)`.


nit: satisfy -> satisfies

hvanhovell · 2018-12-10T15:53:13Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala

- * [[Expression Expressions]] will be co-located. Based on the context, this
- * can mean such tuples are either co-located in the same partition or they will be contiguous
- * within a single partition.
+ * [[Expression Expressions]] will be co-located in the same partition.


What does `[[Expression Expressions]]` mean? Should it be `[[Expression]]s`?

Nvm, this is actually a way to name the link. I have learned something here :)...

SparkQA · 2018-12-10T17:08:55Z

Test build #99912 has finished for PR 23249 at commit adfcec4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2018-12-10T17:12:04Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala

- *  - Each partition will have a `min` and `max` row, relative to the given ordering.  All rows
- *    that are in between `min` and `max` in this `ordering` will reside in this partition.
+ * the expressions specified in `ordering`.  When data is partitioned in this manner, it guarantees:
+ *   - Given any 2 adjacent partitions, all the rows of the second partition must be larger than


Nit: don't use bullets if you have only one of them.

gatorsmile · 2018-12-10T19:27:29Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala

+ * Distribution here refers to inter-node partitioning of data:
+ *   - The distribution describes how tuples are partitioned across physical machines in a cluster.
+ *     Knowing this property allows some operators (e.g., Aggregate) to perform partition local
+ *     operations instead of global ones.


How about?

Distribution here refers to inter-node partitioning of data. That is, it describes how tuples are partitioned across physical machines in a cluster. Knowing this property allows some operators (e.g., Aggregate) to perform partition local operations instead of global ones.
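As a rough illustration of why knowing the distribution lets an operator like Aggregate work partition-locally, here is a small sketch. It is not Spark code: rows are hypothetical `(key, value)` pairs, `hashPartition` is a made-up stand-in for a partitioning that clusters rows by key, and partitions are plain in-memory sequences.

```scala
object PartitionLocalAggregateSketch {
  type Row = (String, Int) // hypothetical (key, value) rows

  // Hash-partitions rows by key so that all rows sharing a key land in one partition.
  def hashPartition(rows: Seq[Row], numPartitions: Int): Seq[Seq[Row]] = {
    val byPartition = rows.groupBy { case (k, _) => Math.floorMod(k.hashCode, numPartitions) }
    (0 until numPartitions).map(p => byPartition.getOrElse(p, Seq.empty[Row]))
  }

  // Global aggregation: sums values per key over all rows.
  def sumByKey(rows: Seq[Row]): Map[String, Int] =
    rows.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

  // Partition-local aggregation: each partition is aggregated on its own. Because
  // keys never span partitions, the per-partition results have disjoint keys and
  // can simply be concatenated.
  def partitionLocalSum(parts: Seq[Seq[Row]]): Map[String, Int] =
    parts.map(sumByKey).foldLeft(Map.empty[String, Int])(_ ++ _)
}
```

When the rows are clustered by key, `partitionLocalSum(hashPartition(rows, n))` gives the same result as `sumByKey(rows)`, with no cross-partition merge of partial sums.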

SparkQA · 2018-12-11T05:13:59Z

Test build #99941 has finished for PR 23249 at commit ddb82c3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maryannxue · 2018-12-11T19:32:44Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala

+          //     `RangePartitioning(a, b)` satisfies `OrderedDistribution(a, b, c)`.
+          //
+          // If `requiredOrdering` is a prefix of `ordering`:
+          //   - Let's say `ordering` is [a, b, c] and `requiredOrdering` is [a, b]. If a row is


"If a row is ... satisfies ..." => "According to the `RangePartitioning` definition, any [a, b, c] in a previous partition must be smaller than any [a, b, c] in the following partition. This means that for any [a1, b1, c1] in the previous partition and any [a2, b2, c2] in the following partition, either 1) [a1, b1] is smaller than [a2, b2], or 2) [a1, b1] is equal to [a2, b2] and c1 is smaller than c2. So `RangePartitioning(a, b, c)` satisfies `OrderedDistribution(a, b)`, which requires that any [a1, b1] from a previous partition be smaller than or equal to any [a2, b2] from a following partition."

maryannxue

LGTM, except one comment.
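The two prefix cases discussed in the code comment above (`ordering` is a prefix of `requiredOrdering`, or the reverse) can be folded into a single comparison of the shorter list against the start of the longer one. The following is a sketch under simplifying assumptions, not the actual Spark implementation: ordering expressions are modelled as plain column names, sort direction and null ordering are ignored, and `rangeSatisfiesOrdered` is a hypothetical name.

```scala
object RangeSatisfiesSketch {
  // Hypothetical simplification: ordering expressions are modelled as column names.
  // Returns true when one ordering is a prefix of the other (or they are equal).
  def rangeSatisfiesOrdered(ordering: Seq[String], requiredOrdering: Seq[String]): Boolean = {
    val minSize = math.min(ordering.size, requiredOrdering.size)
    ordering.take(minSize) == requiredOrdering.take(minSize)
  }
}
```

With this check, a range partitioning on `[a, b]` is accepted for a required ordering `[a, b, c]` and vice versa, while `[a, c]` is rejected for `[a, b]`.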

SparkQA · 2018-12-12T08:05:01Z

Test build #100005 has finished for PR 23249 at commit cb94add.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2018-12-12T08:15:40Z

retest this please.

SparkQA · 2018-12-12T12:02:41Z

Test build #100012 has finished for PR 23249 at commit cb94add.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-12-12T12:11:44Z

retest this please

SparkQA · 2018-12-12T15:17:40Z

Test build #100021 has finished for PR 23249 at commit cb94add.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-12-12T15:47:43Z

retest this please

SparkQA · 2018-12-12T19:53:21Z

Test build #100029 has finished for PR 23249 at commit cb94add.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-12-13T03:11:56Z

This is a comment-only PR and the SparkR test failure is a known issue, so I'm merging it to master. Thanks!

## What changes were proposed in this pull request?

Some documents of `Distribution/Partitioning` are stale and misleading, this PR fixes them:
1. `Distribution` never have intra-partition requirement
2. `OrderedDistribution` does not require tuples that share the same value being colocated in the same partition.
3. `RangePartitioning` can provide a weaker guarantee for a prefix of its `ordering` expressions.

## How was this patch tested?

comment-only PR.

Closes apache#23249 from cloud-fan/doc.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

cloud-fan mentioned this pull request Dec 6, 2018

[SPARK-26021][SQL][followup] only deal with NaN and -0.0 in UnsafeWriter #23239

Closed

cloud-fan commented Dec 6, 2018

cloud-fan force-pushed the doc branch from 24ea28a to 3df1e44 on December 6, 2018 16:42

maryannxue reviewed Dec 6, 2018

maryannxue reviewed Dec 7, 2018

cloud-fan force-pushed the doc branch 3 times, most recently from e20dba6 to 130bc95 on December 7, 2018 07:41

improve the doc of Distribution/Partitioning

130bc95

viirya reviewed Dec 7, 2018

viirya approved these changes Dec 9, 2018

HyukjinKwon approved these changes Dec 10, 2018

fix typo

adfcec4

hvanhovell reviewed Dec 10, 2018

gatorsmile reviewed Dec 10, 2018

code style

ddb82c3

maryannxue reviewed Dec 11, 2018

improve comment

cb94add

asfgit closed this in 05b68d5 Dec 13, 2018
