[SPARK-15382][SQL] Fix a rule to push down projects beneath Sample #14181

maropu · 2016-07-13T15:20:25Z

What changes were proposed in this pull request?

When the fraction X passed to Dataset#sample is greater than 1.0, sample(true, X).withColumn("x", monotonically_increasing_id) does not produce unique ids. This PR fixes that bug.

How was this patch tested?

Added tests in DataFrameSuite.

SparkQA · 2016-07-13T17:14:10Z

Test build #62251 has finished for PR 14181 at commit 9a5f975.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-07-14T03:46:38Z

Test build #62291 has finished for PR 14181 at commit 5c4d0df.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-07-14T05:29:15Z

should we just enforce sampling ratio <= 1.0?

maropu · 2016-07-14T05:36:01Z

Yea, that solution is also okay. Is it okay to fix it that way?

HyukjinKwon · 2016-07-14T06:04:33Z

FYI, it seems it still happens even if ratio is less than 1.0 because it is sampling with replacement.

scala> spark.range(10).sample(true, 0.5).withColumn("mid", monotonically_increasing_id).show()
+---+-----------+
| id|        mid|
+---+-----------+
|  0|          0|
|  1| 8589934592|
|  4|25769803777|
|  4|25769803777|
|  5|34359738368|
|  7|51539607552|
|  8|60129542144|
+---+-----------+

scala> spark.range(10).sample(true, 0.5).withColumn("mid", monotonically_increasing_id).show()
+---+-----------+
| id|        mid|
+---+-----------+
|  0|          0|
|  0|          0|
|  1| 8589934592|
|  2|17179869184|
|  3|25769803776|
|  3|25769803776|
|  6|42949672960|
|  9|60129542145|
|  9|60129542145|
+---+-----------+

HyukjinKwon · 2016-07-14T06:05:30Z

FYI, if replacement is disabled, it fails when the ratio is more than 1.0.

scala> spark.range(10).sample(false, 1.1).withColumn("mid", monotonically_increasing_id).show()
16/07/14 15:04:56 ERROR Executor: Exception in task 0.0 in stage 94.0 (TID 376)
java.lang.IllegalArgumentException: requirement failed: Upper bound (1.1) must be <= 1.0
    at scala.Predef$.require(Predef.scala:224)
    at org.apache.spark.util.random.BernoulliCellSampler.<init>(RandomSampler.scala:109)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.init(Unknown Source)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8.apply(WholeStageCodegenExec.scala:367)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8.apply(WholeStageCodegenExec.scala:364)

maropu · 2016-07-14T06:50:21Z

@HyukjinKwon @rxin thx for your survey. You're right, it seems an input row can be sampled twice in the current implementation even when fraction<1.0. Is this behaviour expected? It depends heavily on the sampling implementation.

HyukjinKwon · 2016-07-14T07:07:54Z

Yea, with sampling with replacement it is expected that the results can contain duplicates (see http://stattrek.com/statistics/dictionary.aspx?definition=Sampling_with_replacement). IMHO, to deal with the issue this way, the fix should always be enabled when replace is true.
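A minimal sketch of why duplicated rows also carry duplicated ids once a non-deterministic projection is pushed beneath a with-replacement sample. This is plain Scala, not Spark: `sampleWithReplacement`, `withRowId`, and the fixed `counts` are hypothetical stand-ins chosen only for illustration (a real sampler would draw the counts at random).

```scala
object PushDownSketch {
  // Toy stand-in for sampling with replacement: row i is emitted counts(i) times.
  // Assumption: counts are fixed here so the example is reproducible.
  def sampleWithReplacement[A](rows: Seq[A], counts: Seq[Int]): Seq[A] =
    rows.zip(counts).flatMap { case (row, n) => Seq.fill(n)(row) }

  // Toy stand-in for a non-deterministic projection such as an id generator:
  // each row is tagged with its position in the stream the projection sees.
  def withRowId[A](rows: Seq[A]): Seq[(A, Long)] =
    rows.zipWithIndex.map { case (row, i) => (row, i.toLong) }

  val input: Seq[Int] = 0 until 5
  val counts: Seq[Int] = Seq(1, 0, 2, 1, 2)

  // Projection pushed beneath the sample: ids are assigned first, then copied
  // along with every duplicated row.
  val pushedDown: Seq[(Int, Long)] =
    sampleWithReplacement(withRowId(input), counts)

  // Projection kept above the sample: ids are assigned to the sampled rows.
  val keptAbove: Seq[(Int, Long)] =
    withRowId(sampleWithReplacement(input, counts))

  // True when no id occurs more than once.
  def idsAreUnique(rows: Seq[(Int, Long)]): Boolean = {
    val ids = rows.map(_._2)
    ids.distinct.size == ids.size
  }
}
```

In this toy model `idsAreUnique(pushedDown)` is false and `idsAreUnique(keptAbove)` is true, and the result does not depend on whether the fraction is above or below 1.0.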

maropu · 2016-07-14T07:54:18Z

@rxin @HyukjinKwon could you re-check?

HyukjinKwon · 2016-07-14T08:13:52Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

   * @param seed Seed for sampling.
   *
   * @group typedrel
   * @since 1.6.0
   */
  def sample(withReplacement: Boolean, fraction: Double, seed: Long): Dataset[T] = withTypedPlan {
-    Sample(0.0, fraction, withReplacement, seed, logicalPlan)()
+    if (0.0 < fraction && fraction < 1.0) {


Shouldn't this be fraction <= 1.0? It seems that is the bound when replace is false :)

I'm not sure about sampling methods, though: is it natural for a sampling method to take fraction>1.0?
Sampling seems to mean randomly picking a part of the input data. Is this incorrect?

If my understanding is correct, sampling is kind of extracting a predetermined number of observations that are taken from a larger population. I mean.. the definition of the word "sample" is "a small amount of something that gives you information about the thing it was taken from".

Thanks for your explanation. If so, I think the case fraction>1.0 is meaningless.

SparkQA · 2016-07-14T09:08:34Z

Test build #62301 has finished for PR 14181 at commit e426fc3.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-07-14T10:02:26Z

Test build #62303 has finished for PR 14181 at commit 3885f21.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-07-14T12:07:13Z

Test build #62309 has finished for PR 14181 at commit ca23f4f.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-07-14T16:11:51Z

Test build #62322 has finished for PR 14181 at commit a50d3dc.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-07-14T19:12:41Z

Test build #62331 has finished for PR 14181 at commit a868e09.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-08-19T08:07:05Z

Test build #64046 has finished for PR 14181 at commit b0f5dd5.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-08-19T08:09:32Z

Test build #64048 has finished for PR 14181 at commit c947583.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2016-08-19T08:39:30Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    // Push down projection into sample
+    case proj @ Project(projectList, Sample(lb, up, replace, seed, child)) =>
+      if (!replace || !projectList.exists(_.find(!_.deterministic).nonEmpty)) {


The second condition looks complicated. Just projectList.forall(_.deterministic)?

yea, thanks! I'll fix this.
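A small sketch of why the two conditions can agree. It uses a toy expression tree, not Catalyst's `Expression` class; `Expr`, `Col`, `RowId`, and `Alias` are hypothetical names, and the model assumes a node is deterministic only if all of its children are.

```scala
// Toy expression tree for illustration only.
sealed trait Expr {
  def children: Seq[Expr]
  // Assumption: determinism is derived from the children unless overridden.
  def deterministic: Boolean = children.forall(_.deterministic)
  // Depth-first search for the first node that satisfies `p`.
  def find(p: Expr => Boolean): Option[Expr] =
    if (p(this)) Some(this)
    else children.foldLeft(Option.empty[Expr])((acc, c) => acc.orElse(c.find(p)))
}
case class Col(name: String) extends Expr {
  val children: Seq[Expr] = Nil
}
case object RowId extends Expr {
  val children: Seq[Expr] = Nil
  // Stand-in for a non-deterministic leaf such as an id generator.
  override def deterministic: Boolean = false
}
case class Alias(child: Expr, name: String) extends Expr {
  val children: Seq[Expr] = Seq(child)
}

object ConditionSketch {
  // Condition as written in the diff under review.
  def original(projectList: Seq[Expr]): Boolean =
    !projectList.exists(_.find(!_.deterministic).nonEmpty)

  // Simplified condition suggested in the review.
  def simplified(projectList: Seq[Expr]): Boolean =
    projectList.forall(_.deterministic)
}
```

Under that assumption a non-deterministic descendant makes every ancestor non-deterministic, so checking only the top-level expressions gives the same answer as searching each tree.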

maropu · 2016-08-19T10:18:48Z

@viirya Ah, I noticed this issue has already been fixed in your PR #14327 (SPARK-16686).
So, I'll close this. Thanks!

HyukjinKwon · 2016-08-23T23:12:26Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

-    require(fraction >= 0,
-      s"Fraction must be nonnegative, but got ${fraction}")
+    require(fraction >= 0 && fraction <= 1.0,
+      s"Fraction range must be 0.0 <= `fraction` <= 1.0, but got ${fraction}")


Hi @maropu, I just wonder whether this fix is still needed, just to be consistent whether withReplacement is true or not.

@HyukjinKwon oh, you're right and my bad... thanks! Since this original PR is far from that bug, I'll file a new JIRA ticket and open a PR soon.
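For reference, a standalone sketch of the range check from the diff above. `FractionCheck` and `validateFraction` are hypothetical names and are not part of Dataset.scala; only the `require` call mirrors the proposed change.

```scala
object FractionCheck {
  // Hypothetical helper: returns the fraction unchanged when it lies in
  // [0.0, 1.0]; otherwise `require` throws IllegalArgumentException.
  def validateFraction(fraction: Double): Double = {
    require(fraction >= 0 && fraction <= 1.0,
      s"Fraction range must be 0.0 <= `fraction` <= 1.0, but got ${fraction}")
    fraction
  }
}
```

With this check, a call such as `FractionCheck.validateFraction(1.1)` fails immediately, regardless of whether replacement is enabled.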

maropu force-pushed the SPARK-15382 branch from e426fc3 to 3885f21 Compare July 14, 2016 07:52

HyukjinKwon reviewed Jul 14, 2016

maropu force-pushed the SPARK-15382 branch from a50d3dc to a868e09 Compare July 14, 2016 17:09

maropu added 4 commits August 19, 2016 15:17

Add a role to avoid incorrect push-downs

700151c

Add an if condition to avoid overheads

426ffca

Apply comments

1890ef5

Add a filter condition in PushProjectThroughSample

f2e6973

maropu force-pushed the SPARK-15382 branch from a868e09 to b0f5dd5 Compare August 19, 2016 06:35

Fix a bug

c947583

maropu force-pushed the SPARK-15382 branch from b0f5dd5 to c947583 Compare August 19, 2016 06:38

viirya reviewed Aug 19, 2016

maropu closed this Aug 19, 2016

HyukjinKwon reviewed Aug 23, 2016

maropu deleted the SPARK-15382 branch July 5, 2017 11:49
