[SPARK-15382][SQL] Fix a rule to push down projects beneath Sample #14181

Closed
wants to merge 5 commits

Conversation

@maropu (Member) commented Jul 13, 2016

What changes were proposed in this pull request?

When X > 1.0 in Dataset#sample, sample(true, X).withColumn("x", monotonically_increasing_id) does not produce unique ids. This PR fixes that bug.
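
A minimal repro sketch of the description above (a hypothetical spark-shell run; as in the session logs later in this thread, org.apache.spark.sql.functions is assumed to be in scope). With replacement and X > 1.0 a source row can be emitted several times, and once the projection is pushed beneath Sample those duplicates share a single "x" value instead of receiving unique ids:

scala> spark.range(10).sample(true, 2.0).withColumn("x", monotonically_increasing_id).show()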

How was this patch tested?

Added tests in DataFrameSuite.

@SparkQA commented Jul 13, 2016

Test build #62251 has finished for PR 14181 at commit 9a5f975.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 14, 2016

Test build #62291 has finished for PR 14181 at commit 5c4d0df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor) commented Jul 14, 2016

should we just enforce sampling ratio <= 1.0?

@maropu (Member, Author) commented Jul 14, 2016

Yea, that solution is also okay. Is it okay to fix it that way?

@HyukjinKwon (Member):

FYI, it seems this still happens even when the ratio is less than 1.0, because it is sampling with replacement.

scala> spark.range(10).sample(true, 0.5).withColumn("mid", monotonically_increasing_id).show()
+---+-----------+
| id|        mid|
+---+-----------+
|  0|          0|
|  1| 8589934592|
|  4|25769803777|
|  4|25769803777|
|  5|34359738368|
|  7|51539607552|
|  8|60129542144|
+---+-----------+
scala> spark.range(10).sample(true, 0.5).withColumn("mid", monotonically_increasing_id).show()
+---+-----------+
| id|        mid|
+---+-----------+
|  0|          0|
|  0|          0|
|  1| 8589934592|
|  2|17179869184|
|  3|25769803776|
|  3|25769803776|
|  6|42949672960|
|  9|60129542145|
|  9|60129542145|
+---+-----------+

@HyukjinKwon (Member):

FYI, if replacement is disabled, it fails when the ratio is more than 1.0.

scala> spark.range(10).sample(false, 1.1).withColumn("mid", monotonically_increasing_id).show()
16/07/14 15:04:56 ERROR Executor: Exception in task 0.0 in stage 94.0 (TID 376)
java.lang.IllegalArgumentException: requirement failed: Upper bound (1.1) must be <= 1.0
    at scala.Predef$.require(Predef.scala:224)
    at org.apache.spark.util.random.BernoulliCellSampler.<init>(RandomSampler.scala:109)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.init(Unknown Source)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8.apply(WholeStageCodegenExec.scala:367)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8.apply(WholeStageCodegenExec.scala:364)

@maropu (Member, Author) commented Jul 14, 2016

@HyukjinKwon @rxin Thanks for your survey. You're right; it seems inputs can be sampled twice in the current implementation even when fraction < 1.0. Is this behaviour expected? It depends heavily on the sampling implementation.

@HyukjinKwon (Member) commented Jul 14, 2016

Yea, sampling with replacement means the results can be duplicated (see http://stattrek.com/statistics/dictionary.aspx?definition=Sampling_with_replacement). IMHO, to deal with the issue this way, this fix should always be applied when replace is true.
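
As a rough sketch of why duplicates appear (conceptual only, not Spark's code; Spark's Poisson-based sampler is similar in spirit, and the commons-math3 dependency here is an assumption for illustration):

import org.apache.commons.math3.distribution.PoissonDistribution

// With-replacement sampling emits each input row k times, where k is drawn
// from Poisson(fraction); even for fraction < 1.0, k can be 2 or more.
val fraction = 0.5
val poisson = new PoissonDistribution(fraction)
val sampled = (0 until 10).flatMap { row => Seq.fill(poisson.sample())(row) }
// e.g. Vector(0, 3, 3, 7): row 3 appears twice while most rows are dropped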

@maropu (Member, Author) commented Jul 14, 2016

@rxin @HyukjinKwon could you re-check?

  * @param seed Seed for sampling.
  *
  * @group typedrel
  * @since 1.6.0
  */
 def sample(withReplacement: Boolean, fraction: Double, seed: Long): Dataset[T] = withTypedPlan {
-  Sample(0.0, fraction, withReplacement, seed, logicalPlan)()
+  if (0.0 < fraction && fraction < 1.0) {
Member:

Shouldn't this be fraction <= 1.0? It seems it is when replace is false :)

Member Author:

I'm not sure about sampling methods, but is it natural for a sampling method to accept fraction > 1.0? It seems sampling naturally means randomly picking a part of the input data. Is that incorrect?

Member:

If my understanding is correct, sampling is a kind of extraction of a predetermined number of observations from a larger population. I mean, the definition of the word "sample" is "a small amount of something that gives you information about the thing it was taken from".

Member Author:

Thanks for your explanation. If so, I think the fraction > 1.0 case is meaningless.

@SparkQA commented Jul 14, 2016

Test build #62301 has finished for PR 14181 at commit e426fc3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 14, 2016

Test build #62303 has finished for PR 14181 at commit 3885f21.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 14, 2016

Test build #62309 has finished for PR 14181 at commit ca23f4f.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 14, 2016

Test build #62322 has finished for PR 14181 at commit a50d3dc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 14, 2016

Test build #62331 has finished for PR 14181 at commit a868e09.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 19, 2016

Test build #64046 has finished for PR 14181 at commit b0f5dd5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 19, 2016

Test build #64048 has finished for PR 14181 at commit c947583.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

def apply(plan: LogicalPlan): LogicalPlan = plan transform {
  // Push down projection into sample
  case proj @ Project(projectList, Sample(lb, up, replace, seed, child)) =>
    if (!replace || !projectList.exists(_.find(!_.deterministic).nonEmpty)) {
Member:

The second condition looks complicated. Just projectList.forall(_.deterministic)?

Member Author:

Yea, thanks! I'll fix this.
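
Putting the reviewer's suggestion together with the diff above, a minimal sketch of the guarded rule (the imports and the trailing empty argument list on Sample are assumptions inferred from the snippets in this thread):

import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, Sample}
import org.apache.spark.sql.catalyst.rules.Rule

object PushProjectThroughSample extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // Push the projection beneath Sample only when it cannot change results:
    // with replacement, Sample duplicates rows, so a non-deterministic
    // expression evaluated below Sample would give the duplicates identical
    // values (the monotonically_increasing_id bug reported above).
    case Project(projectList, Sample(lb, up, replace, seed, child))
        if !replace || projectList.forall(_.deterministic) =>
      Sample(lb, up, replace, seed, Project(projectList, child))()
  }
}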

@maropu (Member, Author) commented Aug 19, 2016

@viirya Ah, I noticed this issue has already been fixed by your PR #14327 (SPARK-16686). So I'll close this. Thanks!

@maropu maropu closed this Aug 19, 2016
-  require(fraction >= 0,
-    s"Fraction must be nonnegative, but got ${fraction}")
+  require(fraction >= 0 && fraction <= 1.0,
+    s"Fraction range must be 0.0 <= `fraction` <= 1.0, but got ${fraction}")
@HyukjinKwon (Member) commented Aug 23, 2016:

Hi @maropu, I just wonder if this fix is still needed, if only to keep behaviour consistent whether withReplacement is true or not.

Member Author:

@HyukjinKwon Oh, you're right, my bad... thanks! Since this original PR has drifted away from this bug, I'll file a new JIRA ticket and a PR soon.
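
For reference, with a range check like the one above in place, the failing run from earlier in the thread would presumably be rejected when sample is called instead of inside an executor task (a sketch: the message comes from the require shown above, not from observed output):

scala> spark.range(10).sample(false, 1.1).show()
java.lang.IllegalArgumentException: requirement failed: Fraction range must be 0.0 <= `fraction` <= 1.0, but got 1.1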

@maropu maropu deleted the SPARK-15382 branch July 5, 2017 11:49