-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-15382][SQL] Fix a rule to push down projects beneath Sample #14181
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Test build #62251 has finished for PR 14181 at commit
|
Test build #62291 has finished for PR 14181 at commit
|
should we just enforce sampling ratio <= 1.0? |
yea, the solution is also okay. Is it okay to fix in that way? |
FYI, it seems it still happens even if ratio is less than 1.0 because it is sampling with replacement.
|
FYI, if replacement is disabled, it is failed when the ratio is more than 1.0.
|
@HyukjinKwon @rxin thx for your survey. You're right, it seems inputs are possibly sampled twice in the current implementation even when |
Yea, sampling with replacement expects the results can be duplicated (see http://stattrek.com/statistics/dictionary.aspx?definition=Sampling_with_replacement). IMHO, this fix should be enabled always when |
@rxin @HyukjinKwon could you re-check? |
* @param seed Seed for sampling. | ||
* | ||
* @group typedrel | ||
* @since 1.6.0 | ||
*/ | ||
def sample(withReplacement: Boolean, fraction: Double, seed: Long): Dataset[T] = withTypedPlan { | ||
Sample(0.0, fraction, withReplacement, seed, logicalPlan)() | ||
if (0.0 < fraction && fraction < 1.0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shoudn't this be fraction <= 1.0
? It seems it is when replace
is false
:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about sampling methods though, is it natural that sampling
methods have fraction>1.0
?
Seems sampling naturally means it randomly picks the part of input data. Is this incorrect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If my understanding is correct, sampling is kind of extracting a predetermined number of observations that are taken from a larger population. I mean.. the definition of the word "sample" is "a small amount of something that gives you information about the thing it was taken from".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for your explanation. If so, the case fraction>1.0
is meaningless, I think.
Test build #62301 has finished for PR 14181 at commit
|
Test build #62303 has finished for PR 14181 at commit
|
Test build #62309 has finished for PR 14181 at commit
|
Test build #62322 has finished for PR 14181 at commit
|
Test build #62331 has finished for PR 14181 at commit
|
Test build #64046 has finished for PR 14181 at commit
|
Test build #64048 has finished for PR 14181 at commit
|
def apply(plan: LogicalPlan): LogicalPlan = plan transform { | ||
// Push down projection into sample | ||
case proj @ Project(projectList, Sample(lb, up, replace, seed, child)) => | ||
if (!replace || !projectList.exists(_.find(!_.deterministic).nonEmpty)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The second condition looks complicated. Just projectList.forall(_.deterministic)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, thanks! I'll fix this.
@viirya Ah, I noticed this issue has already been fixed in your pr #14327(SPARK-16686). |
require(fraction >= 0, | ||
s"Fraction must be nonnegative, but got ${fraction}") | ||
require(fraction >= 0 && fraction <= 1.0, | ||
s"Fraction range must be 0.0 <= `fraction` <= 1.0, but got ${fraction}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @maropu, I just wonder if this fix is still needed though just to be consistent whether withRelacement
is true
or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@HyukjinKwon oh, you're right and my bad... thanks! Since this original pr is far from this bug, I'll make a new jira ticket and a pr soon later.
What changes were proposed in this pull request?
When
X
> 1.0 inDataset#sample
,sample(true, X).withColumn("x", monotonically_increasing_id)
cannot have unique ids. This pr fixes this bug.How was this patch tested?
Added tests in
DataFrameSuite
.