-
Notifications
You must be signed in to change notification settings - Fork 28.7k
[SPARK-33857][SQL] Unify the default seed of random functions #30864
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Test build #133103 has finished for PR 30864 at commit
|
Test build #133119 has finished for PR 30864 at commit
|
Kubernetes integration test starting |
Kubernetes integration test status success |
Kubernetes integration test starting |
Kubernetes integration test status success |
Test build #133126 has finished for PR 30864 at commit
|
Test build #133131 has finished for PR 30864 at commit
|
Kubernetes integration test starting |
Kubernetes integration test status success |
retest this please |
Kubernetes integration test starting |
Kubernetes integration test status success |
Test build #133137 has finished for PR 30864 at commit
|
Test build #133147 has finished for PR 30864 at commit
|
Just a question; any system having uuid with a seed param? I couldn't find it though. |
@maropu thanks for the thought.
I haven't seen it too, but before this PR Spark has already support the seed param. I'd say the main change of this PR is to keep the seed initialize at Analyzer side. For this idea, we make all the rand functions (include uuid) have the same seed param. |
Ah, I see. If so, I think its better to focus on the refactoring for the seed initialization and not to support the new feature: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems too intrusive for the goal; select uuid(1)
. Could you simplify this PR?
Oh, I commented before reading @maropu 's last comment. Ya. This looks like two orthogonal purposes. |
thanks @maropu @dongjoon-hyun , will to narrow the goal and make this PR focus on one thing. |
665cb53
to
0ed9ca3
Compare
0ed9ca3
to
786b9d7
Compare
Test build #133203 has finished for PR 30864 at commit
|
Kubernetes integration test starting |
/** | ||
* A place holder expression used in random functions, will be replaced after analyze. | ||
*/ | ||
case class DefaultSeed() extends Expression with Unevaluable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
case object UnresolvedSeed extends Expression with Unevaluable
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it's better.
Kubernetes integration test status failure |
Test build #133209 has finished for PR 30864 at commit
|
Kubernetes integration test starting |
Kubernetes integration test status success |
Test build #133225 has finished for PR 30864 at commit
|
Test build #133266 has finished for PR 30864 at commit
|
Kubernetes integration test starting |
Kubernetes integration test status success |
Test build #133271 has finished for PR 30864 at commit
|
I think it's ok to review this. cc @maropu @dongjoon-hyun @cloud-fan |
|
||
def this(child: Expression) = this(child, false) | ||
|
||
override def withNewSeed(seed: Long): Rand = Rand(Literal(seed, LongType)) | ||
override def withNewSeed(seed: Long): Rand = Rand(Literal(seed, LongType), hideSeed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ur, is this related to this PR, @ulysses-you ? This looks like an orthogonal bug fix a little to me. I'm wondering if we we have a test coverage for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dongjoon-hyun thanks for the carefully check. Yes, it could be thought as bug fix. See the jenkins test result before this commit a3c9f0b.
And I think for early branch, we just need to backport this place and leave this PR for master. But also, I'm OK to send a PR to fix this first, then go back to this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an orthogonal bug fix, but it only affects the sql
method. I don't have a strong opinion about backporting the fix or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Thanks, @cloud-fan .
|
||
def this(child: Expression) = this(child, false) | ||
|
||
override def withNewSeed(seed: Long): Randn = Randn(Literal(seed, LongType)) | ||
override def withNewSeed(seed: Long): Randn = Randn(Literal(seed, LongType), hideSeed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, this PR looks good, but I have a question on the change because SPARK-33857 is an improvement PR which lands only at master
branch.
If we are correcting some fix in this PR, we should split this PR by filing a new JIRA for that and backport to the release branches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. Merged to master for Apache Spark 3.2.0.
Thank you, @ulysses-you and all.
Merry Christmas and Happy New Year!
late lgtm. |
thank you all ! Merry Christmas ! |
### What changes were proposed in this pull request? Unify the seed of random functions 1. Add a hold place expression `UnresolvedSeed ` as the defualt seed. 2. Change `Rand`,`Randn`,`Uuid`,`Shuffle` default seed to `UnresolvedSeed `. 3. Replace `UnresolvedSeed ` to real seed at `ResolveRandomSeed` rule. ### Why are the changes needed? `Uuid` and `Shuffle` use the `ResolveRandomSeed` rule to set the seed if user doesn't give a seed value. `Rand` and `Randn` do this at constructing. It's better to unify the default seed at Analyzer side since we have used `ExpressionWithRandomSeed` at streaming query. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass exists test and add test. Closes apache#30864 from ulysses-you/SPARK-33857. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?
Unify the seed of random functions
UnresolvedSeed
as the defualt seed.Rand
,Randn
,Uuid
,Shuffle
default seed toUnresolvedSeed
.UnresolvedSeed
to real seed atResolveRandomSeed
rule.Why are the changes needed?
Uuid
andShuffle
use theResolveRandomSeed
rule to set the seed if user doesn't give a seed value.Rand
andRandn
do this at constructing.It's better to unify the default seed at Analyzer side since we have used
ExpressionWithRandomSeed
at streaming query.Does this PR introduce any user-facing change?
No.
How was this patch tested?
Pass exists test and add test.