Skip to content

[SPARK-33857][SQL] Unify the default seed of random functions #30864

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 7 commits into from

Conversation

ulysses-you
Copy link
Contributor

@ulysses-you ulysses-you commented Dec 20, 2020

What changes were proposed in this pull request?

Unify the seed of random functions

  1. Add a hold place expression UnresolvedSeed as the defualt seed.
  2. Change Rand,Randn,Uuid,Shuffle default seed to UnresolvedSeed .
  3. Replace UnresolvedSeed to real seed at ResolveRandomSeed rule.

Why are the changes needed?

Uuid and Shuffle use the ResolveRandomSeed rule to set the seed if user doesn't give a seed value. Rand and Randn do this at constructing.

It's better to unify the default seed at Analyzer side since we have used ExpressionWithRandomSeed at streaming query.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass exists test and add test.

@github-actions github-actions bot added the SQL label Dec 20, 2020
@SparkQA
Copy link

SparkQA commented Dec 20, 2020

Test build #133103 has finished for PR 30864 at commit 29b5d1e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Shuffle(child: Expression, randomSeed: Expression = DefaultSeed())
  • case class Uuid(randomSeed: Expression = DefaultSeed()) extends LeafExpression with Stateful
  • trait ExpressionWithRandomSeed extends Expression
  • case class DefaultSeed() extends Expression with Unevaluable
  • case class Rand(child: Expression = DefaultSeed(), hideSeed: Boolean = false)
  • case class Randn(child: Expression = DefaultSeed(), hideSeed: Boolean = false)

@SparkQA
Copy link

SparkQA commented Dec 21, 2020

Test build #133119 has finished for PR 30864 at commit 599d6d9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 21, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37725/

@SparkQA
Copy link

SparkQA commented Dec 21, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37725/

@SparkQA
Copy link

SparkQA commented Dec 21, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37729/

@SparkQA
Copy link

SparkQA commented Dec 21, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37729/

@SparkQA
Copy link

SparkQA commented Dec 21, 2020

Test build #133126 has finished for PR 30864 at commit 620692c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 21, 2020

Test build #133131 has finished for PR 30864 at commit 95276c7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 21, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37736/

@SparkQA
Copy link

SparkQA commented Dec 21, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37736/

@HyukjinKwon
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Dec 21, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37746/

@SparkQA
Copy link

SparkQA commented Dec 21, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37746/

@SparkQA
Copy link

SparkQA commented Dec 21, 2020

Test build #133137 has finished for PR 30864 at commit 0ed9ca3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Rand(child: Expression, hideSeed: Boolean = false)
  • case class Randn(child: Expression, hideSeed: Boolean = false)

@SparkQA
Copy link

SparkQA commented Dec 21, 2020

Test build #133147 has finished for PR 30864 at commit 0ed9ca3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Rand(child: Expression, hideSeed: Boolean = false)
  • case class Randn(child: Expression, hideSeed: Boolean = false)

@maropu
Copy link
Member

maropu commented Dec 22, 2020

Just a question; any system having uuid with a seed param? I couldn't find it though.

@ulysses-you
Copy link
Contributor Author

ulysses-you commented Dec 22, 2020

@maropu thanks for the thought.

any system having uuid with a seed param?

I haven't seen it too, but before this PR Spark has already support the seed param.

I'd say the main change of this PR is to keep the seed initialize at Analyzer side. For this idea, we make all the rand functions (include uuid) have the same seed param.

@maropu
Copy link
Member

maropu commented Dec 22, 2020

Ah, I see. If so, I think its better to focus on the refactoring for the seed initialization and not to support the new feature: select uuid(1) in this PR.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems too intrusive for the goal; select uuid(1). Could you simplify this PR?

@dongjoon-hyun
Copy link
Member

Oh, I commented before reading @maropu 's last comment. Ya. This looks like two orthogonal purposes.

@ulysses-you
Copy link
Contributor Author

thanks @maropu @dongjoon-hyun , will to narrow the goal and make this PR focus on one thing.

@ulysses-you ulysses-you changed the title [SPARK-33857][SQL] Unify random functions and make Uuid Shuffle support seed in SQL [SPARK-33857][SQL] Unify the default seed of random functions Dec 22, 2020
@SparkQA
Copy link

SparkQA commented Dec 22, 2020

Test build #133203 has finished for PR 30864 at commit 786b9d7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class DefaultSeed() extends Expression with Unevaluable
  • case class Rand(child: Expression, hideSeed: Boolean = false) extends RDG
  • case class Randn(child: Expression, hideSeed: Boolean = false) extends RDG

@SparkQA
Copy link

SparkQA commented Dec 22, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37807/

/**
* A place holder expression used in random functions, will be replaced after analyze.
*/
case class DefaultSeed() extends Expression with Unevaluable {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

case object UnresolvedSeed extends Expression with Unevaluable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it's better.

@SparkQA
Copy link

SparkQA commented Dec 22, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37807/

@SparkQA
Copy link

SparkQA commented Dec 22, 2020

Test build #133209 has finished for PR 30864 at commit d66af0d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 22, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37822/

@SparkQA
Copy link

SparkQA commented Dec 22, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37822/

@SparkQA
Copy link

SparkQA commented Dec 22, 2020

Test build #133225 has finished for PR 30864 at commit d8237c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 23, 2020

Test build #133266 has finished for PR 30864 at commit 8d39f5a.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 23, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37865/

@SparkQA
Copy link

SparkQA commented Dec 23, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37865/

@SparkQA
Copy link

SparkQA commented Dec 23, 2020

Test build #133271 has finished for PR 30864 at commit edf0b03.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ulysses-you
Copy link
Contributor Author

I think it's ok to review this. cc @maropu @dongjoon-hyun @cloud-fan


def this(child: Expression) = this(child, false)

override def withNewSeed(seed: Long): Rand = Rand(Literal(seed, LongType))
override def withNewSeed(seed: Long): Rand = Rand(Literal(seed, LongType), hideSeed)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ur, is this related to this PR, @ulysses-you ? This looks like an orthogonal bug fix a little to me. I'm wondering if we we have a test coverage for this.

Copy link
Contributor Author

@ulysses-you ulysses-you Dec 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dongjoon-hyun thanks for the carefully check. Yes, it could be thought as bug fix. See the jenkins test result before this commit a3c9f0b.

And I think for early branch, we just need to backport this place and leave this PR for master. But also, I'm OK to send a PR to fix this first, then go back to this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an orthogonal bug fix, but it only affects the sql method. I don't have a strong opinion about backporting the fix or not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Thanks, @cloud-fan .


def this(child: Expression) = this(child, false)

override def withNewSeed(seed: Long): Randn = Randn(Literal(seed, LongType))
override def withNewSeed(seed: Long): Randn = Randn(Literal(seed, LongType), hideSeed)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, this PR looks good, but I have a question on the change because SPARK-33857 is an improvement PR which lands only at master branch.

If we are correcting some fix in this PR, we should split this PR by filing a new JIRA for that and backport to the release branches.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Merged to master for Apache Spark 3.2.0.
Thank you, @ulysses-you and all.

Merry Christmas and Happy New Year!

@maropu
Copy link
Member

maropu commented Dec 25, 2020

late lgtm.

@ulysses-you
Copy link
Contributor Author

thank you all ! Merry Christmas !

jdcasale pushed a commit to palantir/spark that referenced this pull request Jun 22, 2021
### What changes were proposed in this pull request?

Unify the seed of random functions
1. Add a hold place expression `UnresolvedSeed ` as the defualt seed.
2. Change `Rand`,`Randn`,`Uuid`,`Shuffle` default seed to `UnresolvedSeed `.
3. Replace `UnresolvedSeed ` to real seed at `ResolveRandomSeed` rule.

### Why are the changes needed?

`Uuid` and `Shuffle` use the `ResolveRandomSeed` rule to set the seed if user doesn't give a seed value. `Rand` and `Randn` do this at constructing.

It's better to unify the default seed at Analyzer side since we have used `ExpressionWithRandomSeed` at streaming query.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass exists test and add test.

Closes apache#30864 from ulysses-you/SPARK-33857.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants