Skip to content

[SPARK-22957] ApproxQuantile breaks if the number of rows exceeds MaxInt #20152

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

juliuszsompolski
Copy link
Contributor

What changes were proposed in this pull request?

32bit Int was used for row rank.
That overflowed in a dataframe with more than 2B rows.

How was this patch tested?

Added test, but ignored, as it takes 4 minutes.

@SparkQA
Copy link

SparkQA commented Jan 4, 2018

Test build #85679 has finished for PR 20152 at commit 324218b.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Jan 4, 2018

Seems reasonable; @clockfly what do you think?
The serialized form changed here, but I'm not sure we'd guarantee compatibility there. This isn't something serialized to files for long-term storage is it?

@juliuszsompolski
Copy link
Contributor Author

If the serialized form change is a problem, that part can probably be reverted - it's far less likely that a single compressed stats chunk will overflow Int.
The bug I hit was in the global rank counter part, and I changed the other part just by reviewing the code around for other places that could conceivably use a Long instead of Int.

@SparkQA
Copy link

SparkQA commented Jan 4, 2018

Test build #85680 has finished for PR 20152 at commit e32f254.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Stats(value: Double, g: Long, delta: Long)

@rxin
Copy link
Contributor

rxin commented Jan 4, 2018

cc @gatorsmile @cloud-fan

@SparkQA
Copy link

SparkQA commented Jan 4, 2018

Test build #85684 has finished for PR 20152 at commit 448dcce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

LGTM, merging to master/2.3!

asfgit pushed a commit that referenced this pull request Jan 5, 2018
## What changes were proposed in this pull request?

32bit Int was used for row rank.
That overflowed in a dataframe with more than 2B rows.

## How was this patch tested?

Added test, but ignored, as it takes 4 minutes.

Author: Juliusz Sompolski <julek@databricks.com>

Closes #20152 from juliuszsompolski/SPARK-22957.

(cherry picked from commit df7fc3e)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@asfgit asfgit closed this in df7fc3e Jan 5, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants