[SPARK-33466][ML][PYTHON] Imputer support mode(most_frequent) strategy #30397

Closed
wants to merge 4 commits into from

Conversation

zhengruifeng
Contributor

What changes were proposed in this pull request?

Implement a new strategy, `mode`: replace missing values with the most frequent value in each column.
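
As a rough illustration of what this strategy computes, here is a minimal plain-Python sketch. It is not the Spark implementation or API: the helper names `column_modes` and `impute_mode` are hypothetical, missing values are assumed to be represented as `None`, and ties are assumed to be broken toward the smaller value.

```python
from collections import Counter
from typing import Dict, List, Optional

Row = List[Optional[float]]


def column_modes(rows: List[Row]) -> Dict[int, float]:
    """Return the most frequent non-missing value for each column index."""
    counts: Dict[int, Counter] = {}
    for row in rows:
        for i, value in enumerate(row):
            if value is None:  # ignore missing entries
                continue
            counts.setdefault(i, Counter())[value] += 1
    # Assumption: among equally frequent values, pick the smallest one.
    return {
        i: min(counter.items(), key=lambda kv: (-kv[1], kv[0]))[0]
        for i, counter in counts.items()
    }


def impute_mode(rows: List[Row]) -> List[Row]:
    """Replace each missing entry with the mode of its column."""
    modes = column_modes(rows)
    # A column with no observed values has no mode and stays None.
    return [
        [modes.get(i) if value is None else value for i, value in enumerate(row)]
        for row in rows
    ]


rows = [[1.0, None], [1.0, 3.0], [None, 3.0], [2.0, 4.0]]
filled = impute_mode(rows)
```

Each column is handled independently, so the per-column counts can be aggregated in a single pass over the data.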

Why are the changes needed?

It is highly scalable, and has long been available in sklearn.impute.SimpleImputer as the `most_frequent` strategy.

Does this PR introduce any user-facing change?

Yes, a new strategy (`mode`) is added.

How was this patch tested?

Updated the existing test suites.

init

py

nit
@SparkQA

SparkQA commented Nov 17, 2020

Test build #131214 has finished for PR 30397 at commit 4626614.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 17, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35818/

@zhengruifeng
Contributor Author

friendly ping @huaxingao @srowen

@SparkQA

SparkQA commented Nov 17, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35818/

@SparkQA

SparkQA commented Nov 17, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35819/

@SparkQA

SparkQA commented Nov 17, 2020

Test build #131216 has finished for PR 30397 at commit 91ae454.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 17, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35819/

@SparkQA

SparkQA commented Nov 17, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35823/

@SparkQA

SparkQA commented Nov 17, 2020

Test build #131220 has finished for PR 30397 at commit e0605d6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 17, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35823/

Member

@srowen srowen left a comment

Seems pretty fine to me if tests pass

Iterator.range(0, numCols).flatMap { i =>
  // Ignore null.
  // negative value to apply the default ranking of [Long, Double]
  if (row.isNullAt(i)) Iterator.empty else Iterator.single((i, -row.getDouble(i)))
Member

Nit: is None / Some simpler here in the flatMap?
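
One reading of the "negative value" comment in the snippet under review: if candidates are ranked as `(count, -value)` pairs under the default tuple ordering, the maximum is the most frequent value, and ties in count resolve to the smallest original value. The plain-Python sketch below illustrates only that ordering trick; the function name `mode_with_tie_break` is hypothetical, `None` stands in for a missing value, and this is not claimed to be the code that was merged.

```python
from collections import Counter
from typing import Iterable, Optional


def mode_with_tie_break(values: Iterable[Optional[float]]) -> float:
    """Most frequent non-missing value; ties go to the smallest value."""
    counts = Counter(v for v in values if v is not None)
    # Tuples compare element by element: a higher count wins first, and
    # among equal counts the larger -value (the smaller value) wins.
    _, negated = max((count, -value) for value, count in counts.items())
    return -negated


result = mode_with_tie_break([3.0, 1.0, 3.0, 1.0, 2.0, None])
```

Here `1.0` and `3.0` both occur twice, and the negation makes `1.0` the winner. An input with no observed values raises `ValueError` from `max`, so a caller would need to handle that case separately.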

@SparkQA

SparkQA commented Nov 18, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35845/

@SparkQA

SparkQA commented Nov 18, 2020

Test build #131241 has finished for PR 30397 at commit 5875c65.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 18, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35845/

@srowen srowen closed this in 116b7b7 Nov 20, 2020
@srowen
Member

srowen commented Nov 20, 2020

Merged to master

@zhengruifeng
Contributor Author

thanks @srowen @zero323 for reviewing!

@zhengruifeng zhengruifeng deleted the imputer_max_freq branch November 23, 2020 01:21
val modes = dataset.select(cols: _*).flatMap { row =>
  // Ignore null.
  Iterator.range(0, numCols)
    .flatMap(i => if (row.isNullAt(i)) None else Some((i, row.getDouble(i))))
Member

@srowen srowen May 4, 2021

Long overdue question: this means it doesn't work on 'categorical' vars, right? They have to be numbers. But then again, so does everything in a Spark feature vector: strings are indexed to numbers, etc. In that case it would work; it would compute the mode's index correctly as a number.

Just trying to decide whether the docs that say categorical vars are unsupported are accurate or not then.
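
The point raised in this comment can be sketched in plain Python: once string categories are indexed to numbers, the mode of the indices identifies the most frequent category, which can be mapped back to its label. This is an illustration only, not Spark code; the helpers `index_labels` and `impute_categorical` are hypothetical, indices are assigned in first-seen order purely for simplicity, and `None` stands in for a missing value.

```python
from collections import Counter
from typing import Dict, List, Optional


def index_labels(labels: List[Optional[str]]) -> Dict[str, float]:
    """Assign each distinct label a numeric index (first-seen order)."""
    mapping: Dict[str, float] = {}
    for label in labels:
        if label is not None and label not in mapping:
            mapping[label] = float(len(mapping))
    return mapping


def impute_categorical(labels: List[Optional[str]]) -> List[str]:
    """Index labels, fill missing entries with the modal index, map back."""
    mapping = index_labels(labels)
    inverse = {index: label for label, index in mapping.items()}
    indexed = [None if label is None else mapping[label] for label in labels]
    counts = Counter(x for x in indexed if x is not None)
    # Most frequent index; ties go to the smaller index (an assumption).
    mode_index = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))[0]
    filled = [mode_index if x is None else x for x in indexed]
    return [inverse[x] for x in filled]


colors = impute_categorical(["red", "blue", None, "blue"])
```

The mode is well defined on the indices because it depends only on equality and counts, not on any numeric meaning of the index values.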


4 participants