-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-33466][ML][PYTHON] Imputer support mode(most_frequent) strategy #30397
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Test build #131214 has finished for PR 30397 at commit
|
Kubernetes integration test starting |
friendly ping @huaxingao @srowen |
Kubernetes integration test status success |
Kubernetes integration test starting |
Test build #131216 has finished for PR 30397 at commit
|
Kubernetes integration test status success |
Kubernetes integration test starting |
Test build #131220 has finished for PR 30397 at commit
|
Kubernetes integration test status failure |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems pretty fine to me if tests pass
Iterator.range(0, numCols).flatMap { i => | ||
// Ignore null. | ||
// negative value to apply the default ranking of [Long, Double] | ||
if (row.isNullAt(i)) Iterator.empty else Iterator.single((i, -row.getDouble(i))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: is None / Some simpler here in the flatMap?
Kubernetes integration test starting |
Test build #131241 has finished for PR 30397 at commit
|
Kubernetes integration test status success |
Merged to master |
val modes = dataset.select(cols: _*).flatMap { row => | ||
// Ignore null. | ||
Iterator.range(0, numCols) | ||
.flatMap(i => if (row.isNullAt(i)) None else Some((i, row.getDouble(i)))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Long overdue question - this means this doesn't work on 'categorical' vars right? they have to be numbers. But then again, so does everything in a Spark feature vector - Strings are indexed to numbers, etc. Then it would work, it would compute the mode's index correctly as a number.
Just trying to decide whether the docs that say categorical vars are unsupported are accurate or not then.
What changes were proposed in this pull request?
impl a new strategy
mode
: replace missing using the most frequent value along each column.Why are the changes needed?
it is highly scalable, and had been a function in sklearn.impute.SimpleImputer for a long time.
Does this PR introduce any user-facing change?
Yes, a new strategy is added
How was this patch tested?
updated testsuites