-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-22797][PySpark] Bucketizer support multi-column #19892
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Test build #84478 has finished for PR 19892 at commit
|
Test build #84479 has finished for PR 19892 at commit
|
Test build #84480 has finished for PR 19892 at commit
|
This PR is currently blocked by #19894 (comment) |
retest this please |
Test build #84898 has finished for PR 19892 at commit
|
retest this please |
Test build #84900 has finished for PR 19892 at commit
|
ping @holdenk , can you help reviewing this? |
python/pyspark/ml/feature.py
Outdated
class Bucketizer(JavaTransformer, HasInputCol, HasOutputCol, HasHandleInvalid, | ||
JavaMLReadable, JavaMLWritable): | ||
class Bucketizer(JavaTransformer, HasInputCol, HasOutputCol, HasInputCols, HasOutputCols, | ||
HasHandleInvalid, JavaMLReadable, JavaMLWritable): | ||
""" | ||
Maps a column of continuous features to a column of feature buckets. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update this doc too?
python/pyspark/ml/param/__init__.py
Outdated
if all(map(lambda v: TypeConverters._can_convert_to_list(v), value)): | ||
ll = [] | ||
for v in value: | ||
ll.append([float(i) for i in TypeConverters.toList(v)]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
toListFloat
requires each list entry is a numeric (TypeConverters._is_numeric
). Should toListListFloat
have such requirement?
This needs an individual JIRA. @MLnick created SPARK-22797 for this. Please use it. |
@viirya Thanks a lot for reviewing this! I will update the title to use the new ticket. |
5efc94e
to
e1fb379
Compare
Test build #85087 has finished for PR 19892 at commit
|
python/pyspark/ml/feature.py
Outdated
Maps a column of continuous features to a column of feature buckets. Since 2.3.0, | ||
:py:class:`Bucketizer` can map multiple columns at once by setting the :py:attr:`inputCols` | ||
parameter. Note that when both the :py:attr:`inputCol` and :py:attr:`inputCols` parameters | ||
are set, a log warning will be printed and only :py:attr:`inputCol` will take effect, while |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: there is a work to change this behavior to throw an exception, instead of a log warning. We should change this document later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@holdenk @zhengruifeng this comment will need to be changed as per #19993 - but that has not been merged yet. I think #19993 will block 2.3 though, so we could preemptively change the doc here to match the Scala side in #19993 about throwing an exception.
One minor comment, otherwise LGTM. |
ping @MLnick ? |
e1fb379
to
4b2fc6f
Compare
Test build #86009 has finished for PR 19892 at commit
|
python/pyspark/ml/feature.py
Outdated
>>> bucketed2 = bucketizer2.setHandleInvalid("keep").transform(df).collect() | ||
>>> len(bucketed2) | ||
6 | ||
>>> bucketed2[0].buckets1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps it would be cleaner to do a df.show()
here? Likewise above for bucketed
we could change that part of the doctest too.
python/pyspark/ml/feature.py
Outdated
|
||
>>> values = [(0.1, 0.0), (0.4, 1.0), (1.2, 1.3), (1.5, float("nan")), | ||
... (float("nan"), 1.0), (float("nan"), 0.0)] | ||
>>> df = spark.createDataFrame(values, ["values", "numbers"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
values1
& values2
?
@@ -134,6 +134,16 @@ def toListFloat(value): | |||
return [float(v) for v in value] | |||
raise TypeError("Could not convert %s to list of floats" % value) | |||
|
|||
@staticmethod | |||
def toListListFloat(value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a test case in ParamTypeConversionTests
for this new method; see test_list_float
for reference.
4b2fc6f
to
e869e75
Compare
Test build #86168 has finished for PR 19892 at commit
|
Test build #86169 has finished for PR 19892 at commit
|
@MLnick Thanks for your reviewing and suggestions. I have updated this PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One minor comment on the updated doctest.
I don't think this will make it into 2.3
given the code freeze and branch has been cut already. In which case we will need to change the @since
tags.
Pending the error throwing PR for Scala Bucketizer
, we can update the doc here.
python/pyspark/ml/feature.py
Outdated
>>> bucketed[3].buckets | ||
2.0 | ||
... inputCol="values1", outputCol="buckets") | ||
>>> bucketed = bucketizer.setHandleInvalid("keep").transform(df) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may actually be neater to show only values1
and bucketed
- so perhaps .transform(df.select('values1'))
?
Test build #86170 has finished for PR 19892 at commit
|
I mean I think it might have a chance, generally speaking we've allowed outstanding PRs to be merged after the freeze. Since there are outstanding blockers on the branch preventing us from cutting RC2 maybe its ok to move forward if we can do it quickly? Of course I defer to MLNick :) |
I’m generally ok with these small python api wrapper additions getting
merged as long as the risk of breaking anything is low - and here it is
since it’s just api parity
…On Fri, 19 Jan 2018 at 06:08, Holden Karau ***@***.***> wrote:
I mean I think it might have a chance, generally speaking we've allowed
outstanding PRs to be merged after the freeze. Since there are outstanding
blockers on the branch preventing us from cutting RC2 maybe its ok to move
forward if we can do it quickly? Of course I defer to MLNick :)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#19892 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AA_SBy2cr5MUJ9rN7egqwHf9GCLH0tCiks5tMBVSgaJpZM4Q2CRd>
.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM and I'll merge today (Australia time) if there are no objections. (note: this means also waiting on @MLnick switching from -1 to approve).
If it is going to get merged to |
Test build #86464 has finished for PR 19892 at commit
|
@MLnick you ok with this then? |
@holdenk everything except my comment in #19892 (comment). I'd propose to just preemptively update the doc about an exception being thrown. |
Test build #86519 has finished for PR 19892 at commit
|
RC2 has been cut - @jkbradley do you see #19993 as a blocker? I think it should be merged for |
Merged to master / branch-2.3. Thanks! |
## What changes were proposed in this pull request? Bucketizer support multi-column in the python side ## How was this patch tested? existing tests and added tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #19892 from zhengruifeng/20542_py. (cherry picked from commit c22eaa9) Signed-off-by: Nick Pentreath <nickp@za.ibm.com>
I reverted this (see #20410 for details) - we can re-open it once that issue is solved. |
Can this be re-open now? |
Has there been any update on re-opening this? And also adding multiple column support to QuantileDiscretizer? |
I am sorry, I am afread I can not re-open this PR beacuse I deleted it by mistake. |
What changes were proposed in this pull request?
Bucketizer support multi-column in the python side
How was this patch tested?
existing tests and added tests