Skip to content

[SPARK-22797][PySpark] Bucketizer support multi-column #19892

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 11 commits into from

Conversation

zhengruifeng
Copy link
Contributor

What changes were proposed in this pull request?

Bucketizer support multi-column in the python side

How was this patch tested?

existing tests and added tests

@SparkQA
Copy link

SparkQA commented Dec 5, 2017

Test build #84478 has finished for PR 19892 at commit 5ed91fd.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 5, 2017

Test build #84479 has finished for PR 19892 at commit 906a81d.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 5, 2017

Test build #84480 has finished for PR 19892 at commit 5efc94e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Copy link
Contributor Author

This PR is currently blocked by #19894 (comment)

@zhengruifeng
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Dec 14, 2017

Test build #84898 has finished for PR 19892 at commit 5efc94e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Dec 14, 2017

Test build #84900 has finished for PR 19892 at commit 5efc94e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Copy link
Contributor Author

ping @holdenk , can you help reviewing this?

class Bucketizer(JavaTransformer, HasInputCol, HasOutputCol, HasHandleInvalid,
JavaMLReadable, JavaMLWritable):
class Bucketizer(JavaTransformer, HasInputCol, HasOutputCol, HasInputCols, HasOutputCols,
HasHandleInvalid, JavaMLReadable, JavaMLWritable):
"""
Maps a column of continuous features to a column of feature buckets.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update this doc too?

if all(map(lambda v: TypeConverters._can_convert_to_list(v), value)):
ll = []
for v in value:
ll.append([float(i) for i in TypeConverters.toList(v)])
Copy link
Member

@viirya viirya Dec 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

toListFloat requires each list entry is a numeric (TypeConverters._is_numeric). Should toListListFloat have such requirement?

@viirya
Copy link
Member

viirya commented Dec 18, 2017

This needs an individual JIRA. @MLnick created SPARK-22797 for this. Please use it.

@zhengruifeng
Copy link
Contributor Author

@viirya Thanks a lot for reviewing this! I will update the title to use the new ticket.

@zhengruifeng zhengruifeng changed the title [SPARK-20542][FollowUp][PySpark] Bucketizer support multi-column [SPARK-22797][PySpark] Bucketizer support multi-column Dec 19, 2017
@SparkQA
Copy link

SparkQA commented Dec 19, 2017

Test build #85087 has finished for PR 19892 at commit e1fb379.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Maps a column of continuous features to a column of feature buckets. Since 2.3.0,
:py:class:`Bucketizer` can map multiple columns at once by setting the :py:attr:`inputCols`
parameter. Note that when both the :py:attr:`inputCol` and :py:attr:`inputCols` parameters
are set, a log warning will be printed and only :py:attr:`inputCol` will take effect, while
Copy link
Member

@viirya viirya Dec 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: there is a work to change this behavior to throw an exception, instead of a log warning. We should change this document later.

Copy link
Contributor

@MLnick MLnick Jan 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@holdenk @zhengruifeng this comment will need to be changed as per #19993 - but that has not been merged yet. I think #19993 will block 2.3 though, so we could preemptively change the doc here to match the Scala side in #19993 about throwing an exception.

@viirya
Copy link
Member

viirya commented Dec 19, 2017

One minor comment, otherwise LGTM.

@zhengruifeng
Copy link
Contributor Author

ping @MLnick ?

@SparkQA
Copy link

SparkQA commented Jan 12, 2018

Test build #86009 has finished for PR 19892 at commit 4b2fc6f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

>>> bucketed2 = bucketizer2.setHandleInvalid("keep").transform(df).collect()
>>> len(bucketed2)
6
>>> bucketed2[0].buckets1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps it would be cleaner to do a df.show() here? Likewise above for bucketed we could change that part of the doctest too.


>>> values = [(0.1, 0.0), (0.4, 1.0), (1.2, 1.3), (1.5, float("nan")),
... (float("nan"), 1.0), (float("nan"), 0.0)]
>>> df = spark.createDataFrame(values, ["values", "numbers"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

values1 & values2?

@@ -134,6 +134,16 @@ def toListFloat(value):
return [float(v) for v in value]
raise TypeError("Could not convert %s to list of floats" % value)

@staticmethod
def toListListFloat(value):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a test case in ParamTypeConversionTests for this new method; see test_list_float for reference.

@SparkQA
Copy link

SparkQA commented Jan 16, 2018

Test build #86168 has finished for PR 19892 at commit e869e75.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 16, 2018

Test build #86169 has finished for PR 19892 at commit 9f20f5c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Copy link
Contributor Author

@MLnick Thanks for your reviewing and suggestions. I have updated this PR

Copy link
Contributor

@MLnick MLnick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor comment on the updated doctest.

I don't think this will make it into 2.3 given the code freeze and branch has been cut already. In which case we will need to change the @since tags.

Pending the error throwing PR for Scala Bucketizer, we can update the doc here.

>>> bucketed[3].buckets
2.0
... inputCol="values1", outputCol="buckets")
>>> bucketed = bucketizer.setHandleInvalid("keep").transform(df)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may actually be neater to show only values1 and bucketed - so perhaps .transform(df.select('values1'))?

@SparkQA
Copy link

SparkQA commented Jan 16, 2018

Test build #86170 has finished for PR 19892 at commit 734db50.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Copy link
Contributor

holdenk commented Jan 19, 2018

I mean I think it might have a chance, generally speaking we've allowed outstanding PRs to be merged after the freeze. Since there are outstanding blockers on the branch preventing us from cutting RC2 maybe its ok to move forward if we can do it quickly? Of course I defer to MLNick :)

@MLnick
Copy link
Contributor

MLnick commented Jan 19, 2018 via email

Copy link
Contributor

@holdenk holdenk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM and I'll merge today (Australia time) if there are no objections. (note: this means also waiting on @MLnick switching from -1 to approve).

@MLnick
Copy link
Contributor

MLnick commented Jan 22, 2018

If it is going to get merged to branch-2.3 the since tags need to be 2.3.0 again

@SparkQA
Copy link

SparkQA commented Jan 22, 2018

Test build #86464 has finished for PR 19892 at commit 014fb08.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Copy link
Contributor

holdenk commented Jan 22, 2018

@MLnick you ok with this then?

@MLnick
Copy link
Contributor

MLnick commented Jan 22, 2018

@holdenk everything except my comment in #19892 (comment). I'd propose to just preemptively update the doc about an exception being thrown.

@SparkQA
Copy link

SparkQA commented Jan 23, 2018

Test build #86519 has finished for PR 19892 at commit ad5d81d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Copy link
Contributor

MLnick commented Jan 23, 2018

RC2 has been cut - @jkbradley do you see #19993 as a blocker? I think it should be merged for 2.3. And also there are QA JIRAs (sub-tasks of SPARK-23105) that are blockers that are not reflected in the list of blockers for 2.3 as they are not targeted.

@MLnick
Copy link
Contributor

MLnick commented Jan 26, 2018

Merged to master / branch-2.3. Thanks!

asfgit pushed a commit that referenced this pull request Jan 26, 2018
## What changes were proposed in this pull request?
Bucketizer support multi-column in the python side

## How was this patch tested?
existing tests and added tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #19892 from zhengruifeng/20542_py.

(cherry picked from commit c22eaa9)
Signed-off-by: Nick Pentreath <nickp@za.ibm.com>
@asfgit asfgit closed this in c22eaa9 Jan 26, 2018
@MLnick
Copy link
Contributor

MLnick commented Jan 26, 2018

I reverted this (see #20410 for details) - we can re-open it once that issue is solved.

@viirya
Copy link
Member

viirya commented Apr 19, 2018

Can this be re-open now?

@sherrysk8r
Copy link

Has there been any update on re-opening this? And also adding multiple column support to QuantileDiscretizer?

@zhengruifeng
Copy link
Contributor Author

I am sorry, I am afread I can not re-open this PR beacuse I deleted it by mistake.
I just create another PR #25801 to continue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants