
SPARK-1216. Add a OneHotEncoder for handling categorical features [MLLIB] #304


Closed
wants to merge 3 commits

Conversation

@sryza
Contributor

sryza commented Apr 2, 2014

No description provided.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

*
* An example usage is:
*
* val categoricalFields = Array(0, 7, 21)
Contributor

you can wrap code around using

{{{

}}}

so they get properly formatted in scaladoc
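For reference, a sketch of how the example above might look once wrapped (the two calls after the array literal are illustrative, based on the categories/encode methods discussed further down):

/**
 * An example usage is:
 *
 * {{{
 * val categoricalFields = Array(0, 7, 21)
 * val categories = OneHotEncoder.categories(rdd, categoricalFields)
 * val encoded = OneHotEncoder.encode(rdd, categories)
 * }}}
 */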

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13694/

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@sryza
Contributor Author

sryza commented Apr 3, 2014

Thanks for the tip, Reynold. Updated patch fixes the comments.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13722/

@sryza
Contributor Author

sryza commented Apr 4, 2014

The test failures appear unrelated.

@pwendell
Contributor

pwendell commented Apr 4, 2014

Are you sure it's not related - just wondering because it's also in the ML code...

org.apache.spark.SparkException: Job aborted: Task 0.0:0 failed 1 times (most recent failure: Exception failure in TID 0 on host localhost: java.lang.ClassCastException: org.apache.spark.mllib.regression.LabeledPoint cannot be cast to [Ljava.lang.Object;)

@srowen
Member

srowen commented Apr 4, 2014

I glanced at it too and I don't think it's related:

MLUtilsSuite:
[info] - epsilon computation (1 millisecond)
[info] - fast squared distance (7 milliseconds)
[info] - compute stats *** FAILED *** (27 milliseconds)
...
[info] - loadLibSVMData *** FAILED *** (128 milliseconds)

This isn't in the new test, and I don't think just adding a new class would affect these at all, unless there's some really deep voodoo here. That said, I can't see why the builds around it didn't hit the same issue.

@mengxr
Contributor

mengxr commented Apr 5, 2014

The error message is

Job aborted: Task 0.0:0 failed 1 times (most recent failure: Exception failure in TID 0 on host localhost: java.lang.ClassCastException: org.apache.spark.mllib.regression.LabeledPoint cannot be cast to [Ljava.lang.Object;)

The tests run fine on my local machine. Could it be a JVM issue?

@mengxr
Contributor

mengxr commented Apr 5, 2014

Jenkins, retest this please.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13798/

@mengxr
Contributor

mengxr commented Apr 7, 2014

@sryza Could you put sc.stop() at the end of your test or use LocalSparkContext for your test suite? I believe that caused the problem.
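A minimal sketch of that fix, assuming the MLlib test helper trait (package path and suite name here are illustrative):

import org.scalatest.FunSuite

import org.apache.spark.mllib.util.LocalSparkContext

class OneHotEncoderSuite extends FunSuite with LocalSparkContext {
  // LocalSparkContext supplies the shared `sc` and stops it after the suite runs,
  // so no manual sc.stop() is needed here.
  test("one-hot encodes categorical fields") {
    val rdd = sc.parallelize(Seq(Array[Any]("a", 1.0), Array[Any]("b", 2.0)))
    // ... assertions on the encoded output ...
  }
}

Alternatively, the suite can create its own SparkContext and call sc.stop() in a finally/afterAll block, as suggested above.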

@sryza
Contributor Author

sryza commented Apr 7, 2014

Ahh, makes sense. Posted a revision that uses LocalSparkContext.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13850/


import org.apache.spark.rdd.RDD

import scala.collection.mutable.HashSet
Contributor

Put scala import before spark imports. Also, better import scala.collection.mutable and in the code use mutable.HashSet and mutable.Set.
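A small sketch of the suggested arrangement (the usage line is illustrative):

// scala imports come before the spark imports:
import scala.collection.mutable

import org.apache.spark.rdd.RDD

// Later in the code, refer to the collection types through the package prefix:
val seen: mutable.Set[Any] = mutable.HashSet.empty[Any]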

* that, for each field, describes the values that are present in the dataset. The structure
* is meant to be used as input to encode.
*/
def categories(rdd: RDD[Array[Any]], categoricalFields: Seq[Int]): Array[Map[Any, Int]] = {
Contributor

Could we use a generic type here? So if a user input Array[Double], it will return Array[Map[Double, Int]].
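A sketch of what that generic signature could look like (same method name as above; body omitted):

import org.apache.spark.rdd.RDD

// Returns, for each categorical field, a map from each value present to an index.
def categories[T](rdd: RDD[Array[T]], categoricalFields: Seq[Int]): Array[Map[T, Int]] = {
  ???  // e.g. an RDD[Array[Double]] input would yield Array[Map[Double, Int]]
}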

@mengxr
Contributor

mengxr commented Apr 9, 2014

@sryza I made one pass over the code. Besides the inline comments:

  1. The output of one-hot is always sparse; we should use a sparse vector instead of a dense one.
  2. This is part of feature transformation. Using Array to store features would result in reallocation of memory. We should spend more time on the data types.

@sryza
Contributor Author

sryza commented Apr 13, 2014

Thanks for taking a look @mengxr. Working on a patch that addresses the inline comments. On the broader points:

We should spend more time on the data types.

Agreed. It would probably make sense to have some way of accepting sparse input, maybe just Map[Int, T]?

Using Array to store features would result in reallocation of memory.

Do you mind elaborating on this a little more? How can we avoid the reallocation?

The output of one-hot is always sparse; we should use a sparse vector instead of a dense one.

While one-hot increases the sparsity, in many cases a dense representation is still more efficient. I'm not sure where the boundary lies, but, in the extreme case, a long dense vector with few categorical variables that take on only a few categories will still do better with a dense representation after the transformation. In my opinion, we should give the user control and default to only outputting sparse vectors if the input type is sparse. What do you think?
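To make the trade-off concrete, a rough illustration with made-up sizes, using the MLlib vector types:

import org.apache.spark.mllib.linalg.Vectors

// 99 numeric features plus one 3-category field one-hot encoded: 102 columns,
// nearly all nonzero, so the dense layout stays compact.
val mostlyDense = Vectors.dense(Array.fill(99)(1.0) ++ Array(0.0, 1.0, 0.0))

// Many high-cardinality categorical fields: mostly zeros, so storing only
// (index, value) pairs is where the sparse representation pays off.
val mostlySparse = Vectors.sparse(1000, Seq((3, 1.0), (512, 1.0)))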

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@sryza
Contributor Author

sryza commented Apr 14, 2014

Updated patch addresses inline comments. This is my first time using generics in Scala, so let me know if there are any conventions I'm missing.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14109/

@karlhigley

This looked useful, so I tried it out. It works as expected so long as the feature array is an Array[Any]. If the features are all categorical, then the feature array could be an Array[String]. In that case, encodeVec will instantiate an Array[String] for outVec, which results in an exception on line 125, because Array[String] won't accept an integer value.
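A hypothetical reduction of the failure mode described above (the helper name is made up): if the output array is created with the input's element type, writing the Int indicator values into it fails at runtime.

import scala.reflect.ClassTag

def makeOutVec[T: ClassTag](width: Int): Array[T] = new Array[T](width)

val outVec = makeOutVec[String](5)          // Array[String] when every feature is a String
// outVec.asInstanceOf[Array[Any]](0) = 1   // fails at runtime: a String array cannot hold an Int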

@mengxr
Contributor

mengxr commented Nov 7, 2014

@sryza With the proposed new API, the OneHotEncoder should be much easier to implement: it takes an integer/string/... column as input and outputs a sparse vector column. Are you interested in porting this to use the new API?
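For context, a rough sketch against the spark.ml Pipeline API that eventually shipped for this (SPARK-5888; exact method names vary across Spark releases, and the column names are made up):

import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// Map the string column to category indices, then encode those indices
// into a sparse indicator vector column.
val indexer = new StringIndexer()
  .setInputCol("color")
  .setOutputCol("colorIndex")
val encoder = new OneHotEncoder()
  .setInputCol("colorIndex")
  .setOutputCol("colorVec")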

@sryza
Contributor Author

sryza commented Nov 7, 2014

Definitely. Have been waiting for the Pipelines and Parameters PR to go in.

@srowen
Member

srowen commented Feb 26, 2015

SPARK-5888 tracks the same addition but for the new API. Let's close this one and reopen a PR against that for the new impl.

@sryza sryza closed this Mar 10, 2015
arjunshroff pushed a commit to arjunshroff/spark that referenced this pull request Nov 24, 2020