
SPARK-1216. Add a OneHotEncoder for handling categorical features [MLLIB] #304


Closed
wants to merge 3 commits

Conversation

@sryza
Contributor

sryza commented Apr 2, 2014

No description provided.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

*
* An example usage is:
*
* val categoricalFields = Array(0, 7, 21)
Contributor

you can wrap code around using

{{{

}}}

so they get properly formatted in scaladoc
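For reference, a sketch of how the example above might look once wrapped (the two calls after the array literal are illustrative, based on the categories/encode methods discussed further down):

/**
 * An example usage is:
 *
 * {{{
 * val categoricalFields = Array(0, 7, 21)
 * val categories = OneHotEncoder.categories(rdd, categoricalFields)
 * val encoded = OneHotEncoder.encode(rdd, categories)
 * }}}
 */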

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13694/

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@sryza
Contributor Author

sryza commented Apr 3, 2014

Thanks for the tip, Reynold. Updated patch fixes the comments.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13722/

@sryza
Contributor Author

sryza commented Apr 4, 2014

The test failures appear unrelated.

@pwendell
Contributor

pwendell commented Apr 4, 2014

Are you sure it's not related - just wondering because it's also in the ML code...

org.apache.spark.SparkException: Job aborted: Task 0.0:0 failed 1 times (most recent failure: Exception failure in TID 0 on host localhost: java.lang.ClassCastException: org.apache.spark.mllib.regression.LabeledPoint cannot be cast to [Ljava.lang.Object;)

@srowen
Member

srowen commented Apr 4, 2014

I glanced at it too and I don't think it's related:

MLUtilsSuite:
[info] - epsilon computation (1 millisecond)
[info] - fast squared distance (7 milliseconds)
[info] - compute stats *** FAILED *** (27 milliseconds)
...
[info] - loadLibSVMData *** FAILED *** (128 milliseconds)

This isn't in the new test, and I don't think just adding a new class would affect these at all, unless there's some really deep voodoo here. That said, I can't see why the builds around it didn't hit the same issue.

@mengxr
Contributor

mengxr commented Apr 5, 2014

The error message is

Job aborted: Task 0.0:0 failed 1 times (most recent failure: Exception failure in TID 0 on host localhost: java.lang.ClassCastException: org.apache.spark.mllib.regression.LabeledPoint cannot be cast to [Ljava.lang.Object;)

The tests run fine on my local machine. Could it be a JVM issue?

@mengxr
Contributor

mengxr commented Apr 5, 2014

Jenkins, retest this please.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13798/

@mengxr
Contributor

mengxr commented Apr 7, 2014

@sryza Could you put sc.stop() at the end of your test or use LocalSparkContext for your test suite? I believe that caused the problem.
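A minimal sketch of that fix, assuming the MLlib test helper trait (package path and suite name here are illustrative):

import org.scalatest.FunSuite

import org.apache.spark.mllib.util.LocalSparkContext

class OneHotEncoderSuite extends FunSuite with LocalSparkContext {
  // LocalSparkContext supplies the shared `sc` and stops it after the suite runs,
  // so no manual sc.stop() is needed here.
  test("one-hot encodes categorical fields") {
    val rdd = sc.parallelize(Seq(Array[Any]("a", 1.0), Array[Any]("b", 2.0)))
    // ... assertions on the encoded output ...
  }
}

Alternatively, the suite can create its own SparkContext and call sc.stop() in a finally/afterAll block, as suggested above.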

@sryza
Contributor Author

sryza commented Apr 7, 2014

Ahh, makes sense. Posted a revision that uses LocalSparkContext.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13850/


import org.apache.spark.rdd.RDD

import scala.collection.mutable.HashSet
Contributor

Put scala import before spark imports. Also, better import scala.collection.mutable and in the code use mutable.HashSet and mutable.Set.
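A small sketch of the suggested arrangement (the usage line is illustrative):

// scala imports come before the spark imports:
import scala.collection.mutable

import org.apache.spark.rdd.RDD

// Later in the code, refer to the collection types through the package prefix:
val seen: mutable.Set[Any] = mutable.HashSet.empty[Any]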

* that, for each field, describes the values that are present in the dataset. The structure
* is meant to be used as input to encode.
*/
def categories(rdd: RDD[Array[Any]], categoricalFields: Seq[Int]): Array[Map[Any, Int]] = {
Contributor

Could we use a generic type here? So if a user input Array[Double], it will return Array[Map[Double, Int]].
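A sketch of what that generic signature could look like (same method name as above; body omitted):

import org.apache.spark.rdd.RDD

// Returns, for each categorical field, a map from each value present to an index.
def categories[T](rdd: RDD[Array[T]], categoricalFields: Seq[Int]): Array[Map[T, Int]] = {
  ???  // e.g. an RDD[Array[Double]] input would yield Array[Map[Double, Int]]
}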

@mengxr
Contributor

mengxr commented Apr 9, 2014

@sryza I made one pass over the code. Besides the inline comments:

  1. The output of one-hot is always sparse; we should use a sparse vector instead of a dense one.
  2. This is part of feature transformation. Using Array to store features would result in reallocation of memory. We should spend more time on the data types.

@sryza
Contributor Author

sryza commented Apr 13, 2014

Thanks for taking a look @mengxr. Working on a patch that addresses the inline comments. On the broader points:

We should spend more time on the data types.

Agreed. It would probably make sense to have some way of accepting sparse input, maybe just Map[Int, T]?

Using Array to store features would result in reallocation of memory.

Do you mind elaborating on this a little more? How can we avoid the reallocation?

The output of one-hot is always sparse; we should use a sparse vector instead of a dense one.

While one-hot increases the sparsity, in many cases a dense representation is still more efficient. I'm not sure where the boundary lies, but, in the extreme case, a long dense vector with few categorical variables that take on only a few categories will still do better with a dense representation after the transformation. In my opinion, we should give the user control and default to only outputting sparse vectors if the input type is sparse. What do you think?
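To make the trade-off concrete, a rough illustration with made-up sizes, using the MLlib vector types:

import org.apache.spark.mllib.linalg.Vectors

// 99 numeric features plus one 3-category field one-hot encoded: 102 columns,
// nearly all nonzero, so the dense layout stays compact.
val mostlyDense = Vectors.dense(Array.fill(99)(1.0) ++ Array(0.0, 1.0, 0.0))

// Many high-cardinality categorical fields: mostly zeros, so storing only
// (index, value) pairs is where the sparse representation pays off.
val mostlySparse = Vectors.sparse(1000, Seq((3, 1.0), (512, 1.0)))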

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@sryza
Contributor Author

sryza commented Apr 14, 2014

Updated patch addresses inline comments. This is my first time using generics in Scala, so let me know if there are any conventions I'm missing.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14109/

@karlhigley

This looked useful, so I tried it out. It works as expected so long as the feature array is an Array[Any]. If the features are all categorical, then the feature array could be an Array[String]. In that case, encodeVec will instantiate an Array[String] for outVec, which results in an exception on line 125, because Array[String] won't accept an integer value.
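A hypothetical reduction of the failure mode described above (the helper name is made up): if the output array is created with the input's element type, writing the Int indicator values into it fails at runtime.

import scala.reflect.ClassTag

def makeOutVec[T: ClassTag](width: Int): Array[T] = new Array[T](width)

val outVec = makeOutVec[String](5)          // Array[String] when every feature is a String
// outVec.asInstanceOf[Array[Any]](0) = 1   // fails at runtime: a String array cannot hold an Int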

@mengxr
Contributor

mengxr commented Nov 7, 2014

@sryza With the proposed new API, the OneHotEncoder should be much easier to implement: it takes an integer/string/... column as input and outputs a sparse vector column. Are you interested in porting this to use the new API?
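For context, a rough sketch against the spark.ml Pipeline API that eventually shipped for this (SPARK-5888; exact method names vary across Spark releases, and the column names are made up):

import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// Map the string column to category indices, then encode those indices
// into a sparse indicator vector column.
val indexer = new StringIndexer()
  .setInputCol("color")
  .setOutputCol("colorIndex")
val encoder = new OneHotEncoder()
  .setInputCol("colorIndex")
  .setOutputCol("colorVec")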

@sryza
Contributor Author

sryza commented Nov 7, 2014

Definitely. Have been waiting for the Pipelines and Parameters PR to go in.

@srowen
Member

srowen commented Feb 26, 2015

SPARK-5888 tracks the same addition but for the new API. Let's close this one and reopen a PR against that for the new impl.

@sryza sryza closed this Mar 10, 2015
arjunshroff pushed a commit to arjunshroff/spark that referenced this pull request Nov 24, 2020