-
Notifications
You must be signed in to change notification settings - Fork 28.6k
SPARK-1216. Add a OneHotEncoder for handling categorical features [MLLIB] #304
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Merged build triggered. |
Merged build started. |
* | ||
* An example usage is: | ||
* | ||
* val categoricalFields = Array(0, 7, 21) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can wrap code around using
{{{
}}}
so they get properly formatted in scaladoc
Merged build finished. |
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13694/ |
Merged build triggered. |
Merged build started. |
Thanks for the tip Reynold. Updated patch fixes the comments. |
Merged build finished. |
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13722/ |
The test failures appear unrelated. |
Are you sure it's not related - just wondering because it's also in the ML code...
|
I glanced at it too and I don't think it's related:
This isn't in the new test and I don't think just adding a new class would affect these at all, unless there's some really deep voodoo here. That said I can't see why the builds around it didn't hit the same issue. |
The error message is
The tests runs fine on my local machine. Could it be a JVM issue? |
Jenkins, retest this please. |
Merged build triggered. |
Merged build started. |
Merged build finished. |
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13798/ |
@sryza Could you put |
Ahh, makes sense. Posted a revision that uses LocalSparkContext. |
Merged build triggered. |
Merged build started. |
Merged build finished. |
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13850/ |
|
||
import org.apache.spark.rdd.RDD | ||
|
||
import scala.collection.mutable.HashSet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put scala import before spark imports. Also, better import scala.collection.mutable
and in the code use mutable.HashSet
and mutable.Set
.
* that, for each field, describes the values that are present for in the dataset. The structure | ||
* is meant to be used as input to encode. | ||
*/ | ||
def categories(rdd: RDD[Array[Any]], categoricalFields: Seq[Int]): Array[Map[Any, Int]] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we use a generic type here? So if a user input Array[Double]
, it will return Array[Map[Double, Int]]
.
@sryza I made one pass over the code. Besides the inline comments:
|
Thanks for taking a look @mengxr. Working on a patch that addresses the inline comments. On the broader points:
Agreed. It would probably make sense to have some way of accepting sparse input, maybe just Map[Int, T]?
Do you mind elaborating on this a little more? How can we avoid the reallocation?
While one-hot increases the sparsity, in many cases a dense representation is still more efficient. I'm not sure where the boundary lies, but, in the extreme case, a long dense vector with few categorical variables that take on only a few categories will still do better with a dense representation after the transformation. In my opinion, we should give the user control and default to only outputting sparse vectors if the input type is sparse. What do you think? |
Merged build triggered. |
Merged build started. |
Updated patch addresses inline comments. This is my first time using generics in Scala, so let me know if there are any conventions I'm missing. |
Merged build finished. All automated tests passed. |
All automated tests passed. |
This looked useful, so I tried it out. It works as expected so long as the feature array is an |
@sryza With the proposed new API, the OneHotEncoder should be much easier to implement, which takes an integer/string/... column as input and output a sparse vector column. Are you interested in porting this to use the new API? |
Definitely. Have been waiting for the Pipelines and Parameters PR goes in. |
SPARK-5888 tracks the same addition but for the new API. Let's close this one and reopen a PR against that for the new impl. |
No description provided.