[SPARK-4081] [mllib] VectorIndexer #3000
Conversation
Test build #22463 has finished for PR 3000 at commit
*
* This helps process a dataset of unknown vectors into a dataset with some continuous features
* and some categorical features. The choice between continuous and categorical is based upon
* a maxCategories parameter.
maxCategories as a threshold is a good default. In the future, we may want to add different criteria for some features. Thoughts? Moreover, should we have a reasonable default value?
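For concreteness, the thresholding being discussed could be sketched like this in plain Python (a toy illustration of the heuristic, not the Scala code under review; the function name and shape are hypothetical):

```python
# Toy sketch of the maxCategories heuristic: a feature is treated as
# categorical iff its number of distinct values is at most maxCategories,
# otherwise it is left as continuous.

def choose_feature_types(columns, max_categories=32):
    """columns: one list of observed values per feature."""
    types = []
    for values in columns:
        distinct = set()
        categorical = True
        for v in values:
            distinct.add(v)
            if len(distinct) > max_categories:
                categorical = False
                break  # no need to keep counting past the threshold
        types.append("categorical" if categorical else "continuous")
    return types
```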
I agree about adding more criteria and options later on.
For a default value, does 16 seem reasonable?
Sure. I was going to suggest 32, but 16 is reasonable as well. I think R only supports up to 32 for categorical features, so that was my motivation.
OK, 32 sounds good
How about the transformation for labels? This will help with transformations for classification, especially from +1/-1 to 0/1 labeling for binary classification.
Good point; I intended this to be used for labels too, so I'll add fit() and transform() methods which take RDD[Double]. Perhaps I should relabel "features" to "columns." I'd imagine someone either using 2 indexers (1 for labels and 1 for features), or zipping the labels and features into 1 vector and then using 1 indexer. We could also add other fit() and transform() methods later on to prevent users from having to do the zipping manually.
Agree.
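The zipping workaround described above could look roughly like this (a plain-Python analogue with hypothetical helper names; the real code would operate on RDDs of vectors):

```python
# Prepend the label to each feature row so one indexer pass can cover
# both labels and features, then split them back apart afterwards.

def zip_label_and_features(labels, feature_rows):
    return [[lbl] + list(feats) for lbl, feats in zip(labels, feature_rows)]

def unzip_label_and_features(rows):
    labels = [row[0] for row in rows]
    features = [row[1:] for row in rows]
    return labels, features
```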
*
* @param data Dataset with equal-length vectors.
*             NOTE: A single instance of [[DatasetIndexer]] must always be given vectors of
*             the same length. If given non-matching vectors, this method will throw an error.
Minor: extra space.
@manishamde Thanks for the feedback! I realized I can't really include a fit(RDD[Double]) method since it conflicts with fit(RDD[Vector]). This is because erasure strips away the Double/Vector to get type fit(RDD[_]). I instead included a note about mapping to Vector. I believe the PR is ready. I removed the non-implemented parameter for unrecognized categories, to be added later.
Cool. I will make another pass shortly.
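The note about mapping labels to vectors amounts to wrapping each double in a length-1 vector so the single fit(RDD[Vector]) overload can be reused. A plain-Python analogue (hypothetical names, not the Scala API):

```python
def doubles_to_vectors(labels):
    # Wrap each scalar label in a length-1 "vector" so a vector-only
    # fit() method can be reused for a label column.
    return [[x] for x in labels]

def index_labels(labels):
    # Hypothetical relabeling: distinct values, in sorted order, map to
    # indices 0..k-1 (so -1.0/+1.0 becomes 0/1 for binary classification).
    mapping = {v: i for i, v in enumerate(sorted(set(labels)))}
    return [mapping[v] for v in labels]
```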
Test build #22642 has finished for PR 3000 at commit
LGTM.
Test build #507 has finished for PR 3000 at commit
Test build #22787 has finished for PR 3000 at commit
Test build #22890 has finished for PR 3000 at commit
Test build #22901 has finished for PR 3000 at commit
Just noticed this. I'd been working on something similar a little while ago on SPARK-1216 / #304. One difference is that I had aimed to accept categorical features that are strings, as input data commonly comes this way. Do you think that functionality should come here or in a separate PR?
@sryza Hi, yes, I didn't realize that they shared some functionality. It would be great to coordinate. I think these 2 types of feature transformations are pretty different, but there is some shared underlying functionality.
We both need to know how many distinct values there are in a column, with some extra options. (You need to specify a subset of columns, and I need to limit the number of distinct values at some point.) Perhaps we could combine these into some sort of stats collector (maybe private[mllib] for now?) which we can both use. I'd be happy to do that, or let me know if you'd like to.
@jkbradley sorry for the delay in responding here. Your breakdown of operations makes sense to me. A stats collector seems like a good idea. I also wonder if there's some way to hook it in with Hive table statistics so we can avoid a pass over the data, but maybe that should be saved for future work. If you aren't planning to get to this in the near future, but think you'll have bandwidth to review, I'd be happy to work on it. Otherwise, I'm happy to look over whatever you put up.
@sryza Thanks for offering! That would be great if you have the bandwidth to work on this. I'd be happy to help review. One comment: It would be nice to be able to take advantage of FeatureAttributes in the spark.ml package, but that's a WIP right now: [https://github.com//pull/4460]
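The shared "distinct values per column" statistic discussed here could be shaped like an aggregate: a per-partition collect plus a merge step. Below is a plain-Python sketch under assumed names (the cap is a made-up guard for illustration, not something from either PR):

```python
CAP = 10000  # hypothetical bound on how many distinct values to track

def collect_partition(rows):
    """rows: iterable of equal-length tuples; returns one set per column."""
    sets = None
    for row in rows:
        if sets is None:
            sets = [set() for _ in row]
        for s, v in zip(sets, row):
            if len(s) <= CAP:
                s.add(v)
    return sets or []

def merge(a, b):
    # Combine per-partition results column by column.
    return [x | y for x, y in zip(a, b)]
```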
Test build #29694 has finished for PR 3000 at commit
I think it's ready now. I'll add a quick Java unit test soon to make sure getters/setters work correctly.
I feel like the code could be a bit shorter; I'll think about that more tomorrow and whether we can make working with DataFrames and metadata easier in general. |
Test build #30002 has finished for PR 3000 at commit
Test build #30003 has finished for PR 3000 at commit
@jkbradley I have a question about the expected behavior. Say I have a vector column containing 2 features. One is categorical: 0, 1, 2, 3, 4, 5 and the other is continuous but only takes values from 1.0, 2.0, 4.0. Then if I set …
@mengxr Yes, that's what we'd expect. Eventually, we'd want to be able to specify which features to index, either (a) via another parameter specifying specific features to index or (b) via metadata, where we do not index features which already have metadata. |
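As a plain-Python illustration of the expected behavior (hypothetical helpers, and the ordering convention here — sorted by value — may differ from the real implementation): with maxCategories large enough, each distinct value of an indexed feature maps to a 0-based category index.

```python
def build_category_map(values):
    # Assign indices by sorted value order (one possible convention).
    return {v: i for i, v in enumerate(sorted(set(values)))}

def index_feature(values, mapping):
    # Replace each raw value with its category index.
    return [mapping[v] for v in values]
```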
      require(numFeatures == actualNumFeatures, "VectorIndexerModel expected vector of length" +
        s" $numFeatures but found length $actualNumFeatures")
    }
    dataset.withColumn(map(outputCol), newCol.as(map(outputCol), newField.metadata))
It'd be nice for withColumn to take metadata. I'll make a JIRA for that.
I did some minor cleanups, but I don't see any great places to remove code. I added a Java test suite.
Test build #30078 has finished for PR 3000 at commit
Test build #30082 has finished for PR 3000 at commit
LGTM. Merged into master. Thanks, and sorry for the long delay!
Ready for review!
Since the original PR, I moved the code to the spark.ml API and renamed this to VectorIndexer.
This introduces a VectorIndexer class which does the following:
Design notes:
Reviewers: Please check my use of metadata and my unit tests for it; I'm not sure if I covered everything in the tests.
Other notes:
CC: @mengxr