[SPARK-4081] [mllib] VectorIndexer #3000


Closed · wants to merge 14 commits

Conversation

jkbradley (Member, Author)

Ready for review!

Since the original PR, I moved the code to the spark.ml API and renamed this to VectorIndexer.

This introduces a VectorIndexer class which does the following:

  • VectorIndexer.fit(): collect statistics about how many values each feature in a dataset (RDD[Vector]) can take (limited by maxCategories)
    • Features which exceed maxCategories are declared continuous, and the Model will treat them as such.
  • VectorIndexerModel.transform(): Convert categorical feature values to corresponding 0-based indices

Design notes:

  • This maintains sparsity in vectors by ensuring that categorical feature value 0.0 gets index 0.
  • This does not yet support transforming data with new (unknown) categorical feature values. That can be added later.
  • This is necessary for DecisionTree and tree ensembles.
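
A minimal usage sketch of the Estimator/Model flow described above (the DataFrame and column names are hypothetical; parameter values are just examples):

```scala
import org.apache.spark.ml.feature.VectorIndexer

// `data` is a hypothetical DataFrame with a vector column named "features".
val indexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(10)

// fit() scans the data to decide which features are categorical
// (at most maxCategories distinct values) vs. continuous.
val model = indexer.fit(data)

// transform() maps categorical values to 0-based indices; continuous features pass through.
val indexed = model.transform(data)
```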

Reviewers: Please check my use of metadata and my unit tests for it; I'm not sure if I covered everything in the tests.

Other notes:

  • This also adds a public toMetadata method to AttributeGroup (for simpler construction of metadata).
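
As a rough illustration of that addition (attribute names are hypothetical; the attribute classes come from the in-progress spark.ml attributes work referenced later in this thread):

```scala
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NominalAttribute, NumericAttribute}

// Build per-feature attributes for a vector column, then convert the group to
// column metadata with the new public toMetadata method.
val attrs: Array[Attribute] = Array(
  NominalAttribute.defaultAttr.withName("color").withValues("red", "green", "blue"),
  NumericAttribute.defaultAttr.withName("weight"))
val group = new AttributeGroup("features", attrs)
val metadata = group.toMetadata()
```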

CC: @mengxr

@SparkQA

SparkQA commented Oct 29, 2014

Test build #22463 has finished for PR 3000 at commit fc781bd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DatasetIndexer(

*
* This helps process a dataset of unknown vectors into a dataset with some continuous features
* and some categorical features. The choice between continuous and categorical is based upon
* a maxCategories parameter.
Contributor:

Using maxCategories as a threshold is a good default. In the future, we may want to add different criteria for some features. Thoughts? Also, should we have a reasonable default value?

jkbradley (Member, Author):

I agree about adding more criteria and options later on.
For a default value, does 16 seem reasonable?

Contributor:

Sure. I was going to suggest 32, but 16 is reasonable as well. I think R only supports up to 32 levels for categorical features, so that was my motivation.

jkbradley (Member, Author):

OK, 32 sounds good

@manishamde (Contributor)

What about transforming labels as well? That would help with classification, especially converting +1/-1 labels to 0/1 for binary classification.

@jkbradley (Member, Author)

Good point; I intended this to be used for labels too, so I'll add fit() and transform() methods which take RDD[Double]. Perhaps I should rename "features" to "columns." I'd imagine someone either using 2 indexers (1 for labels and 1 for features), or zipping the labels and features into 1 vector and then using 1 indexer (see the sketch below). We could also add other fit() and transform() methods later on so users don't have to do the zipping manually.
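
A minimal sketch of the zipping option mentioned above (the helper name is hypothetical; assumes labeled data as RDD[LabeledPoint]):

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Prepend the label to each feature vector so that a single indexer can
// index the label column and the feature columns together.
def zipLabelAndFeatures(data: RDD[LabeledPoint]): RDD[Vector] =
  data.map(p => Vectors.dense(p.label +: p.features.toArray))
```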

@manishamde (Contributor)

Agree.

*
* @param data Dataset with equal-length vectors.
* NOTE: A single instance of [[DatasetIndexer]] must always be given vectors of
* the same length. If given non-matching vectors, this method will throw an error.
Contributor:

Minor: extra space.

@jkbradley (Member, Author)

@manishamde Thanks for the feedback! I realized I can't really include a fit(RDD[Double]) method since it conflicts with fit(RDD[Vector]): type erasure strips away the Double/Vector element type, leaving both overloads with signature fit(RDD[_]). I instead included a note about mapping to Vector (see the sketch below).
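
A small sketch of the erasure issue and the workaround (names are illustrative, not the code in this PR):

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// After JVM type erasure both overloads have the signature fit(RDD[_]),
// so defining them together does not compile ("double definition"):
//   def fit(data: RDD[Vector]): Unit = ???
//   def fit(data: RDD[Double]): Unit = ???

// Workaround: wrap each label in a length-1 Vector before calling the indexer.
def labelsAsVectors(labels: RDD[Double]): RDD[Vector] =
  labels.map(label => Vectors.dense(label))
```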

I believe the PR is ready. I removed the non-implemented parameter for unrecognized categories, to be added later.

@manishamde (Contributor)

Cool. I will make another pass shortly.

@SparkQA

SparkQA commented Oct 31, 2014

Test build #22642 has finished for PR 3000 at commit 831aa92.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DatasetIndexer(val maxCategories: Int) extends Logging with Serializable

@manishamde (Contributor)

LGTM.

@SparkQA

SparkQA commented Nov 3, 2014

Test build #507 has finished for PR 3000 at commit 831aa92.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class DatasetIndexer(val maxCategories: Int) extends Logging with Serializable

@SparkQA

SparkQA commented Nov 3, 2014

Test build #22787 has finished for PR 3000 at commit ee495e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22890 has finished for PR 3000 at commit 0d947cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DatasetIndexer(val maxCategories: Int) extends Logging with Serializable
    • class RDDFunctions[T: ClassTag](self: RDD[T]) extends Serializable

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22901 has finished for PR 3000 at commit aed6bb3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DatasetIndexer(val maxCategories: Int) extends Logging with Serializable

@sryza (Contributor)

sryza commented Nov 11, 2014

Just noticed this. I'd been working on something similar a little while ago on SPARK-1216 / #304. One difference is that I had aimed to accept categorical features that are strings, as input data commonly comes this way. Do you think that functionality should come here or in a separate PR?

@jkbradley (Member, Author)

@sryza Hi, yes, I didn't realize that they shared some functionality. It would be great to coordinate. I think these 2 types of feature transformations are pretty different, but there is some shared underlying functionality.
Feature operations:

  • Decide which features should be categorical (this PR)
  • Relabel categorical feature values based on an index (this PR)
  • Create new features by expanding a categorical feature (your PR)
  • Count statistics about dataset columns (both PRs)

The first 3 operations seem fairly distinct to me, but the last one (which does not really need to be exposed to users) could definitely be shared.

We both need to know how many distinct values there are in a column, with some extra options. (You need to specify a subset of columns, and I need to limit the number of distinct values at some point.) Perhaps we could combine these into some sort of stats collector (maybe private[mllib] for now?) which we can both use. I'd be happy to do that, or let me know if you'd like to.
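
A rough sketch of what such a shared collector might look like (names and API are hypothetical, not code from either PR; a real version would also cap the per-column set size once it exceeds the limit):

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Collect the distinct values seen in a chosen subset of columns, so callers
// can decide which features look categorical (small value set) vs. continuous.
object ColumnStats {
  def distinctValues(data: RDD[Vector], columns: Seq[Int]): Map[Int, Set[Double]] = {
    val zero = columns.map(i => i -> Set.empty[Double]).toMap
    data.aggregate(zero)(
      (acc, v) => acc.map { case (i, seen) => i -> (seen + v(i)) },
      (a, b) => a.map { case (i, seen) => i -> (seen ++ b(i)) }
    )
  }
}
```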

@sryza (Contributor)

sryza commented Feb 12, 2015

@jkbradley sorry for the delay in responding here. Your breakdown of operations makes sense to me.

A stats collector seems like a good idea. I also wonder if there's some way to hook it in with Hive table statistics so we can avoid a pass over the data, but maybe that should be saved for future work. If you aren't planning to get to this in the near future, but think you'll have bandwidth to review, I'd be happy to work on it. Otherwise, I'm happy to look over whatever you put up.

@jkbradley (Member, Author)

@sryza Thanks for offering! That would be great if you have the bandwidth to work on this. I'd be happy to help review.

One comment: It would be nice to be able to take advantage of FeatureAttributes in the spark.ml package, but that's a WIP right now: #4460

@jkbradley jkbradley changed the title [SPARK-4081] [mllib] DatasetIndexer [SPARK-4081] [mllib] VectorIndexer Apr 3, 2015
@jkbradley jkbradley changed the title [SPARK-4081] [mllib] VectorIndexer [WIP] [SPARK-4081] [mllib] VectorIndexer Apr 3, 2015
@SparkQA

SparkQA commented Apr 3, 2015

Test build #29694 has finished for PR 3000 at commit 286d221.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class VectorIndexer extends Estimator[VectorIndexerModel] with VectorIndexerParams
  • This patch does not change any dependencies.

@jkbradley jkbradley changed the title [WIP] [SPARK-4081] [mllib] VectorIndexer [SPARK-4081] [mllib] VectorIndexer Apr 10, 2015
@jkbradley (Member, Author)

I think it's ready now. I'll add a quick Java unit test soon to make sure getters/setters work correctly.

@jkbradley (Member, Author)

I feel like the code could be a bit shorter; I'll think more about that tomorrow, and about whether we can make working with DataFrames and metadata easier in general.

@SparkQA

SparkQA commented Apr 10, 2015

Test build #30002 has finished for PR 3000 at commit 02236c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class VectorIndexer extends Estimator[VectorIndexerModel] with VectorIndexerParams
  • This patch does not change any dependencies.

@SparkQA

SparkQA commented Apr 10, 2015

Test build #30003 has finished for PR 3000 at commit 643b444.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class VectorIndexer extends Estimator[VectorIndexerModel] with VectorIndexerParams
  • This patch does not change any dependencies.

@mengxr (Contributor)

mengxr commented Apr 10, 2015

@jkbradley I have a question about the expected behavior. Say I have a vector column containing 2 features. One is categorical with values 0, 1, 2, 3, 4, 5, and the other is continuous but only takes values from 1.0, 2.0, 4.0. Then if I set maxCategories to 10, both will be recognized as categorical, and the mapping for the second feature may become something like 1.0 -> 0, 2.0 -> 1, 4.0 -> 2. Is that what we expect?

@jkbradley (Member, Author)

@mengxr Yes, that's what we'd expect. Eventually, we'd want to be able to specify which features to index, either (a) via another parameter specifying specific features to index or (b) via metadata, where we do not index features which already have metadata.
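
A small worked example of that behavior, following the scenario above (values are illustrative):

```scala
import org.apache.spark.mllib.linalg.Vectors

// With maxCategories = 10, both features have at most 10 distinct values, so both are indexed.
// Feature 0 (values 0, 1, ..., 5): already 0-based, so the mapping is the identity.
// Feature 1 (values 1.0, 2.0, 4.0): 1.0 -> 0.0, 2.0 -> 1.0, 4.0 -> 2.0
val original = Vectors.dense(3.0, 4.0)
val indexed  = Vectors.dense(3.0, 2.0) // what the transform would produce for this row
```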

require(numFeatures == actualNumFeatures, "VectorIndexerModel expected vector of length" +
s" $numFeatures but found length $actualNumFeatures")
}
dataset.withColumn(map(outputCol), newCol.as(map(outputCol), newField.metadata))
jkbradley (Member, Author):

It'd be nice for withColumn to take metadata. I'll make a JIRA for that.

@jkbradley (Member, Author)

I did some minor cleanups, but I don't see any great places to remove code. I added a Java test suite.

@SparkQA

SparkQA commented Apr 11, 2015

Test build #30078 has finished for PR 3000 at commit f5c57a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class VectorIndexer extends Estimator[VectorIndexerModel] with VectorIndexerParams
  • This patch does not change any dependencies.

@SparkQA

SparkQA commented Apr 11, 2015

Test build #30082 has finished for PR 3000 at commit 5956d91.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class VectorIndexer extends Estimator[VectorIndexerModel] with VectorIndexerParams
  • This patch does not change any dependencies.

@mengxr (Contributor)

mengxr commented Apr 13, 2015

LGTM. Merged into master. Thanks, and sorry for the long delay!

@asfgit asfgit closed this in d3792f5 Apr 13, 2015
@jkbradley jkbradley deleted the indexer branch May 4, 2015 23:04