[SPARK-4081] [mllib] VectorIndexer #3000


Closed · wants to merge 14 commits

Conversation

jkbradley (Member, Author)

Ready for review!

Since the original PR, I moved the code to the spark.ml API and renamed this to VectorIndexer.

This introduces a VectorIndexer class which does the following:

  • VectorIndexer.fit(): collect statistics about how many values each feature in a dataset (RDD[Vector]) can take (limited by maxCategories)
    • Features which exceed maxCategories are declared continuous, and the Model will treat them as such.
  • VectorIndexerModel.transform(): Convert categorical feature values to corresponding 0-based indices

Design notes:

  • This maintains sparsity in vectors by ensuring that categorical feature value 0.0 gets index 0.
  • This does not yet support transforming data with new (unknown) categorical feature values. That can be added later.
  • This is necessary for DecisionTree and tree ensembles.
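
A minimal usage sketch of the Estimator/Model flow described above (the DataFrame and column names are hypothetical; parameter values are just examples):

```scala
import org.apache.spark.ml.feature.VectorIndexer

// `data` is a hypothetical DataFrame with a vector column named "features".
val indexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(10)

// fit() scans the data to decide which features are categorical
// (at most maxCategories distinct values) vs. continuous.
val model = indexer.fit(data)

// transform() maps categorical values to 0-based indices; continuous features pass through.
val indexed = model.transform(data)
```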

Reviewers: Please check my use of metadata and my unit tests for it; I'm not sure if I covered everything in the tests.

Other notes:

  • This also adds a public toMetadata method to AttributeGroup (for simpler construction of metadata).
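
As a rough illustration of that addition (attribute names are hypothetical; the attribute classes come from the in-progress spark.ml attributes work referenced later in this thread):

```scala
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NominalAttribute, NumericAttribute}

// Build per-feature attributes for a vector column, then convert the group to
// column metadata with the new public toMetadata method.
val attrs: Array[Attribute] = Array(
  NominalAttribute.defaultAttr.withName("color").withValues("red", "green", "blue"),
  NumericAttribute.defaultAttr.withName("weight"))
val group = new AttributeGroup("features", attrs)
val metadata = group.toMetadata()
```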

CC: @mengxr

@SparkQA

SparkQA commented Oct 29, 2014

Test build #22463 has finished for PR 3000 at commit fc781bd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DatasetIndexer(

*
* This helps process a dataset of unknown vectors into a dataset with some continuous features
* and some categorical features. The choice between continuous and categorical is based upon
* a maxCategories parameter.
Contributor:

Using maxCategories as a threshold is a good default. In the future, we may want to add different criteria for some features. Thoughts? Also, should we have a reasonable default value?

jkbradley (Member, Author):

I agree about adding more criteria and options later on.
For a default value, does 16 seem reasonable?

Contributor:

Sure. I was going to suggest 32, but 16 is reasonable as well. I think R only supports up to 32 levels for categorical features, so that was my motivation.

jkbradley (Member, Author):

OK, 32 sounds good

@manishamde (Contributor)

What about transforming labels as well? That would help with classification, especially converting +1/-1 labels to 0/1 for binary classification.

@jkbradley (Member, Author)

Good point; I intended this to be used for labels too, so I'll add fit() and transform() methods which take RDD[Double]. Perhaps I should rename "features" to "columns." I'd imagine someone either using 2 indexers (1 for labels and 1 for features), or zipping the labels and features into 1 vector and then using 1 indexer (see the sketch below). We could also add other fit() and transform() methods later on so users don't have to do the zipping manually.
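
A minimal sketch of the zipping option mentioned above (the helper name is hypothetical; assumes labeled data as RDD[LabeledPoint]):

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Prepend the label to each feature vector so that a single indexer can
// index the label column and the feature columns together.
def zipLabelAndFeatures(data: RDD[LabeledPoint]): RDD[Vector] =
  data.map(p => Vectors.dense(p.label +: p.features.toArray))
```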

@manishamde (Contributor)

Agree.

*
* @param data Dataset with equal-length vectors.
* NOTE: A single instance of [[DatasetIndexer]] must always be given vectors of
* the same length. If given non-matching vectors, this method will throw an error.
Contributor:

Minor: extra space.

@jkbradley (Member, Author)

@manishamde Thanks for the feedback! I realized I can't really include a fit(RDD[Double]) method since it conflicts with fit(RDD[Vector]): type erasure strips away the Double/Vector element type, leaving both overloads with signature fit(RDD[_]). I instead included a note about mapping to Vector (see the sketch below).
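
A small sketch of the erasure issue and the workaround (names are illustrative, not the code in this PR):

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// After JVM type erasure both overloads have the signature fit(RDD[_]),
// so defining them together does not compile ("double definition"):
//   def fit(data: RDD[Vector]): Unit = ???
//   def fit(data: RDD[Double]): Unit = ???

// Workaround: wrap each label in a length-1 Vector before calling the indexer.
def labelsAsVectors(labels: RDD[Double]): RDD[Vector] =
  labels.map(label => Vectors.dense(label))
```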

I believe the PR is ready. I removed the non-implemented parameter for unrecognized categories, to be added later.

@manishamde (Contributor)

Cool. I will make another pass shortly.

@SparkQA

SparkQA commented Oct 31, 2014

Test build #22642 has finished for PR 3000 at commit 831aa92.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DatasetIndexer(val maxCategories: Int) extends Logging with Serializable

@manishamde (Contributor)

LGTM.

@SparkQA

SparkQA commented Nov 3, 2014

Test build #507 has finished for PR 3000 at commit 831aa92.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class DatasetIndexer(val maxCategories: Int) extends Logging with Serializable

@SparkQA

SparkQA commented Nov 3, 2014

Test build #22787 has finished for PR 3000 at commit ee495e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22890 has finished for PR 3000 at commit 0d947cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DatasetIndexer(val maxCategories: Int) extends Logging with Serializable
    • class RDDFunctions[T: ClassTag](self: RDD[T]) extends Serializable

@SparkQA

SparkQA commented Nov 4, 2014

Test build #22901 has finished for PR 3000 at commit aed6bb3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DatasetIndexer(val maxCategories: Int) extends Logging with Serializable

@sryza (Contributor)

sryza commented Nov 11, 2014

Just noticed this. I'd been working on something similar a little while ago on SPARK-1216 / #304. One difference is that I had aimed to accept categorical features that are strings, as input data commonly comes this way. Do you think that functionality should come here or in a separate PR?

@jkbradley (Member, Author)

@sryza Hi, yes, I didn't realize that they shared some functionality. It would be great to coordinate. I think these 2 types of feature transformations are pretty different, but there is some shared underlying functionality.
Feature operations:

  • Decide which features should be categorical (this PR)
  • Relabel categorical feature values based on an index (this PR)
  • Create new features by expanding a categorical feature (your PR)
  • Count statistics about dataset columns (both PRs)

The first 3 operations seem fairly distinct to me, but the last one (which does not really need to be exposed to users) could definitely be shared.

We both need to know how many distinct values there are in a column, with some extra options. (You need to specify a subset of columns, and I need to limit the number of distinct values at some point.) Perhaps we could combine these into some sort of stats collector (maybe private[mllib] for now?) which we can both use. I'd be happy to do that, or let me know if you'd like to.
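
A rough sketch of what such a shared collector might look like (names and API are hypothetical, not code from either PR; a real version would also cap the per-column set size once it exceeds the limit):

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Collect the distinct values seen in a chosen subset of columns, so callers
// can decide which features look categorical (small value set) vs. continuous.
object ColumnStats {
  def distinctValues(data: RDD[Vector], columns: Seq[Int]): Map[Int, Set[Double]] = {
    val zero = columns.map(i => i -> Set.empty[Double]).toMap
    data.aggregate(zero)(
      (acc, v) => acc.map { case (i, seen) => i -> (seen + v(i)) },
      (a, b) => a.map { case (i, seen) => i -> (seen ++ b(i)) }
    )
  }
}
```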

@sryza (Contributor)

sryza commented Feb 12, 2015

@jkbradley sorry for the delay in responding here. Your breakdown of operations makes sense to me.

A stats collector seems like a good idea. I also wonder if there's some way to hook it in with Hive table statistics so we can avoid a pass over the data, but maybe that should be saved for future work. If you aren't planning to get to this in the near future, but think you'll have bandwidth to review, I'd be happy to work on it. Otherwise, I'm happy to look over whatever you put up.

@jkbradley (Member, Author)

@sryza Thanks for offering! That would be great if you have the bandwidth to work on this. I'd be happy to help review.

One comment: It would be nice to be able to take advantage of FeatureAttributes in the spark.ml package, but that's a WIP right now: #4460

@jkbradley jkbradley changed the title [SPARK-4081] [mllib] DatasetIndexer [SPARK-4081] [mllib] VectorIndexer Apr 3, 2015
@jkbradley jkbradley changed the title [SPARK-4081] [mllib] VectorIndexer [WIP] [SPARK-4081] [mllib] VectorIndexer Apr 3, 2015
@SparkQA

SparkQA commented Apr 3, 2015

Test build #29694 has finished for PR 3000 at commit 286d221.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class VectorIndexer extends Estimator[VectorIndexerModel] with VectorIndexerParams
  • This patch does not change any dependencies.

@jkbradley jkbradley changed the title [WIP] [SPARK-4081] [mllib] VectorIndexer [SPARK-4081] [mllib] VectorIndexer Apr 10, 2015
@jkbradley (Member, Author)

I think it's ready now. I'll add a quick Java unit test soon to make sure getters/setters work correctly.

@jkbradley (Member, Author)

I feel like the code could be a bit shorter; I'll think more about that tomorrow, and about whether we can make working with DataFrames and metadata easier in general.

@SparkQA

SparkQA commented Apr 10, 2015

Test build #30002 has finished for PR 3000 at commit 02236c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class VectorIndexer extends Estimator[VectorIndexerModel] with VectorIndexerParams
  • This patch does not change any dependencies.

@SparkQA

SparkQA commented Apr 10, 2015

Test build #30003 has finished for PR 3000 at commit 643b444.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class VectorIndexer extends Estimator[VectorIndexerModel] with VectorIndexerParams
  • This patch does not change any dependencies.

@mengxr (Contributor)

mengxr commented Apr 10, 2015

@jkbradley I have a question about the expected behavior. Say I have a vector column containing 2 features. One is categorical with values 0, 1, 2, 3, 4, 5, and the other is continuous but only takes values from 1.0, 2.0, 4.0. Then if I set maxCategories to 10, both will be recognized as categorical, and the mapping for the second feature may become something like 1.0 -> 0, 2.0 -> 1, 4.0 -> 2. Is that what we expect?

@jkbradley (Member, Author)

@mengxr Yes, that's what we'd expect. Eventually, we'd want to be able to specify which features to index, either (a) via another parameter specifying specific features to index or (b) via metadata, where we do not index features which already have metadata.
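
A small worked example of that behavior, following the scenario above (values are illustrative):

```scala
import org.apache.spark.mllib.linalg.Vectors

// With maxCategories = 10, both features have at most 10 distinct values, so both are indexed.
// Feature 0 (values 0, 1, ..., 5): already 0-based, so the mapping is the identity.
// Feature 1 (values 1.0, 2.0, 4.0): 1.0 -> 0.0, 2.0 -> 1.0, 4.0 -> 2.0
val original = Vectors.dense(3.0, 4.0)
val indexed  = Vectors.dense(3.0, 2.0) // what the transform would produce for this row
```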

require(numFeatures == actualNumFeatures, "VectorIndexerModel expected vector of length" +
s" $numFeatures but found length $actualNumFeatures")
}
dataset.withColumn(map(outputCol), newCol.as(map(outputCol), newField.metadata))
jkbradley (Member, Author):

It'd be nice for withColumn to take metadata. I'll make a JIRA for that.

@jkbradley (Member, Author)

I did some minor cleanups, but I don't see any great places to remove code. I added a Java test suite.

@SparkQA

SparkQA commented Apr 11, 2015

Test build #30078 has finished for PR 3000 at commit f5c57a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class VectorIndexer extends Estimator[VectorIndexerModel] with VectorIndexerParams
  • This patch does not change any dependencies.

@SparkQA

SparkQA commented Apr 11, 2015

Test build #30082 has finished for PR 3000 at commit 5956d91.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class VectorIndexer extends Estimator[VectorIndexerModel] with VectorIndexerParams
  • This patch does not change any dependencies.

@mengxr (Contributor)

mengxr commented Apr 13, 2015

LGTM. Merged into master. Thanks, and sorry for the long delay!

@asfgit asfgit closed this in d3792f5 Apr 13, 2015
@jkbradley jkbradley deleted the indexer branch May 4, 2015 23:04