Skip to content

[SPARK-5886][ML] Add StringIndexer as a feature transformer #4735

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 9 commits into from

Conversation

mengxr
Copy link
Contributor

@mengxr mengxr commented Feb 24, 2015

This PR adds string indexer, which takes a column of string labels and outputs a double column with labels indexed by their frequency.

TODOs:

  • store feature to index map in output metadata

@SparkQA
Copy link

SparkQA commented Feb 24, 2015

Test build #27868 has finished for PR 4735 at commit d0d45f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LabelIndexer extends Estimator[LabelIndexerModel] with LabelIndexerBase

class LabelIndexerModel private[ml] (
override val parent: LabelIndexer,
override val fittingParamMap: ParamMap,
labels: Array[String]) extends Model[LabelIndexerModel] with LabelIndexerBase {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open-ended question: if labels changes from run to run (e.g. more labels are added) then the numbering may change. The label for 0 may be different from run to run. Is that going to be a surprise later? If the indexing is only transient and is never persisted anywhere, it doesn't matter, but, might it be? I'm afraid of reloading a model referring to label 0, 1, 2 that have entirely different meaning.

Maybe there are reasons this is not an issue, or the assumption is that the caller only appends values to the end of the labels array, ever.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The labels in a fitted model should be viewed as immutable. If there are new labels, we can train a new LabelIndexerModel. We are gonna put label names into the metadata, so we still tracks which index maps to which label.

We can add an option to LabelIndexer that allows users to specify part of the mapping. If we implement this option, then we can keep the original ordering unchanged on new datasets.

@jkbradley
Copy link
Member

Is LabelIndexer going to be different from FeatureIndexer? We will need a transformer to index features as well. I've been planning to revive my old PR for DatasetIndexer which was headed this way: [https://github.com//pull/3000], but it would be good not to duplicate efforts.

If you want to merge the 2, that would be great, or I could push an update. The main thing I liked about DatasetIndexer is that it could be used to choose which features to treat as continuous vs. categorical based on a maxCategories threshold. Choosing automatically helps users to avoid having to hand-pick columns as categorical vs. continuous.

@mengxr
Copy link
Contributor Author

mengxr commented Feb 24, 2015

The label in the name is quite general. It could be labels used in classification or just an arbitrary column with string labels. This is the same as LabelEncoder in sklearn. The LabelIndexer should also create ML attributes using the labels.

#3000 takes an RDD[Vector] and tries to decide which columns should be categorical. I don't see an overlap in terms of functionality. If the plan is to make DatasetIndexer a feature transformer that handles both string labels -> indices and numeric -> categorical, we can definitely go with this direction. Btw, having a single maxCategories might cause unexpected problems, e.g., treating categorical features as continuous. We may still need to ask users to specify which columns are categorical.

@jkbradley
Copy link
Member

LabelIndexer

Sounds OK, though I wonder if some people will think it only applies to "labels" since we use that term for prediction.

String vs. numeric

Oops, I didn't realize this was taking String. (I had only looked at the JIRA and PR descriptions, not the code.) They sound different.

@SparkQA
Copy link

SparkQA commented Mar 25, 2015

Test build #29115 has finished for PR 4735 at commit f8b30f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LabelIndexer extends Estimator[LabelIndexerModel] with LabelIndexerBase
    • case class Alias(child: Expression, name: String)(
    • class Column(protected[sql] val expr: Expression) extends Logging

@SparkQA
Copy link

SparkQA commented Mar 25, 2015

Test build #29127 has finished for PR 4735 at commit 457166e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LabelIndexer extends Estimator[LabelIndexerModel] with LabelIndexerBase
    • implicit class DslSymbol(sym: Symbol) extends ImplicitAttribute

@mengxr
Copy link
Contributor Author

mengxr commented Mar 25, 2015

test this please

@SparkQA
Copy link

SparkQA commented Mar 25, 2015

Test build #29149 has finished for PR 4735 at commit 457166e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LabelIndexer extends Estimator[LabelIndexerModel] with LabelIndexerBase

@mengxr
Copy link
Contributor Author

mengxr commented Mar 25, 2015

@srowen The output column now contains ML attributes. Do you have time to make another pass?

override def fit(dataset: DataFrame, paramMap: ParamMap): LabelIndexerModel = {
val map = this.paramMap ++ paramMap
val counts = dataset.select(map(labelCol)).map(_.getString(0)).countByValue()
val labels = counts.toSeq.sortBy(-_._2).map(_._1).toArray
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of ordering by count, which requires all that counting, can labels just be ordered alphabetically? maybe I miss why it's important to sort them this way, but I'd think it's arbitrary. A deterministic function of the labels themselves might be less surprising or something later, though I can't name a specific problem it would solve, now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we need to distinct labels anyway, there is little overhead with count and sort. Couple benefits:

  1. better sparsity as there will be more zeros
  2. easy inspection as common labels are at the beginning.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm because the most common label is encoded as 0.0? it feels like a funny optimization since it's not "really" a 0, and these values are handled as individual values or dense vectors mostly (right?). Is counting the whole data set by label really that cheap? it still means looking at every datum even if there are few labels. I don't strongly object or anything, this just hadn't occurred to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In vector assembler, where we merge multiple columns into a vector column, the encoded indices are stored inside a sparse vector. For multiclass labels, this doesn't help much. But for binary labels, this saves at least half of the storage.

countByValue costs about the same as distinct:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1041
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L313

@SparkQA
Copy link

SparkQA commented Apr 10, 2015

Test build #30015 has finished for PR 4735 at commit 700e70f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class StringIndexer extends Estimator[StringIndexerModel] with StringIndexerBase
    • class SparseMatrix(Matrix):
  • This patch does not change any dependencies.

@SparkQA
Copy link

SparkQA commented Apr 10, 2015

Test build #30021 has finished for PR 4735 at commit d82575f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class StringIndexer extends Estimator[StringIndexerModel] with StringIndexerBase
  • This patch does not change any dependencies.

@mengxr mengxr changed the title [SPARK-5886][ML] Add label indexer [SPARK-5886][ML] Add StringIndexer as a feature transformer Apr 13, 2015
@mengxr
Copy link
Contributor Author

mengxr commented Apr 13, 2015

Merged into master.

@asfgit asfgit closed this in 685ddcf Apr 13, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants