[SPARK-5886][ML] Add StringIndexer as a feature transformer #4735
Conversation
Test build #27868 has finished for PR 4735 at commit
```scala
class LabelIndexerModel private[ml] (
    override val parent: LabelIndexer,
    override val fittingParamMap: ParamMap,
    labels: Array[String]) extends Model[LabelIndexerModel] with LabelIndexerBase {
```
Open-ended question: if `labels` changes from run to run (e.g. more labels are added), then the numbering may change. The label for 0 may be different from run to run. Is that going to be a surprise later? If the indexing is only transient and is never persisted anywhere, it doesn't matter, but might it be? I'm afraid of reloading a model referring to labels 0, 1, 2 that have entirely different meanings. Maybe there are reasons this is not an issue, or the assumption is that the caller only ever appends values to the end of the `labels` array.
The `labels` in a fitted model should be viewed as immutable. If there are new labels, we can train a new `LabelIndexerModel`. We are going to put label names into the metadata, so we can still track which index maps to which label.

We could add an option to `LabelIndexer` that allows users to specify part of the mapping. If we implement this option, we can keep the original ordering unchanged on new datasets.
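To make the point concrete, here is a hypothetical, simplified sketch (not the PR's actual code) of how a frozen `labels` array captured at fit time pins down the index mapping; since the mapping is a pure function of that array, persisting the array (or its metadata) is enough to reproduce the same indices after reloading:

```scala
// A minimal sketch, assuming the labels array is fixed once fitting completes.
class LabelMapping(labels: Array[String]) {
  private val labelToIndex: Map[String, Double] =
    labels.zipWithIndex.map { case (label, i) => (label, i.toDouble) }.toMap

  def indexOf(label: String): Double =
    labelToIndex.getOrElse(label,
      throw new NoSuchElementException(s"Unseen label: $label"))
}

// An array fitted on run 1 and reloaded on run 2 yields identical indices:
val mapping = new LabelMapping(Array("a", "b", "c"))
assert(mapping.indexOf("b") == 1.0)
```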
Is `LabelIndexer` going to be different from `FeatureIndexer`? We will need a transformer to index features as well. I've been planning to revive my old PR for `DatasetIndexer`, which was headed this way: https://github.com/apache/spark/pull/3000. It would be good not to duplicate efforts: if you want to merge the two, that would be great, or I could push an update. The main thing I liked about `DatasetIndexer` is that it could be used to choose which features to treat as continuous vs. categorical based on a `maxCategories` threshold. Choosing automatically helps users avoid having to hand-pick columns as categorical vs. continuous.
The #3000 takes an …
Sounds OK, though I wonder if some people will think it only applies to "labels" since we use that term for prediction.
Oops, I didn't realize this was taking String. (I had only looked at the JIRA and PR descriptions, not the code.) They sound different.
Test build #29115 has finished for PR 4735 at commit
Test build #29127 has finished for PR 4735 at commit
test this please
Test build #29149 has finished for PR 4735 at commit
@srowen The output column now contains ML attributes. Do you have time to make another pass?
```scala
override def fit(dataset: DataFrame, paramMap: ParamMap): LabelIndexerModel = {
  val map = this.paramMap ++ paramMap
  val counts = dataset.select(map(labelCol)).map(_.getString(0)).countByValue()
  val labels = counts.toSeq.sortBy(-_._2).map(_._1).toArray
```
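For intuition, here is a hypothetical plain-Scala sketch of the same count-then-sort ordering, run on an in-memory `Seq` rather than a DataFrame (the values are made up):

```scala
// Sketch of the ordering logic above: count each value, then sort by
// descending count so the most frequent label gets index 0.
val values = Seq("b", "a", "a", "c", "a", "b")
val counts = values.groupBy(identity).map { case (v, vs) => (v, vs.size.toLong) }
val labels = counts.toSeq.sortBy(-_._2).map(_._1).toArray
// labels == Array("a", "b", "c")
```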
Instead of ordering by count, which requires all that counting, can labels just be ordered alphabetically? Maybe I'm missing why it's important to sort them this way, but I'd think it's arbitrary. A deterministic function of the labels themselves might be less surprising later, though I can't name a specific problem it would solve right now.
Since we need to find the distinct labels anyway, there is little overhead in counting and sorting. A couple of benefits (see the sketch below):
- better sparsity, as there will be more zeros
- easier inspection, as common labels appear at the beginning
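To illustrate the sparsity point, here is a hedged sketch assuming the indexed label later gets assembled into a vector (using `mllib.linalg.Vectors`, the vector API of that era; the row values are illustrative):

```scala
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical assembled rows; position 0 holds the indexed label.
// With frequency ordering, the majority label encodes as 0.0, so a
// sparse vector need not store that entry at all.
val denseRow  = Vectors.dense(0.0, 1.3, -0.5)
val sparseRow = Vectors.sparse(3, Seq((1, 1.3), (2, -0.5)))  // zeros omitted
```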
Hm, because the most common label is encoded as 0.0? It feels like a funny optimization since it's not "really" a 0, and these values are handled as individual values or dense vectors mostly (right?). Is counting the whole dataset by label really that cheap? It still means looking at every datum even if there are few labels. I don't strongly object or anything; this just hadn't occurred to me.
In the vector assembler, where we merge multiple columns into a vector column, the encoded indices are stored inside a sparse vector. For multiclass labels this doesn't help much, but for binary labels it saves at least half of the storage.

`countByValue` costs about the same as `distinct`:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1041
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L313
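Roughly, both operations boil down to one shuffle keyed on the distinct values. Here is a sketch of that shape (an assumption about the linked implementations, not a copy of Spark's source):

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Sketch: countByValue does a per-value reduceByKey and collects the counts...
def countByValueSketch[T: ClassTag](rdd: RDD[T]): Map[T, Long] =
  rdd.map(v => (v, 1L)).reduceByKey(_ + _).collect().toMap

// ...while distinct does essentially the same shuffle and drops the counts,
// so the extra cost of counting is marginal.
def distinctSketch[T: ClassTag](rdd: RDD[T]): RDD[T] =
  rdd.map(v => (v, 1L)).reduceByKey((a, _) => a).map(_._1)
```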
Test build #30015 has finished for PR 4735 at commit
Test build #30021 has finished for PR 4735 at commit
Merged into master.
This PR adds StringIndexer, a feature transformer that takes a column of string labels and outputs a double column with the labels indexed by descending frequency.
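For orientation, usage looks roughly like this (a sketch in terms of the Spark ML API as it later settled; `df` is assumed to be an existing DataFrame with a string column, and the column names are illustrative):

```scala
import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("category")       // string column to index
  .setOutputCol("categoryIndex")

// fit learns the frequency-ordered label array; transform appends the
// double-valued index column (most frequent label -> 0.0).
val indexed = indexer.fit(df).transform(df)
```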
TODOs: