[SPARK-5886][ML] Add StringIndexer as a feature transformer #4735
Conversation
Test build #27868 has finished for PR 4735 at commit
```scala
class LabelIndexerModel private[ml] (
    override val parent: LabelIndexer,
    override val fittingParamMap: ParamMap,
    labels: Array[String]) extends Model[LabelIndexerModel] with LabelIndexerBase {
```
Open-ended question: if `labels` changes from run to run (e.g. more labels are added), then the numbering may change. The label for 0 may be different from run to run. Is that going to be a surprise later? If the indexing is only transient and is never persisted anywhere, it doesn't matter, but might it be? I'm afraid of reloading a model referring to labels 0, 1, 2 that have entirely different meanings. Maybe there are reasons this is not an issue, or the assumption is that the caller only ever appends values to the end of the `labels` array.
The `labels` in a fitted model should be viewed as immutable. If there are new labels, we can train a new `LabelIndexerModel`. We are going to put label names into the metadata, so we can still track which index maps to which label.

We could add an option to `LabelIndexer` that allows users to specify part of the mapping. If we implement this option, we can keep the original ordering unchanged on new datasets.
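To make the point concrete, here is a hypothetical, simplified sketch (not the PR's actual code) of how a frozen `labels` array captured at fit time pins down the index mapping; since the mapping is a pure function of that array, persisting the array (or its metadata) is enough to reproduce the same indices after reloading:

```scala
// A minimal sketch, assuming the labels array is fixed once fitting completes.
class LabelMapping(labels: Array[String]) {
  private val labelToIndex: Map[String, Double] =
    labels.zipWithIndex.map { case (label, i) => (label, i.toDouble) }.toMap

  def indexOf(label: String): Double =
    labelToIndex.getOrElse(label,
      throw new NoSuchElementException(s"Unseen label: $label"))
}

// An array fitted on run 1 and reloaded on run 2 yields identical indices:
val mapping = new LabelMapping(Array("a", "b", "c"))
assert(mapping.indexOf("b") == 1.0)
```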
Is `LabelIndexer` going to be different from `FeatureIndexer`? We will need a transformer to index features as well. I've been planning to revive my old PR for `DatasetIndexer`, which was headed this way: https://github.com/apache/spark/pull/3000. It would be good not to duplicate efforts: if you want to merge the two, that would be great, or I could push an update. The main thing I liked about `DatasetIndexer` is that it could be used to choose which features to treat as continuous vs. categorical based on a `maxCategories` threshold. Choosing automatically helps users avoid having to hand-pick columns as categorical vs. continuous.
The #3000 takes an …
Sounds OK, though I wonder if some people will think it only applies to "labels" since we use that term for prediction.
Oops, I didn't realize this was taking String. (I had only looked at the JIRA and PR descriptions, not the code.) They sound different.
Test build #29115 has finished for PR 4735 at commit
Test build #29127 has finished for PR 4735 at commit
test this please
Test build #29149 has finished for PR 4735 at commit
@srowen The output column now contains ML attributes. Do you have time to make another pass?
```scala
override def fit(dataset: DataFrame, paramMap: ParamMap): LabelIndexerModel = {
  val map = this.paramMap ++ paramMap
  val counts = dataset.select(map(labelCol)).map(_.getString(0)).countByValue()
  val labels = counts.toSeq.sortBy(-_._2).map(_._1).toArray
```
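For intuition, here is a hypothetical plain-Scala sketch of the same count-then-sort ordering, run on an in-memory `Seq` rather than a DataFrame (the values are made up):

```scala
// Sketch of the ordering logic above: count each value, then sort by
// descending count so the most frequent label gets index 0.
val values = Seq("b", "a", "a", "c", "a", "b")
val counts = values.groupBy(identity).map { case (v, vs) => (v, vs.size.toLong) }
val labels = counts.toSeq.sortBy(-_._2).map(_._1).toArray
// labels == Array("a", "b", "c")
```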
Instead of ordering by count, which requires all that counting, can labels just be ordered alphabetically? Maybe I'm missing why it's important to sort them this way, but I'd think it's arbitrary. A deterministic function of the labels themselves might be less surprising later, though I can't name a specific problem it would solve right now.
Since we need to find the distinct labels anyway, there is little overhead in counting and sorting. A couple of benefits (see the sketch below):
- better sparsity, as there will be more zeros
- easier inspection, as common labels appear at the beginning
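To illustrate the sparsity point, here is a hedged sketch assuming the indexed label later gets assembled into a vector (using `mllib.linalg.Vectors`, the vector API of that era; the row values are illustrative):

```scala
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical assembled rows; position 0 holds the indexed label.
// With frequency ordering, the majority label encodes as 0.0, so a
// sparse vector need not store that entry at all.
val denseRow  = Vectors.dense(0.0, 1.3, -0.5)
val sparseRow = Vectors.sparse(3, Seq((1, 1.3), (2, -0.5)))  // zeros omitted
```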
Hm, because the most common label is encoded as 0.0? It feels like a funny optimization since it's not "really" a 0, and these values are handled as individual values or dense vectors mostly (right?). Is counting the whole dataset by label really that cheap? It still means looking at every datum even if there are few labels. I don't strongly object or anything; this just hadn't occurred to me.
In the vector assembler, where we merge multiple columns into a vector column, the encoded indices are stored inside a sparse vector. For multiclass labels this doesn't help much, but for binary labels it saves at least half of the storage.

`countByValue` costs about the same as `distinct`:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1041
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L313
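Roughly, both operations boil down to one shuffle keyed on the distinct values. Here is a sketch of that shape (an assumption about the linked implementations, not a copy of Spark's source):

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Sketch: countByValue does a per-value reduceByKey and collects the counts...
def countByValueSketch[T: ClassTag](rdd: RDD[T]): Map[T, Long] =
  rdd.map(v => (v, 1L)).reduceByKey(_ + _).collect().toMap

// ...while distinct does essentially the same shuffle and drops the counts,
// so the extra cost of counting is marginal.
def distinctSketch[T: ClassTag](rdd: RDD[T]): RDD[T] =
  rdd.map(v => (v, 1L)).reduceByKey((a, _) => a).map(_._1)
```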
Test build #30015 has finished for PR 4735 at commit
Test build #30021 has finished for PR 4735 at commit
Merged into master.
This PR adds StringIndexer, a feature transformer that takes a column of string labels and outputs a double column with the labels indexed by descending frequency.
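For orientation, usage looks roughly like this (a sketch in terms of the Spark ML API as it later settled; `df` is assumed to be an existing DataFrame with a string column, and the column names are illustrative):

```scala
import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("category")       // string column to index
  .setOutputCol("categoryIndex")

// fit learns the frequency-ordered label array; transform appends the
// double-valued index column (most frequent label -> 0.0).
val indexed = indexer.fit(df).transform(df)
```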
TODOs: