Commit aea676c

BenFradet authored and jkbradley committed
[SPARK-12217][ML] Document invalid handling for StringIndexer
Added a paragraph regarding StringIndexer#setHandleInvalid to the ml-features documentation. I wonder if I should also add a snippet to the code example, input welcome.

Author: BenFradet <benjamin.fradet@gmail.com>

Closes apache#10257 from BenFradet/SPARK-12217.
1 parent 1b82203 commit aea676c

1 file changed (+36, -0 lines)


docs/ml-features.md

Lines changed: 36 additions & 0 deletions
@@ -459,6 +459,42 @@ column, we should get the following:
 "a" gets index `0` because it is the most frequent, followed by "c" with index `1` and "b" with
 index `2`.
 
+Additionally, there are two strategies regarding how `StringIndexer` will handle
+unseen labels when you have fit a `StringIndexer` on one dataset and then use it
+to transform another:
+
+- throw an exception (which is the default)
+- skip the row containing the unseen label entirely
+
+**Examples**
+
+Let's go back to our previous example but this time reuse our previously defined
+`StringIndexer` on the following dataset:
+
+~~~~
+ id | category
+----|----------
+ 0  | a
+ 1  | b
+ 2  | c
+ 3  | d
+~~~~
+
+If you've not set how `StringIndexer` handles unseen labels or set it to
+"error", an exception will be thrown.
+However, if you had called `setHandleInvalid("skip")`, the following dataset
+will be generated:
+
+~~~~
+ id | category | categoryIndex
+----|----------|---------------
+ 0  | a        | 0.0
+ 1  | b        | 2.0
+ 2  | c        | 1.0
+~~~~
+
+Notice that the row containing "d" does not appear.
+
 <div class="codetabs">
 
 <div data-lang="scala" markdown="1">
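
The commit message asks whether a snippet should accompany the new paragraph. A minimal sketch of what such a snippet might look like follows; it is not part of this commit. The six-row training set is an assumption chosen to reproduce the ordering described in the diff ("a" most frequent, then "c", then "b"). The object name `HandleInvalidSketch` and the method `indexSkippingUnseen` are hypothetical, and the `SparkSession` entry point assumes Spark 2.0 or later (the commit itself predates it).

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical wrapper object; the names here are illustrative only.
object HandleInvalidSketch {

  // Fits a StringIndexer on one dataset, then transforms a second dataset
  // that contains the unseen label "d", with handleInvalid set to "skip".
  def indexSkippingUnseen(spark: SparkSession): DataFrame = {
    // Assumed training data: "a" appears 3 times, "c" twice, "b" once.
    val training = spark.createDataFrame(Seq(
      (0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")
    )).toDF("id", "category")

    // Second dataset from the diff: the label "d" was never seen during fit.
    val withUnseen = spark.createDataFrame(Seq(
      (0, "a"), (1, "b"), (2, "c"), (3, "d")
    )).toDF("id", "category")

    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
      .setHandleInvalid("skip") // the default, "error", would throw on "d"

    indexer.fit(training).transform(withUnseen)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HandleInvalidSketch")
      .master("local[*]")
      .getOrCreate()

    // Should correspond to the three-row table in the diff: no row for "d".
    indexSkippingUnseen(spark).show()

    spark.stop()
  }
}
```

Leaving out the `setHandleInvalid("skip")` call, or passing `"error"`, should make the `transform` step fail on the row containing "d", which is the default behavior the paragraph describes.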
