Description
Without looking into the internals of `stemmer_override`, I assumed it works similarly to the `synonym` token filter (i.e. that it translates the given mapping rules into a `SynonymMap` in the same way), which turns out not to be the case:
```
PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "synonyms": {
          "type": "synonym",
          "synonyms": [
            "reading => read",
            "swimming, swims => swim"
          ]
        },
        "stems": {
          "type": "stemmer_override",
          "rules": [
            "reading => read",
            "swimming, swims => swim"
          ]
        }
      }
    }
  }
}
```
Simple rules with a single token on the LHS work the same (both `synonyms` and `stems` output `read` for `reading`), but rules with multiple tokens on the LHS (also known as "contraction rules") do not:
SYNONYMS

```
GET test/_analyze
{
  "text": "swimming",
  "tokenizer": "standard",
  "filter": ["synonyms"]
}
```
output:

```
{
  "tokens": [
    {
      "token": "swim",
      "start_offset": 0,
      "end_offset": 8,
      "type": "SYNONYM",
      "position": 0
    }
  ]
}
```
STEMS

```
GET test/_analyze
{
  "text": "swimming",
  "tokenizer": "standard",
  "filter": ["stems"]
}
```
output:

```
{
  "tokens": [
    {
      "token": "swimming",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
```
There is, of course, a simple workaround for my use case (expanding each contraction rule into a sequence of single-token mapping rules), but the user experience is poor, IMO.
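The workaround above can be sketched as a small preprocessing step that runs before building the index settings. This is a hypothetical helper (`expand_rules` is not part of Elasticsearch), assuming the same `"a, b => c"` rule syntax shown in the examples:

```python
def expand_rules(rules):
    """Expand contraction rules ("a, b => c") into the single-token
    rules ("a => c", "b => c") that stemmer_override handles."""
    expanded = []
    for rule in rules:
        lhs, _, rhs = rule.partition("=>")
        # Each comma-separated token on the LHS becomes its own rule.
        for token in lhs.split(","):
            expanded.append(f"{token.strip()} => {rhs.strip()}")
    return expanded

rules = ["reading => read", "swimming, swims => swim"]
print(expand_rules(rules))
# ['reading => read', 'swimming => swim', 'swims => swim']
```

The expanded list can then be passed as the `rules` array of the `stemmer_override` filter definition, producing the same analysis result as the contraction rule was meant to.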
Although nothing in the documentation claims that "contraction rules" are supported by the stemmer override token filter, I find this behavior confusing. I would prefer a verbose error at filter registration over a "silent failure" at analysis time. But to be honest, I think that ideally `stemmer_override` should support contraction rules the same way the `synonym` token filter does.