Description
Without looking into the internals of `stemmer_override`, I assumed it works similarly to the `synonym` token filter (i.e. that it translates the given mapping rules into a `SynonymMap` in the same way), which turns out not to be the case:
```
PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "synonyms": {
          "type": "synonym",
          "synonyms": [
            "reading => read",
            "swimming, swims => swim"
          ]
        },
        "stems": {
          "type": "stemmer_override",
          "rules": [
            "reading => read",
            "swimming, swims => swim"
          ]
        }
      }
    }
  }
}
```
Simple rules with a single token on the LHS work the same (both `synonyms` and `stems` output `read` for `reading`), but rules with multiple tokens on the LHS (also known as "contraction rules") do not:
SYNONYMS

```
GET test/_analyze
{
  "text": "swimming",
  "tokenizer": "standard",
  "filter": ["synonyms"]
}
```
output:

```
{
  "tokens": [
    {
      "token": "swim",
      "start_offset": 0,
      "end_offset": 8,
      "type": "SYNONYM",
      "position": 0
    }
  ]
}
```
STEMS

```
GET test/_analyze
{
  "text": "swimming",
  "tokenizer": "standard",
  "filter": ["stems"]
}
```
output:

```
{
  "tokens": [
    {
      "token": "swimming",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
```
There is, of course, a simple workaround for my use case (expanding each contraction rule into a sequence of single-token mapping rules), but the user experience is poor, IMO.
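The workaround above can be sketched as a small preprocessing step that runs before building the index settings. This is a hypothetical helper (`expand_rules` is not part of Elasticsearch), assuming the same `"a, b => c"` rule syntax shown in the examples:

```python
def expand_rules(rules):
    """Expand contraction rules ("a, b => c") into the single-token
    rules ("a => c", "b => c") that stemmer_override handles."""
    expanded = []
    for rule in rules:
        lhs, _, rhs = rule.partition("=>")
        # Each comma-separated token on the LHS becomes its own rule.
        for token in lhs.split(","):
            expanded.append(f"{token.strip()} => {rhs.strip()}")
    return expanded

rules = ["reading => read", "swimming, swims => swim"]
print(expand_rules(rules))
# ['reading => read', 'swimming => swim', 'swims => swim']
```

The expanded list can then be passed as the `rules` array of the `stemmer_override` filter definition, producing the same analysis result as the contraction rule was meant to.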
Although nothing in the documentation claims that "contraction rules" are supported by the stemmer override token filter, I find this behavior confusing. I would prefer a verbose error at filter registration over a "silent failure" at analysis time. But to be honest, I think that ideally `stemmer_override` should support contraction rules the same way the `synonym` token filter does.