Multiple tokens on LHS in stemmer_override rules #56113

Closed
@telendt

Without looking into the internals of stemmer_override, I assumed it works similarly to the synonym token filter (and translates the given mapping rules into a SynonymMap in the same way), which seems not to be the case:

PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "synonyms": {
          "type": "synonym",
          "synonyms": [
            "reading => read",
            "swimming, swims => swim"
          ]
        },
        "stems": {
          "type": "stemmer_override",
          "rules": [
            "reading => read",
            "swimming, swims => swim"
          ]
        }
      }
    }
  }
}

Simple rules with a single token on the LHS work the same (so both synonyms and stems output read for reading), but rules with multiple tokens on the LHS (also known as "contraction rules") do not:

SYNONYMS

GET test/_analyze
{
  "text": "swimming",
  "tokenizer": "standard", 
  "filter": ["synonyms"]
}

output:

{
  "tokens": [
    {
      "token": "swim",
      "start_offset": 0,
      "end_offset": 8,
      "type": "SYNONYM",
      "position": 0
    }
  ]
}

STEMS

GET test/_analyze
{
  "text": "swimming",
  "tokenizer": "standard", 
  "filter": ["stems"]
}

output:

{
  "tokens": [
    {
      "token": "swimming",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}

There is, of course, a simple workaround for my use case (expanding each contraction rule into a sequence of single-token mapping rules), but the user experience is bad IMO.
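For reference, the workaround can be sketched as a small client-side helper that runs before the index settings are sent. This is only an illustration: `expand_rules` is a hypothetical name, not part of Elasticsearch or any client library, and it assumes every rule has the plain `a, b => c` shape used in this issue.

```python
def expand_rules(rules):
    """Expand contraction rules such as 'a, b => c' into ['a => c', 'b => c'].

    Hypothetical helper; assumes each rule contains exactly one '=>'.
    """
    expanded = []
    for rule in rules:
        lhs, sep, rhs = rule.partition("=>")
        if not sep:
            raise ValueError(f"rule has no '=>': {rule!r}")
        target = rhs.strip()
        # Emit one single-token rule per comma-separated token on the LHS.
        for token in lhs.split(","):
            token = token.strip()
            if token:
                expanded.append(f"{token} => {target}")
    return expanded


rules = ["reading => read", "swimming, swims => swim"]
print(expand_rules(rules))
# ['reading => read', 'swimming => swim', 'swims => swim']
```

The resulting list can then be used as the "rules" value of the stems filter in place of the original one.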

Although nothing in the documentation says that "contraction rules" are supported in the stemmer override token filter, I find this behavior confusing. I would prefer a verbose error at filter registration over a "silent failure" at analysis time. But to be honest, I think that ideally stemmer_override should support contraction rules the same way the synonym token filter does.
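Until something like that exists, a rough client-side stand-in for the "verbose error" is to check the rules before creating the index. The sketch below is an assumption-laden illustration: `check_single_token_lhs` is a hypothetical name, and it treats any comma on the LHS as a contraction rule.

```python
def check_single_token_lhs(rules):
    """Raise ValueError naming every rule with more than one token on the LHS.

    Hypothetical pre-flight check, not an Elasticsearch API.
    """
    offending = []
    for rule in rules:
        lhs, sep, _rhs = rule.partition("=>")
        tokens = [t.strip() for t in lhs.split(",") if t.strip()]
        if sep and len(tokens) > 1:
            offending.append(rule)
    if offending:
        raise ValueError(
            "stemmer_override rules with multiple tokens on the LHS: "
            + "; ".join(offending)
        )


# Single-token rules pass silently.
check_single_token_lhs(["reading => read"])

# The contraction rule from this issue is reported by name.
try:
    check_single_token_lhs(["reading => read", "swimming, swims => swim"])
except ValueError as err:
    print(err)
```

This only moves the failure earlier on the client side; it does not change how the filter itself treats such rules.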
