analysis-stempel incorrect tokens generation for numbers [LUCENE-10290] #11326

@asfimport

**Actual**:
I observed unexpected behaviour: the stemmer modifies some numeric tokens, which leads to wrong search results.
For example, "2021" -> "20ć".

**Expected**:
Numeric strings should not be changed by the stemmer.

**Reproduce**:

The issue can be reproduced with Elasticsearch:

request:

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["polish_stem"],
  "text": "2021"
}

response:

{
  "tokens": [
    {
      "token": "20ć",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<NUM>",
      "position": 0
    }
  ]
}
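
To make the expected behaviour concrete, here is a minimal Python sketch of a guard that leaves numeric tokens untouched. It is only an illustration, not how Stempel or Elasticsearch is implemented: `stem_unless_numeric` and `buggy_stem` are hypothetical names, and `buggy_stem` is a stand-in that merely mimics the output reported above.

```python
from typing import Callable

# Token type that the standard tokenizer reports for "2021" in the response above.
NUM_TYPE = "<NUM>"


def stem_unless_numeric(term: str, token_type: str, stem: Callable[[str], str]) -> str:
    """Return the term unchanged if it is numeric, otherwise apply the stemmer."""
    if token_type == NUM_TYPE or term.isdigit():
        return term
    return stem(term)


def buggy_stem(term: str) -> str:
    # Hypothetical stand-in: reproduces the reported "2021" -> "20ć" mapping only.
    # It is NOT the real Stempel algorithm.
    return "20ć" if term == "2021" else term


if __name__ == "__main__":
    print(buggy_stem("2021"))                              # reported behaviour
    print(stem_unless_numeric("2021", NUM_TYPE, buggy_stem))  # expected behaviour
```

The guard checks both the token type and the term itself, so a digits-only term is still protected if a different tokenizer assigns it another type.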

I suspect that newer versions are also affected, but I have no way to verify this.


Migrated from LUCENE-10290 by Dominik
Environment:

**Elasticsearch version**: 7.11.2

**Plugins installed**: [analysis-stempel]

**OS version**: CentOS
