_analyze api does not correctly use normalizers when specified #48650

Closed
@dougnelas

Description


Elasticsearch version (bin/elasticsearch --version): 7.3.1

Plugins installed: []

JVM version (java -version): Embedded Java 11

OS version (uname -a if on a Unix-like system):

Description: When using the _analyze API endpoint on an index with a normalizer defined in the index settings, the output comes from the default analyzer instead of the normalizer under test.

Steps to reproduce:

Create a test index

```json
PUT word_delimiter_test
{
  "settings": {
    "analysis": {
      "char_filter": {
        "filter_noisy_characters": {
          "pattern": "[.-:\"]",
          "type": "pattern_replace",
          "replacement": " "
        },
        "convert_dots": {
          "flags": "CASE_INSENSITIVE",
          "pattern": "\\.(net|js|io)",
          "type": "pattern_replace",
          "replacement": "dot$1"
        }
      },
      "filter": {
        "word_delimiter": {
          "split_on_numerics": true,
          "generate_word_parts": true,
          "generate_number_parts": true,
          "catenate_all": true,
          "type": "word_delimiter_graph",
          "type_table": [
            "# => ALPHA",
            "+ => ALPHA"
          ]
        },
        "synonym": {
          "type": "synonym_graph",
          "synonyms": [
            "casp, comptia advanced security practitioner"
          ]
        }
      },
      "analyzer": {
        "test": {
          "char_filter": [
            "convert_dots"
          ],
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "synonym",
            "word_delimiter",
            "flatten_graph"
          ]
        }
      },
      "normalizer": {
        "languages_normalizer": {
          "filter": [
            "trim"
          ],
          "type": "custom",
          "char_filter": [
            "convert_dots",
            "filter_noisy_characters"
          ]
        }
      }
    }
  }
}
```
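
As a quick sanity check, the registered normalizer can be inspected with the standard index settings API:

```json
GET word_delimiter_test/_settings
```

The response should list languages_normalizer under settings.index.analysis.normalizer, confirming the index was created as above.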

Then call the _analyze endpoint, specifying the normalizer to test:

```json
GET word_delimiter_test/_analyze
{
  "text": "Wi-fi",
  "normalizer": "languages_normalizer"
}
```

The expected output is the single token "wifi".
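
For reference, a response with the normalizer applied would presumably look like the following single-token result (a sketch based on the expectation above; offsets assume the 5-character input "Wi-fi"):

```json
{
  "tokens" : [
    {
      "token" : "wifi",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    }
  ]
}
```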

But the output is instead produced by the default analyzer, which tokenizes the input:

```json
{
  "tokens" : [
    {
      "token" : "wi",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "fi",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
```
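
Until this is fixed, one possible workaround is to emulate the normalizer with an ad-hoc analysis chain in the _analyze request itself, reusing the char filters defined on the index (a sketch; the keyword tokenizer here stands in for the normalizer's implicit single-token behavior):

```json
GET word_delimiter_test/_analyze
{
  "text": "Wi-fi",
  "tokenizer": "keyword",
  "char_filter": ["convert_dots", "filter_noisy_characters"],
  "filter": ["trim"]
}
```

This exercises the same components as languages_normalizer, though it does not verify the normalizer wiring itself.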
