Stempel Polish Analysis Plugin

The Stempel Analysis plugin integrates Lucene's Stempel analysis module for Polish into Elasticsearch.

It provides high-quality stemming for Polish, based on the Egothor project.

stempel tokenizer and token filters

The plugin provides the polish analyzer and the polish_stem and polish_stop token filters, which are not configurable.
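
As a quick check that the plugin is available, the polish analyzer can be exercised through the analyze API. This is a minimal sketch that assumes the plugin is installed on every node; the sample sentence is borrowed from the polish_stop example in this document, and no output is shown because the exact stems depend on the Stempel stemming tables.

```console
GET /_analyze
{
  "analyzer": "polish",
  "text": "Gdzie kucharek sześć, tam nie ma co jeść."
}
```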

Reimplementing and extending the analyzers

The polish analyzer can be reimplemented as a custom analyzer, which can then be extended and configured differently, as follows:

PUT /stempel_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_stempel": {
          "tokenizer":  "standard",
          "filter": [
            "lowercase",
            "polish_stop",
            "polish_stem"
          ]
        }
      }
    }
  }
}
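
One way to verify the rebuilt analyzer is to run it against a sample sentence with the analyze API. The request below is a sketch that assumes the stempel_example index above has been created; the sample text is chosen for illustration only.

```console
GET /stempel_example/_analyze
{
  "analyzer": "rebuilt_stempel",
  "text": "Gdzie kucharek sześć, tam nie ma co jeść."
}
```

Running the same text through the built-in polish analyzer should produce comparable tokens, which makes it easier to spot the effect of any filters added to the custom chain later.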

polish_stop token filter

The polish_stop token filter removes Polish stopwords (_polish_) and any custom stopwords specified by the user. The only predefined stopword list this filter supports is _polish_. To use a different predefined list, use the {ref}/analysis-stop-tokenfilter.html[stop token filter] instead.

PUT /polish_stop_example
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "analyzer_with_stop": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "polish_stop"
            ]
          }
        },
        "filter": {
          "polish_stop": {
            "type": "polish_stop",
            "stopwords": [
              "_polish_",
              "jeść"
            ]
          }
        }
      }
    }
  }
}

GET polish_stop_example/_analyze
{
  "analyzer": "analyzer_with_stop",
  "text": "Gdzie kucharek sześć, tam nie ma co jeść."
}

The above request returns:

{
  "tokens" : [
    {
      "token" : "kucharek",
      "start_offset" : 6,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "sześć",
      "start_offset" : 15,
      "end_offset" : 20,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}
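
To see what the custom stopword contributes, the same text can be analyzed with only the predefined list. The request below is a sketch, not a verified example: it assumes that the inline filter definitions accepted by the analyze API also work for the plugin's polish_stop type. If that holds, the token jeść would be expected to survive, because it is removed above only by the custom entry.

```console
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "polish_stop",
      "stopwords": [
        "_polish_"
      ]
    }
  ],
  "text": "Gdzie kucharek sześć, tam nie ma co jeść."
}
```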