The Stempel Analysis plugin integrates Lucene’s Stempel analysis module for Polish into Elasticsearch.
It provides high-quality stemming for Polish, based on the Egothor project.
The plugin provides the `polish` analyzer and the `polish_stem` and `polish_stop` token filters, which are not configurable.
The `polish` analyzer could be reimplemented as a `custom` analyzer that can then be extended and configured differently as follows:
PUT /stempel_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_stempel": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "polish_stop",
            "polish_stem"
          ]
        }
      }
    }
  }
}
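As a sketch of one possible extension, the variant below adds a `keyword_marker` filter ahead of `polish_stem` so that selected words are left unstemmed. The index name `stempel_keywords_example`, the analyzer name `extended_stempel`, the filter name `polish_keywords`, and the sample keyword are all assumptions chosen for illustration; none of them come from the plugin. The sketch also assumes that `polish_stem` skips tokens flagged as keywords, as Lucene stemming filters generally do.

```console
# Hypothetical index, analyzer, and filter names; the keyword list is only a placeholder.
PUT /stempel_keywords_example
{
  "settings": {
    "analysis": {
      "filter": {
        "polish_keywords": {
          "type": "keyword_marker",
          "keywords": [ "przykład" ]
        }
      },
      "analyzer": {
        "extended_stempel": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "polish_stop",
            "polish_keywords",
            "polish_stem"
          ]
        }
      }
    }
  }
}
```

The filter order matters here: `polish_keywords` has to run before `polish_stem`, otherwise the tokens are already stemmed by the time they are marked. The result can be inspected by sending sample text to the `_analyze` API of the new index with `"analyzer": "extended_stempel"`.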
The `polish_stop` token filter removes Polish stopwords (the `_polish_` list) and any other custom stopwords specified by the user. It supports only the predefined `_polish_` stopwords list. To use a different predefined list, use the {ref}/analysis-stop-tokenfilter.html[stop token filter] instead.
PUT /polish_stop_example
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "analyzer_with_stop": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "polish_stop"
            ]
          }
        },
        "filter": {
          "polish_stop": {
            "type": "polish_stop",
            "stopwords": [
              "_polish_",
              "jeść"
            ]
          }
        }
      }
    }
  }
}
GET /polish_stop_example/_analyze
{
  "analyzer": "analyzer_with_stop",
  "text": "Gdzie kucharek sześć, tam nie ma co jeść."
}
The above request returns:
{
  "tokens" : [
    {
      "token" : "kucharek",
      "start_offset" : 6,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "sześć",
      "start_offset" : 15,
      "end_offset" : 20,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}
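For comparison, a different predefined list would be configured through the generic `stop` token filter rather than `polish_stop`. The request below is a minimal sketch of that alternative; the index name `stop_alternative_example`, the analyzer name `analyzer_with_english_stop`, the filter name `english_stop`, and the choice of the `_english_` list are assumptions made for illustration.

```console
# Hypothetical index, analyzer, and filter names; _english_ stands in for any predefined list.
PUT /stop_alternative_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "analyzer_with_english_stop": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "english_stop"
          ]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}
```

Unlike `polish_stop`, this filter does not require the Stempel plugin, since `stop` is built into Elasticsearch.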