Commit 01c0d49

AntonEliatra, kolchfa-aws, and natebower authored
Add hunspell token filter #8061 (#8070)
* adding hunspell token filter #8061

Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>

* adding dedup and example where to download files

Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>

* Update hunspell.md

Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>

* Update hunspell.md

Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>

* updating parameter table

Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>

* Apply suggestions from code review

Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

* Update _analyzers/token-filters/hunspell.md

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

---------

Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
1 parent f98dcaf commit 01c0d49

2 files changed: +109 -1 lines changed

_analyzers/token-filters/hunspell.md

Lines changed: 108 additions & 0 deletions
@@ -0,0 +1,108 @@
---
layout: default
title: Hunspell
parent: Token filters
nav_order: 160
---

# Hunspell token filter

The `hunspell` token filter performs stemming and morphological analysis of words in a specific language. It applies Hunspell dictionaries, which are widely used in spell checkers, to reduce words to their root forms (stemming).

The Hunspell dictionary files are automatically loaded at startup from the `<OS_PATH_CONF>/hunspell/<locale>` directory. For example, the `en_GB` locale must have exactly one `.aff` file and one or more `.dic` files in the `<OS_PATH_CONF>/hunspell/en_GB/` directory.

You can download these files from [LibreOffice dictionaries](https://github.com/LibreOffice/dictionaries).
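
Before starting a node, it can help to confirm that the files are where the filter expects them. The following is a minimal sketch of such a check, using only the Python standard library. The function name `check_hunspell_locale` and the `conf_dir` argument (standing in for `<OS_PATH_CONF>`) are hypothetical, and treating more than one `.aff` file as a problem is an assumption of this sketch. The node's own loader decides what is actually accepted at startup:

```python
from pathlib import Path


def check_hunspell_locale(conf_dir: Path, locale: str) -> list[str]:
    """Return a list of problems found in <conf_dir>/hunspell/<locale>.

    An empty list means the directory contains the expected files.
    """
    locale_dir = Path(conf_dir) / "hunspell" / locale
    if not locale_dir.is_dir():
        return [f"missing directory: {locale_dir}"]

    aff_files = sorted(locale_dir.glob("*.aff"))
    dic_files = sorted(locale_dir.glob("*.dic"))

    problems = []
    if not aff_files:
        problems.append("no .aff file found")
    elif len(aff_files) > 1:
        # Assumption of this sketch: one affix file per locale.
        problems.append(f"expected a single .aff file, found {len(aff_files)}")
    if not dic_files:
        problems.append("no .dic file found")
    return problems
```

Calling `check_hunspell_locale(Path("/path/to/config"), "en_GB")`, where the path is a placeholder for your configuration directory, returns an empty list when the layout looks complete.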

## Parameters

The `hunspell` token filter can be configured with the following parameters.

Parameter | Required/Optional | Data type | Description
:--- | :--- | :--- | :---
`language/lang/locale` | At least one of the three is required | String | Specifies the language for the Hunspell dictionary.
`dedup` | Optional | Boolean | Determines whether to remove duplicate stems produced for the same token. Default is `true`.
`dictionary` | Optional | Array of strings | Configures the dictionary files to be used for the Hunspell dictionary. Default is all files in the `<OS_PATH_CONF>/hunspell/<locale>` directory.
`longest_only` | Optional | Boolean | Specifies whether only the longest stemmed version of the token should be returned. Default is `false`.
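
To make the effect of `dedup` and `longest_only` concrete, the following toy model applies both options to a list of candidate stems for a single token. It is an illustration of the parameter descriptions only, not Lucene's implementation: the function name `select_stems`, the candidate stems, and the tie-breaking rule (the first of several equally long stems wins) are all assumptions made for this sketch:

```python
def select_stems(stems: list[str], dedup: bool = True, longest_only: bool = False) -> list[str]:
    """Toy model of `dedup` and `longest_only` for one token's candidate stems."""
    result = list(stems)
    if dedup:
        # Keep the first occurrence of each stem, preserving order.
        result = list(dict.fromkeys(result))
    if longest_only and result:
        # Keep only the longest candidate; max() returns the first one on ties.
        result = [max(result, key=len)]
    return result


# Hypothetical candidate stems for one token.
candidates = ["mov", "move", "move"]
print(select_stems(candidates))                     # duplicates removed
print(select_stems(candidates, dedup=False))        # all candidates kept
print(select_stems(candidates, longest_only=True))  # single longest stem
```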
26+
27+
## Example
28+
29+
The following example request creates a new index named `my_index` and configures an analyzer with a `hunspell` filter:
30+
31+
```json
32+
PUT /my_index
33+
{
34+
"settings": {
35+
"analysis": {
36+
"filter": {
37+
"my_hunspell_filter": {
38+
"type": "hunspell",
39+
"lang": "en_GB",
40+
"dedup": true,
41+
"longest_only": true
42+
}
43+
},
44+
"analyzer": {
45+
"my_analyzer": {
46+
"type": "custom",
47+
"tokenizer": "standard",
48+
"filter": [
49+
"lowercase",
50+
"my_hunspell_filter"
51+
]
52+
}
53+
}
54+
}
55+
}
56+
}
57+
```
58+
{% include copy-curl.html %}
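
If you create similar indexes for several locales, the request body can be generated rather than written by hand. The following sketch builds the same body as the preceding request and prints it as JSON. The helper names `hunspell_filter` and `index_settings` are hypothetical, and the sketch does not send anything to a cluster; how the body is submitted depends on the client you use:

```python
import json


def hunspell_filter(locale: str, dedup: bool = True, longest_only: bool = False) -> dict:
    """Build a `hunspell` filter definition for the given locale."""
    if not locale:
        raise ValueError("a locale such as 'en_GB' is required")
    return {"type": "hunspell", "lang": locale, "dedup": dedup, "longest_only": longest_only}


def index_settings(filter_name: str, analyzer_name: str, filter_def: dict) -> dict:
    """Wrap a filter definition in index settings with a custom analyzer."""
    return {
        "settings": {
            "analysis": {
                "filter": {filter_name: filter_def},
                "analyzer": {
                    analyzer_name: {
                        "type": "custom",
                        "tokenizer": "standard",
                        # Lowercase first, then stem with the hunspell filter.
                        "filter": ["lowercase", filter_name],
                    }
                },
            }
        }
    }


body = index_settings("my_hunspell_filter", "my_analyzer", hunspell_filter("en_GB", longest_only=True))
print(json.dumps(body, indent=2))
```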

## Generated tokens

Use the following request to examine the tokens generated using the analyzer:

```json
POST /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "the turtle moves slowly"
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
  "tokens": [
    {
      "token": "the",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "turtle",
      "start_offset": 4,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "move",
      "start_offset": 11,
      "end_offset": 16,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "slow",
      "start_offset": 17,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
```
```
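
Note that the offsets in the response still refer to the original text, even where the token itself has been stemmed. The following sketch pairs each emitted token with the slice of the input that it covers. The helper name `surface_forms` is hypothetical, and the sketch assumes ASCII input such as this example, where Python string indexes line up with the reported offsets:

```python
text = "the turtle moves slowly"

# Tokens copied from the response above (only the fields used here).
tokens = [
    {"token": "the", "start_offset": 0, "end_offset": 3},
    {"token": "turtle", "start_offset": 4, "end_offset": 10},
    {"token": "move", "start_offset": 11, "end_offset": 16},
    {"token": "slow", "start_offset": 17, "end_offset": 23},
]


def surface_forms(text: str, tokens: list[dict]) -> list[tuple[str, str]]:
    """Pair each emitted token with the slice of the original text it covers."""
    return [(t["token"], text[t["start_offset"]:t["end_offset"]]) for t in tokens]


for stem, original in surface_forms(text, tokens):
    print(f"{stem!r} <- {original!r}")
# 'the' <- 'the'
# 'turtle' <- 'turtle'
# 'move' <- 'moves'
# 'slow' <- 'slowly'
```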

_analyzers/token-filters/index.md

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ Token filter | Underlying Lucene token filter| Description
 [`elision`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/elision/) | [ElisionFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/util/ElisionFilter.html) | Removes the specified [elisions](https://en.wikipedia.org/wiki/Elision) from the beginning of tokens. For example, changes `l'avion` (the plane) to `avion` (plane).
 [`fingerprint`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/fingerprint/) | [FingerprintFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/FingerprintFilter.html) | Sorts and deduplicates the token list and concatenates tokens into a single token.
 `flatten_graph` | [FlattenGraphFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/FlattenGraphFilter.html) | Flattens a token graph produced by a graph token filter, such as `synonym_graph` or `word_delimiter_graph`, making the graph suitable for indexing.
-`hunspell` | [HunspellStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/hunspell/HunspellStemFilter.html) | Uses [Hunspell](https://en.wikipedia.org/wiki/Hunspell) rules to stem tokens. Because Hunspell supports a word having multiple stems, this filter can emit multiple tokens for each consumed token. Requires you to configure one or more language-specific Hunspell dictionaries.
+[`hunspell`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/hunspell/) | [HunspellStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/hunspell/HunspellStemFilter.html) | Uses [Hunspell](https://en.wikipedia.org/wiki/Hunspell) rules to stem tokens. Because Hunspell allows a word to have multiple stems, this filter can emit multiple tokens for each consumed token. Requires the configuration of one or more language-specific Hunspell dictionaries.
 `hyphenation_decompounder` | [HyphenationCompoundWordTokenFilter](https://lucene.apache.org/core/9_8_0/analysis/common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilter.html) | Uses XML-based hyphenation patterns to find potential subwords in compound words and checks the subwords against the specified word list. The token output contains only the subwords found in the word list.
 [`keep_types`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/keep-types/) | [TypeTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/TypeTokenFilter.html) | Keeps or removes tokens of a specific type.
 `keep_word` | [KeepWordFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeepWordFilter.html) | Checks the tokens against the specified word list and keeps only those that are in the list.
