
Commit aa5da3b

Add dictionary decompounder docs #7979 (#7994) (#8749)
1 parent: 221778a

2 files changed: +102 -1 lines changed
Lines changed: 101 additions & 0 deletions
@@ -0,0 +1,101 @@
---
layout: default
title: Dictionary decompounder
parent: Token filters
nav_order: 110
---

# Dictionary decompounder token filter

The `dictionary_decompounder` token filter splits compound words into their constituent parts based on a predefined dictionary. It is particularly useful for languages such as German, Dutch, and Finnish, in which compound words are common and breaking them down can improve search relevance. For each token (word), the filter checks whether it can be split into smaller tokens from a list of known words. If it can, the filter emits the matching subtokens in addition to the original token.
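
For a quick preview of the splitting behavior, the filter can be defined inline in an `_analyze` request. The following is a minimal sketch; it assumes that the `_analyze` API accepts inline filter definitions, and the German word `Blumentopf` and its word list are illustrative only:

```json
POST /_analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "dictionary_decompounder",
      "word_list": ["blumen", "topf"]
    }
  ],
  "text": "Blumentopf"
}
```

If the inline definition is accepted, the output contains the original token `blumentopf` together with the subtokens `blumen` and `topf`.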

## Parameters

The `dictionary_decompounder` token filter has the following parameters.

Parameter | Required/Optional | Data type | Description
:--- | :--- | :--- | :---
`word_list` | Required unless `word_list_path` is configured | Array of strings | The dictionary of words that the filter uses to split compound words.
`word_list_path` | Required unless `word_list` is configured | String | A file path to a text file containing the dictionary words. Accepts either an absolute path or a path relative to the `config` directory. The dictionary file must be UTF-8 encoded, and each word must be listed on a separate line.
`min_word_size` | Optional | Integer | The minimum length of the entire compound word that will be considered for splitting. If a compound word is shorter than this value, it is not split. Default is `5`.
`min_subword_size` | Optional | Integer | The minimum length for any subword. If a subword is shorter than this value, it is not included in the output. Default is `2`.
`max_subword_size` | Optional | Integer | The maximum length for any subword. If a subword is longer than this value, it is not included in the output. Default is `15`.
`only_longest_match` | Optional | Boolean | If set to `true`, only the longest matching subword will be returned. Default is `false`.
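
To illustrate how the optional parameters might be combined, the following sketch defines a filter that reads its dictionary from a file. The index name `decompound_settings_example`, the filter name `my_file_decompounder`, and the path `analysis/decompound_words.txt` are hypothetical; the dictionary file would need to exist under the `config` directory before the index is created:

```json
PUT /decompound_settings_example
{
  "settings": {
    "analysis": {
      "filter": {
        "my_file_decompounder": {
          "type": "dictionary_decompounder",
          "word_list_path": "analysis/decompound_words.txt",
          "min_word_size": 7,
          "min_subword_size": 3,
          "max_subword_size": 10,
          "only_longest_match": true
        }
      }
    }
  }
}
```

With these settings, compound words shorter than 7 characters are left unsplit, subwords outside the 3--10 character range are not emitted, and only the longest matching subword is kept.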

## Example

The following example request creates a new index named `decompound_example` and configures an analyzer with the `dictionary_decompounder` filter:

```json
PUT /decompound_example
{
  "settings": {
    "analysis": {
      "filter": {
        "my_dictionary_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": ["slow", "green", "turtle"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_dictionary_decompounder"]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

## Generated tokens

Use the following request to examine the tokens generated using the analyzer:

```json
POST /decompound_example/_analyze
{
  "analyzer": "my_analyzer",
  "text": "slowgreenturtleswim"
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
  "tokens": [
    {
      "token": "slowgreenturtleswim",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "slow",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "green",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "turtle",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
```
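
To see how the subtokens can improve recall, the following sketch indexes the compound text into a field analyzed with `my_analyzer` and then searches for one of its parts. The index name `decompound_search_example` and the field name `title` are hypothetical; the analysis settings are the same as in the example above:

```json
PUT /decompound_search_example
{
  "settings": {
    "analysis": {
      "filter": {
        "my_dictionary_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": ["slow", "green", "turtle"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_dictionary_decompounder"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

PUT /decompound_search_example/_doc/1?refresh=true
{
  "title": "slowgreenturtleswim"
}

GET /decompound_search_example/_search
{
  "query": {
    "match": {
      "title": "turtle"
    }
  }
}
```

Because the compound token is expanded into its dictionary parts at index time, the query for `turtle` matches the document even though the stored text never contains `turtle` as a standalone word.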

_analyzers/token-filters/index.md

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ Token filter | Underlying Lucene token filter| Description
[`decimal_digit`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/decimal-digit/) | [DecimalDigitFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/DecimalDigitFilter.html) | Converts all digits in the Unicode decimal number general category to basic Latin digits (0--9).
[`delimited_payload`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/delimited-payload/) | [DelimitedPayloadTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/payloads/DelimitedPayloadTokenFilter.html) | Separates a token stream into tokens with corresponding payloads, based on a provided delimiter. A token consists of all characters preceding the delimiter, and a payload consists of all characters following the delimiter. For example, if the delimiter is `|`, then for the string `foo|bar`, `foo` is the token and `bar` is the payload.
[`delimited_term_freq`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/delimited-term-frequency/) | [DelimitedTermFrequencyTokenFilter](https://lucene.apache.org/core/9_7_0/analysis/common/org/apache/lucene/analysis/miscellaneous/DelimitedTermFrequencyTokenFilter.html) | Separates a token stream into tokens with corresponding term frequencies, based on a provided delimiter. A token consists of all characters before the delimiter, and a term frequency is the integer after the delimiter. For example, if the delimiter is `|`, then for the string `foo|5`, `foo` is the token and `5` is the term frequency.
- `dictionary_decompounder` | [DictionaryCompoundWordTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html) | Decomposes compound words found in many Germanic languages.
+ [`dictionary_decompounder`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/dictionary-decompounder/) | [DictionaryCompoundWordTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html) | Splits compound words into their constituent parts based on a predefined dictionary. Useful for many Germanic languages.
`edge_ngram` | [EdgeNGramTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.html) | Tokenizes the given token into edge n-grams (n-grams that start at the beginning of the token) of lengths between `min_gram` and `max_gram`. Optionally, keeps the original token.
`elision` | [ElisionFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/util/ElisionFilter.html) | Removes the specified [elisions](https://en.wikipedia.org/wiki/Elision) from the beginning of tokens. For example, changes `l'avion` (the plane) to `avion` (plane).
`fingerprint` | [FingerprintFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/FingerprintFilter.html) | Sorts and deduplicates the token list and concatenates tokens into a single token.
