---
layout: default
title: Dictionary decompounder
parent: Token filters
nav_order: 110
---

# Dictionary decompounder token filter

The `dictionary_decompounder` token filter splits compound words into their constituent parts based on a predefined dictionary. This filter is particularly useful for languages like German, Dutch, or Finnish, in which compound words are common, because breaking them down can improve search relevance. The filter checks each token against a list of known words and, if the token can be decomposed into known words, adds the resulting subwords to the token stream. For example, with a dictionary containing `fuß` and `ball`, the German compound `fußball` is split into its two parts.

## Parameters

The `dictionary_decompounder` token filter has the following parameters.

Parameter | Required/Optional | Data type | Description
:--- | :--- | :--- | :---
`word_list` | Required unless `word_list_path` is configured | Array of strings | The dictionary of words that the filter uses to split compound words.
`word_list_path` | Required unless `word_list` is configured | String | A file path to a text file containing the dictionary words. Accepts either an absolute path or a path relative to the `config` directory. The dictionary file must be UTF-8 encoded, and each word must be listed on a separate line.
`min_word_size` | Optional | Integer | The minimum length of the entire compound word that will be considered for splitting. If a compound word is shorter than this value, it is not split. Default is `5`.
`min_subword_size` | Optional | Integer | The minimum length for any subword. If a subword is shorter than this value, it is not included in the output. Default is `2`.
`max_subword_size` | Optional | Integer | The maximum length for any subword. If a subword is longer than this value, it is not included in the output. Default is `15`.
`only_longest_match` | Optional | Boolean | If set to `true`, only the longest matching subword will be returned. Default is `false`.

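The length and matching parameters can be tuned together. The following request is a minimal sketch, in which the index name `parameters_example`, the filter name `strict_decompounder`, and the word list are chosen purely for illustration. This configuration splits only compound words of at least 8 characters, keeps subwords of 4 to 10 characters, and returns only the longest matching subword:

```json
PUT /parameters_example
{
  "settings": {
    "analysis": {
      "filter": {
        "strict_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": ["foot", "ball", "game"],
          "min_word_size": 8,
          "min_subword_size": 4,
          "max_subword_size": 10,
          "only_longest_match": true
        }
      }
    }
  }
}
```
{% include copy-curl.html %}
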
## Example

The following example request creates a new index named `decompound_example` and configures an analyzer with the `dictionary_decompounder` filter:

```json
PUT /decompound_example
{
  "settings": {
    "analysis": {
      "filter": {
        "my_dictionary_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": ["slow", "green", "turtle"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_dictionary_decompounder"]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

## Generated tokens

Use the following request to examine the tokens generated using the analyzer:

```json
POST /decompound_example/_analyze
{
  "analyzer": "my_analyzer",
  "text": "slowgreenturtleswim"
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
  "tokens": [
    {
      "token": "slowgreenturtleswim",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "slow",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "green",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "turtle",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
```
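
Note that each subword shares the start offset, end offset, and position of the original compound token, so all of the generated tokens point back to the same span of source text. To apply the analyzer at index and search time, reference it in a field mapping. The following request is a minimal sketch in which the index name `decompound_mapping_example` and the `title` field are illustrative:

```json
PUT /decompound_mapping_example
{
  "settings": {
    "analysis": {
      "filter": {
        "my_dictionary_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": ["slow", "green", "turtle"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_dictionary_decompounder"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
```
{% include copy-curl.html %}

With this mapping, a match query for `turtle` on the `title` field should match a document whose `title` contains the compound `slowgreenturtleswim`, because the subword `turtle` is indexed alongside the original token.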