
Commit 9a52a36

github-actions[bot], kolchfa-aws, and natebower committed
Add keep words token filter docs #8064 (#8124)
* adding keep words token filter docs #8064
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
* fixing vale errors
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
* Update keep-words.md
  Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>
* updating parameter table
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
* Update keep-words.md
  Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>
* Apply suggestions from code review
  Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
  Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>
* Apply suggestions from code review
  Co-authored-by: Nathan Bower <nbower@amazon.com>
  Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

---------

Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
(cherry picked from commit 1bb7f3e)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
1 parent 7ccdb50 commit 9a52a36

File tree

2 files changed: +93 -1 lines changed


_analyzers/token-filters/index.md

Lines changed: 1 addition & 1 deletion
@@ -33,7 +33,7 @@ Token filter | Underlying Lucene token filter | Description

  [`hunspell`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/hunspell/) | [HunspellStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/hunspell/HunspellStemFilter.html) | Uses [Hunspell](https://en.wikipedia.org/wiki/Hunspell) rules to stem tokens. Because Hunspell allows a word to have multiple stems, this filter can emit multiple tokens for each consumed token. Requires the configuration of one or more language-specific Hunspell dictionaries.
  `hyphenation_decompounder` | [HyphenationCompoundWordTokenFilter](https://lucene.apache.org/core/9_8_0/analysis/common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilter.html) | Uses XML-based hyphenation patterns to find potential subwords in compound words and checks the subwords against the specified word list. The token output contains only the subwords found in the word list.
  [`keep_types`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/keep-types/) | [TypeTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/TypeTokenFilter.html) | Keeps or removes tokens of a specific type.
- `keep_word` | [KeepWordFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeepWordFilter.html) | Checks the tokens against the specified word list and keeps only those that are in the list.
+ [`keep_words`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/keep-words/) | [KeepWordFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeepWordFilter.html) | Checks the tokens against the specified word list and keeps only those that are in the list.
  `keyword_marker` | [KeywordMarkerFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeywordMarkerFilter.html) | Marks specified tokens as keywords, preventing them from being stemmed.
  `keyword_repeat` | [KeywordRepeatFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeywordRepeatFilter.html) | Emits each incoming token twice: once as a keyword and once as a non-keyword.
  `kstem` | [KStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/en/KStemFilter.html) | Provides kstem-based stemming for the English language. Combines algorithmic stemming with a built-in dictionary.
_analyzers/token-filters/keep-words.md

Lines changed: 92 additions & 0 deletions
---
layout: default
title: Keep words
parent: Token filters
nav_order: 190
---

# Keep words token filter

The `keep_words` token filter is designed to keep only certain words during the analysis process. This filter is useful if you have a large body of text but are only interested in certain keywords or terms.

## Parameters

The `keep_words` token filter can be configured with the following parameters.

Parameter | Required/Optional | Data type | Description
:--- | :--- | :--- | :---
`keep_words` | Required if `keep_words_path` is not configured | List of strings | The list of words to keep.
`keep_words_path` | Required if `keep_words` is not configured | String | The path to the file containing the list of words to keep.
`keep_words_case` | Optional | Boolean | Whether to lowercase all words during comparison. Default is `false`.

## Example

The following example request creates a new index named `my_index` and configures an analyzer with a `keep_words` filter:

```json
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_keep_word": {
          "tokenizer": "standard",
          "filter": [ "keep_words_filter" ]
        }
      },
      "filter": {
        "keep_words_filter": {
          "type": "keep",
          "keep_words": ["example", "world", "opensearch"],
          "keep_words_case": true
        }
      }
    }
  }
}
```
{% include copy-curl.html %}
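
For longer word lists, the same filter can load its words from a file by using `keep_words_path` instead of an inline `keep_words` array. The following variant is an illustrative sketch rather than part of the original page: the index name `my_index_from_file` and the word list file `analyzers/keep-words.txt` (one word per line, typically resolved relative to the OpenSearch config directory) are assumptions made for the example:

```json
PUT my_index_from_file
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_keep_word": {
          "tokenizer": "standard",
          "filter": [ "keep_words_filter" ]
        }
      },
      "filter": {
        "keep_words_filter": {
          "type": "keep",
          "keep_words_path": "analyzers/keep-words.txt"
        }
      }
    }
  }
}
```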

## Generated tokens

Use the following request to examine the tokens generated using the analyzer:

```json
GET /my_index/_analyze
{
  "analyzer": "custom_keep_word",
  "text": "Hello, world! This is an OpenSearch example."
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
  "tokens": [
    {
      "token": "world",
      "start_offset": 7,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "OpenSearch",
      "start_offset": 25,
      "end_offset": 35,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "example",
      "start_offset": 36,
      "end_offset": 43,
      "type": "<ALPHANUM>",
      "position": 6
    }
  ]
}
```
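
Note that the example keeps `OpenSearch` only because `keep_words_case: true` lowercases all words during comparison. As a quick, hypothetical check of the default behavior, a transient filter can be defined inline in the `_analyze` request. With `keep_words_case` left at its default of `false`, the comparison is case sensitive, so based on the parameter semantics above, `OpenSearch` should no longer match the lowercase list entry `opensearch`, leaving only `world` and `example` in the output:

```json
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep",
      "keep_words": ["example", "world", "opensearch"]
    }
  ],
  "text": "Hello, world! This is an OpenSearch example."
}
```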
