
Commit 8d3ec41

AntonEliatra, kolchfa-aws, and natebower authored
Add keep type docs #8063 (#8122)
* adding keep type docs #8063
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>

* Update keep-types.md
  Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>

* updating parameter table
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>

* Update keep-types.md
  Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>

* Apply suggestions from code review
  Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
  Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>

* fixing types
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>

* Apply suggestions from code review
  Co-authored-by: Nathan Bower <nbower@amazon.com>
  Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

---------

Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
1 parent d4bb0f5 commit 8d3ec41

2 files changed: +116 -1 lines changed


_analyzers/token-filters/index.md

Lines changed: 1 addition & 1 deletion
@@ -32,7 +32,7 @@ Token filter | Underlying Lucene token filter| Description
 `flatten_graph` | [FlattenGraphFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/FlattenGraphFilter.html) | Flattens a token graph produced by a graph token filter, such as `synonym_graph` or `word_delimiter_graph`, making the graph suitable for indexing.
 `hunspell` | [HunspellStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/hunspell/HunspellStemFilter.html) | Uses [Hunspell](https://en.wikipedia.org/wiki/Hunspell) rules to stem tokens. Because Hunspell supports a word having multiple stems, this filter can emit multiple tokens for each consumed token. Requires you to configure one or more language-specific Hunspell dictionaries.
 `hyphenation_decompounder` | [HyphenationCompoundWordTokenFilter](https://lucene.apache.org/core/9_8_0/analysis/common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilter.html) | Uses XML-based hyphenation patterns to find potential subwords in compound words and checks the subwords against the specified word list. The token output contains only the subwords found in the word list.
-`keep_types` | [TypeTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/TypeTokenFilter.html) | Keeps or removes tokens of a specific type.
+[`keep_types`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/keep-types/) | [TypeTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/TypeTokenFilter.html) | Keeps or removes tokens of a specific type.
 `keep_word` | [KeepWordFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeepWordFilter.html) | Checks the tokens against the specified word list and keeps only those that are in the list.
 `keyword_marker` | [KeywordMarkerFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeywordMarkerFilter.html) | Marks specified tokens as keywords, preventing them from being stemmed.
 `keyword_repeat` | [KeywordRepeatFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeywordRepeatFilter.html) | Emits each incoming token twice: once as a keyword and once as a non-keyword.
Lines changed: 115 additions & 0 deletions
@@ -0,0 +1,115 @@
---
layout: default
title: Keep types
parent: Token filters
nav_order: 180
---

# Keep types token filter

The `keep_types` token filter controls which tokens are kept or discarded during text analysis based on their token type. Different tokenizers produce different token types, such as `<HOST>`, `<NUM>`, or `<ALPHANUM>`.

The `keyword`, `simple_pattern`, and `simple_pattern_split` tokenizers do not support the `keep_types` token filter because they do not support token type attributes.
{: .note}
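
To check which token types a tokenizer assigns before choosing values for the `types` parameter, one option is to call the `_analyze` API with only a tokenizer. The following is a minimal sketch that assumes the `standard` tokenizer and an arbitrary sample text; each token in the response includes a `type` field:

```json
GET /_analyze
{
  "tokenizer": "standard",
  "text": "Hello 2 world!"
}
```
{% include copy-curl.html %}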

## Parameters

The `keep_types` token filter can be configured with the following parameters.

Parameter | Required/Optional | Data type | Description
:--- | :--- | :--- | :---
`types` | Required | List of strings | List of token types to be kept or discarded (determined by the `mode`).
`mode` | Optional | String | Whether to `include` or `exclude` the token types specified in `types`. Default is `include`.

## Example

The following example request creates a new index named `test_index` and configures an analyzer with a `keep_types` filter:

```json
PUT /test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "keep_types_filter"]
        }
      },
      "filter": {
        "keep_types_filter": {
          "type": "keep_types",
          "types": ["<ALPHANUM>"]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

## Generated tokens

Use the following request to examine the tokens generated using the analyzer:

```json
GET /test_index/_analyze
{
  "analyzer": "custom_analyzer",
  "text": "Hello 2 world! This is an example."
}
```
{% include copy-curl.html %}

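The response contains the generated tokens. The token `2` has the `<NUM>` type and is therefore removed, which is why position `1` is missing from the output: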

```json
{
  "tokens": [
    {
      "token": "hello",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "world",
      "start_offset": 8,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "this",
      "start_offset": 15,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "is",
      "start_offset": 20,
      "end_offset": 22,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "an",
      "start_offset": 23,
      "end_offset": 25,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "example",
      "start_offset": 26,
      "end_offset": 33,
      "type": "<ALPHANUM>",
      "position": 6
    }
  ]
}
```
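
The `mode` parameter can invert the selection. As a sketch, the following request passes an inline `keep_types` filter definition to the `_analyze` API with `mode` set to `exclude`. The choice of `<NUM>` and the reuse of the same sample text are assumptions made for illustration. With the `standard` tokenizer, the `<NUM>` token `2` should be removed and the `<ALPHANUM>` tokens kept:

```json
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep_types",
      "types": ["<NUM>"],
      "mode": "exclude"
    }
  ],
  "text": "Hello 2 world! This is an example."
}
```
{% include copy-curl.html %}

Because this request defines the filter inline, it does not require an index and leaves `test_index` unchanged.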
