Commit 388c78d

AntonEliatra, kolchfa-aws, and natebower authored and committed

Add fingerprint token filter opensearch-project#7982 (opensearch-project#8059)

* Adding fingerprint token filter opensearch-project#7982
* Fixing typo
* Update fingerprint.md
* Updating parameter table
* Apply suggestions from code review
* Update _analyzers/token-filters/fingerprint.md

Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Signed-off-by: Eric Pugh <epugh@opensourceconnections.com>
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>

1 parent 2c1a0d4 commit 388c78d

File tree

2 files changed: +87 −1

- _analyzers/token-filters/fingerprint.md
- _analyzers/token-filters/index.md

_analyzers/token-filters/fingerprint.md

Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
---
layout: default
title: Fingerprint
parent: Token filters
nav_order: 140
---

# Fingerprint token filter

The `fingerprint` token filter is used to standardize and deduplicate text. This is particularly useful when consistency in text processing is crucial. The `fingerprint` token filter achieves this by processing text using the following steps, as shown in the sketch after the list:
11+
12+
1. **Lowercasing**: Converts all text to lowercase.
13+
2. **Splitting**: Breaks the text into tokens.
14+
3. **Sorting**: Arranges the tokens in alphabetical order.
15+
4. **Removing duplicates**: Eliminates repeated tokens.
16+
5. **Joining tokens**: Combines the tokens into a single string, typically joined by a space or another specified separator.
17+
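
As a minimal sketch of these steps, the following `_analyze` request defines the filter inline, so no index is required. The request and sample text are illustrative and are not part of the original example:

```json
POST /_analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    { "type": "fingerprint" }
  ],
  "text": "The quick Fox and the quick fox"
}
```
{% include copy-curl.html %}

Assuming the default space separator, the duplicate tokens collapse and the response contains the single token `and fox quick the`.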

## Parameters

The `fingerprint` token filter can be configured with the following two parameters.
Parameter | Required/Optional | Data type | Description
:--- | :--- | :--- | :---
`max_output_size` | Optional | Integer | Limits the length of the generated fingerprint string. If the concatenated string exceeds `max_output_size`, the filter produces no output, resulting in an empty token. Default is `255`.
`separator` | Optional | String | Defines the character(s) used to join the tokens into a single string after they have been sorted and deduplicated. Default is a space (`" "`).

## Example

The following example request creates a new index named `my_index` and configures an analyzer with a `fingerprint` token filter:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_fingerprint": {
          "type": "fingerprint",
          "max_output_size": 200,
          "separator": "-"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_fingerprint"
          ]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

## Generated tokens

Use the following request to examine the tokens generated using the analyzer:

```json
POST /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "OpenSearch is a powerful search engine that scales easily"
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
  "tokens": [
    {
      "token": "a-easily-engine-is-opensearch-powerful-scales-search-that",
      "start_offset": 0,
      "end_offset": 57,
      "type": "fingerprint",
      "position": 0
    }
  ]
}
```
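
Because the filter deduplicates tokens, repeated words collapse into a single occurrence. As a sketch reusing `my_analyzer` from the example above (the sample text is an assumption):

```json
POST /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "search and search and SEARCH"
}
```
{% include copy-curl.html %}

With the `-` separator configured above, the expected single token is `and-search`.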

_analyzers/token-filters/index.md

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@ Token filter | Underlying Lucene token filter| Description
 [`dictionary_decompounder`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/dictionary-decompounder/) | [DictionaryCompoundWordTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html) | Decomposes compound words found in many Germanic languages.
 [`edge_ngram`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/edge-ngram/) | [EdgeNGramTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.html) | Tokenizes the given token into edge n-grams (n-grams that start at the beginning of the token) of lengths between `min_gram` and `max_gram`. Optionally, keeps the original token.
 `elision` | [ElisionFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/util/ElisionFilter.html) | Removes the specified [elisions](https://en.wikipedia.org/wiki/Elision) from the beginning of tokens. For example, changes `l'avion` (the plane) to `avion` (plane).
-`fingerprint` | [FingerprintFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/FingerprintFilter.html) | Sorts and deduplicates the token list and concatenates tokens into a single token.
+[`fingerprint`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/fingerprint/) | [FingerprintFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/FingerprintFilter.html) | Sorts and deduplicates the token list and concatenates tokens into a single token.
 `flatten_graph` | [FlattenGraphFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/FlattenGraphFilter.html) | Flattens a token graph produced by a graph token filter, such as `synonym_graph` or `word_delimiter_graph`, making the graph suitable for indexing.
 `hunspell` | [HunspellStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/hunspell/HunspellStemFilter.html) | Uses [Hunspell](https://en.wikipedia.org/wiki/Hunspell) rules to stem tokens. Because Hunspell supports a word having multiple stems, this filter can emit multiple tokens for each consumed token. Requires you to configure one or more language-specific Hunspell dictionaries.
 `hyphenation_decompounder` | [HyphenationCompoundWordTokenFilter](https://lucene.apache.org/core/9_8_0/analysis/common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilter.html) | Uses XML-based hyphenation patterns to find potential subwords in compound words and checks the subwords against the specified word list. The token output contains only the subwords found in the word list.
