---
layout: default
title: Fingerprint
parent: Token filters
nav_order: 140
---

# Fingerprint token filter

The `fingerprint` token filter standardizes and deduplicates text, which is particularly useful when consistency in text processing is crucial. The filter processes text using the following steps:

1. **Lowercasing**: Converts all text to lowercase.
2. **Splitting**: Breaks the text into tokens.
3. **Sorting**: Arranges the tokens in alphabetical order.
4. **Removing duplicates**: Eliminates repeated tokens.
5. **Joining tokens**: Combines the tokens into a single string, joined by a space or another specified separator.
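The five steps above can be sketched in Python. This is an illustrative approximation of the filter's behavior, not the actual Lucene implementation:

```python
def fingerprint(text, separator=" "):
    """Illustrative sketch of the fingerprint filter steps."""
    tokens = text.lower().split()         # Steps 1-2: lowercase and split
    unique_sorted = sorted(set(tokens))   # Steps 3-4: sort and deduplicate
    return separator.join(unique_sorted)  # Step 5: join with the separator

print(fingerprint("The quick brown fox and the quick dog"))
# and brown dog fox quick the
```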

## Parameters

The `fingerprint` token filter can be configured with the following two parameters.

Parameter | Required/Optional | Data type | Description
:--- | :--- | :--- | :---
`max_output_size` | Optional | Integer | Limits the length of the generated fingerprint string. If the concatenated string exceeds `max_output_size`, the filter produces no output, resulting in an empty token. Default is `255`.
`separator` | Optional | String | The character(s) used to join the tokens into a single string after they have been sorted and deduplicated. Default is a space (`" "`).

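The interaction of these two parameters can be illustrated with a short Python sketch (a hypothetical approximation of the filter's behavior, not OpenSearch code):

```python
def fingerprint(text, separator=" ", max_output_size=255):
    """Approximates the filter: sort, deduplicate, then join."""
    result = separator.join(sorted(set(text.lower().split())))
    # When the joined string exceeds max_output_size, the filter
    # emits no token at all rather than truncating the string.
    return result if len(result) <= max_output_size else ""

print(fingerprint("alpha beta alpha", separator="-"))  # alpha-beta
print(fingerprint("alpha beta", max_output_size=5))    # "" (10 chars > 5)
```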
## Example

The following example request creates a new index named `my_index` and configures an analyzer with a `fingerprint` token filter:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_fingerprint": {
          "type": "fingerprint",
          "max_output_size": 200,
          "separator": "-"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_fingerprint"
          ]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

## Generated tokens

Use the following request to examine the tokens generated using the analyzer:

```json
POST /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "OpenSearch is a powerful search engine that scales easily"
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
  "tokens": [
    {
      "token": "a-easily-engine-is-opensearch-powerful-scales-search-that",
      "start_offset": 0,
      "end_offset": 57,
      "type": "fingerprint",
      "position": 0
    }
  ]
}
```