
Commit 8c63d73

AntonEliatra, kolchfa-aws, and natebower authored and committed
Add elision token filter docs opensearch-project#7981 (opensearch-project#8026)
* adding elision token filter docs opensearch-project#7981
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
* Update elision.md
  Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>
* Update elision.md
  Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>
* updating parameter table
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
* Apply suggestions from code review
  Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
* Update _analyzers/token-filters/elision.md
  Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
* Apply suggestions from code review
  Co-authored-by: Nathan Bower <nbower@amazon.com>
  Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

---------

Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: Eric Pugh <epugh@opensourceconnections.com>
1 parent 388c78d commit 8c63d73

File tree

- _analyzers/token-filters/elision.md
- _analyzers/token-filters/index.md

2 files changed: +125, -1 lines changed

_analyzers/token-filters/elision.md

Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
---
layout: default
title: Elision
parent: Token filters
nav_order: 130
---

# Elision token filter

The `elision` token filter is used to remove elided characters from words in certain languages. Elision occurs in languages such as French, in which a word is contracted and combined with the following word, typically by omitting a vowel and replacing it with an apostrophe. For example, `le` + `avion` (the plane) becomes `l'avion`.

The `elision` token filter is already preconfigured in the following [language analyzers]({{site.url}}{{site.baseurl}}/analyzers/language-analyzers/): `catalan`, `french`, `irish`, and `italian`.
{: .note}

## Parameters

The custom `elision` token filter can be configured with the following parameters.

Parameter | Required/Optional | Data type | Description
:--- | :--- | :--- | :---
`articles` | Required if `articles_path` is not configured | Array of strings | Defines which articles or short words should be removed when they appear as part of an elision.
`articles_path` | Required if `articles` is not configured | String | Specifies the path to a custom list of articles that should be removed during the analysis process.
`articles_case` | Optional | Boolean | Specifies whether elision matching is case insensitive. When `true`, the filter also matches elisions that begin with an uppercase letter, such as `L'`. Default is `false`.
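Instead of listing the articles inline, they can also be loaded from a file by using `articles_path`, and elisions that begin with an uppercase letter, such as `L'`, can be matched by setting `articles_case` to `true`. The following request is an illustrative sketch rather than part of the original example: the index name, filter name, and file path are hypothetical, and the referenced file is expected to live under the OpenSearch config directory and contain one article per line:

```json
PUT /french_texts_from_file
{
  "settings": {
    "analysis": {
      "filter": {
        "french_elision_from_file": {
          "type": "elision",
          "articles_path": "analysis/french_articles.txt",
          "articles_case": true
        }
      },
      "analyzer": {
        "french_analyzer_from_file": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["french_elision_from_file", "lowercase"]
        }
      }
    }
  }
}
```

Because `articles_case` is set to `true`, this sketch can strip `L'` from `L'étudiant` even though the filter runs before the `lowercase` filter.
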
## Example

The default set of French elisions is `l'`, `m'`, `t'`, `qu'`, `n'`, `s'`, `j'`, `d'`, `c'`, `jusqu'`, `quoiqu'`, `lorsqu'`, and `puisqu'`. You can update this by configuring the `french_elision` token filter. The following example request creates a new index named `french_texts` and configures an analyzer with the `french_elision` filter:

```json
PUT /french_texts
{
  "settings": {
    "analysis": {
      "filter": {
        "french_elision": {
          "type": "elision",
          "articles": [ "l", "t", "m", "d", "n", "s", "j" ]
        }
      },
      "analyzer": {
        "french_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "french_elision"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "french_analyzer"
      }
    }
  }
}
```
{% include copy-curl.html %}

## Generated tokens

Use the following request to examine the tokens generated using the analyzer:

```json
POST /french_texts/_analyze
{
  "analyzer": "french_analyzer",
  "text": "L'étudiant aime l'école et le travail."
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
  "tokens": [
    {
      "token": "étudiant",
      "start_offset": 0,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "aime",
      "start_offset": 11,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "école",
      "start_offset": 16,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "et",
      "start_offset": 24,
      "end_offset": 26,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "le",
      "start_offset": 27,
      "end_offset": 29,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "travail",
      "start_offset": 30,
      "end_offset": 37,
      "type": "<ALPHANUM>",
      "position": 5
    }
  ]
}
```
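
As a brief follow-up sketch (not part of the committed page), the effect of the filter can be verified with a search: because `french_analyzer` is also applied to query text, the elided form `l'école` is reduced to `école` and matches the indexed token. The following hypothetical requests index a document into `french_texts` and then run a match query:

```json
PUT /french_texts/_doc/1
{
  "text": "L'étudiant aime l'école et le travail."
}
```
{% include copy-curl.html %}

```json
GET /french_texts/_search
{
  "query": {
    "match": {
      "text": "l'école"
    }
  }
}
```
{% include copy-curl.html %}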

_analyzers/token-filters/index.md

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ Token filter | Underlying Lucene token filter| Description
[`delimited_term_freq`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/delimited-term-frequency/) | [DelimitedTermFrequencyTokenFilter](https://lucene.apache.org/core/9_7_0/analysis/common/org/apache/lucene/analysis/miscellaneous/DelimitedTermFrequencyTokenFilter.html) | Separates a token stream into tokens with corresponding term frequencies, based on a provided delimiter. A token consists of all characters before the delimiter, and a term frequency is the integer after the delimiter. For example, if the delimiter is `|`, then for the string `foo|5`, `foo` is the token and `5` is the term frequency.
[`dictionary_decompounder`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/dictionary-decompounder/) | [DictionaryCompoundWordTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html) | Decomposes compound words found in many Germanic languages.
[`edge_ngram`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/edge-ngram/) | [EdgeNGramTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.html) | Tokenizes the given token into edge n-grams (n-grams that start at the beginning of the token) of lengths between `min_gram` and `max_gram`. Optionally, keeps the original token.
-`elision` | [ElisionFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/util/ElisionFilter.html) | Removes the specified [elisions](https://en.wikipedia.org/wiki/Elision) from the beginning of tokens. For example, changes `l'avion` (the plane) to `avion` (plane).
+[`elision`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/elision/) | [ElisionFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/util/ElisionFilter.html) | Removes the specified [elisions](https://en.wikipedia.org/wiki/Elision) from the beginning of tokens. For example, changes `l'avion` (the plane) to `avion` (plane).
[`fingerprint`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/fingerprint/) | [FingerprintFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/FingerprintFilter.html) | Sorts and deduplicates the token list and concatenates tokens into a single token.
`flatten_graph` | [FlattenGraphFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/FlattenGraphFilter.html) | Flattens a token graph produced by a graph token filter, such as `synonym_graph` or `word_delimiter_graph`, making the graph suitable for indexing.
`hunspell` | [HunspellStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/hunspell/HunspellStemFilter.html) | Uses [Hunspell](https://en.wikipedia.org/wiki/Hunspell) rules to stem tokens. Because Hunspell supports a word having multiple stems, this filter can emit multiple tokens for each consumed token. Requires you to configure one or more language-specific Hunspell dictionaries.
