
Commit 0cab398

AntonEliatra, kolchfa-aws, and natebower authored and committed
adding language analyzers (opensearch-project#8591)
* adding arabic language analyzer
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
* Add grandparent to arabic analyzer
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
* adding more details
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
* adding armenian language analyzer
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
* adding basque bengali and brazilian language analyzers
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
* adding bulgarian catalan and cjk language analyzers
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
* adding czech,danish,dutch,english,estonian,finnish,french and galician analyzer docs
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
* adding german,greek,hindi,hungarian,indonesian,irish,italian,latvian,lithuanian,norwegian and persion laguage analyzer docs
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
* adding portuguese,romanian,russian,sorani,spanish,swedish,thai and turkish language analyzer docs
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
* Apply suggestions from code review
  Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
  Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>
* updating as per pr review
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
* fixing broken link
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
* Apply suggestions from code review
  Co-authored-by: Nathan Bower <nbower@amazon.com>
  Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>
* Update _analyzers/language-analyzers/index.md
  Co-authored-by: Nathan Bower <nbower@amazon.com>
  Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
* Add redirect to index page
  Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

---------

Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Fanit Kolchina <kolchfa@amazon.com>
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: Eric Pugh <epugh@opensourceconnections.com>
1 parent d2a36b5 commit 0cab398

38 files changed: +5499 -46 lines changed

_analyzers/language-analyzers.md

Lines changed: 0 additions & 44 deletions
This file was deleted.
Lines changed: 182 additions & 0 deletions
@@ -0,0 +1,182 @@
---
layout: default
title: Arabic
parent: Language analyzers
grand_parent: Analyzers
nav_order: 10
---

# Arabic analyzer

The built-in `arabic` analyzer can be applied to a text field using the following command:

```json
PUT /arabic-index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "arabic"
      }
    }
  }
}
```
{% include copy-curl.html %}

## Stem exclusion

You can use `stem_exclusion` to keep specific words from being stemmed by this language analyzer. The following command creates an index whose `arabic` analyzer excludes two words from stemming:

```json
PUT index_with_stem_exclusion_arabic
{
  "settings": {
    "analysis": {
      "analyzer": {
        "stem_exclusion_arabic_analyzer": {
          "type": "arabic",
          "stem_exclusion": ["تكنولوجيا", "سلطة"]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

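As a rough illustration, a request body like the one in this section can also be assembled programmatically. The following Python sketch is only one way to do that: the helper name `stem_exclusion_settings` is hypothetical and not part of any OpenSearch client, and it builds a dictionary without sending a request. It strips surrounding whitespace from each entry on the assumption that exclusion entries are compared with whole tokens, so an entry with a stray space is unlikely to match anything:

```python
import json


def stem_exclusion_settings(analyzer_name, language, words):
    """Build an index-settings body for a language analyzer with stem_exclusion.

    Hypothetical helper for illustration; it only assembles a dict.
    """
    # Drop surrounding whitespace and skip entries that end up empty.
    cleaned = [w.strip() for w in words if w.strip()]
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    analyzer_name: {
                        "type": language,
                        "stem_exclusion": cleaned,
                    }
                }
            }
        }
    }


body = stem_exclusion_settings(
    "stem_exclusion_arabic_analyzer", "arabic", ["تكنولوجيا", "سلطة "]
)
# ensure_ascii=False keeps the Arabic text readable in the printed JSON.
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The printed JSON can be used as the body of a `PUT` request to a new index.
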
## Arabic analyzer internals

The `arabic` analyzer is built using the following components:

- Tokenizer: `standard`

- Token filters:
  - lowercase
  - decimal_digit
  - stop (Arabic)
  - normalization (Arabic)
  - keyword
  - stemmer (Arabic)

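Of the filters listed above, `decimal_digit` is the easiest to reason about in isolation: it folds Unicode decimal digits, such as Arabic-Indic digits, to ASCII `0`-`9`. The following Python sketch approximates that single step using the standard `unicodedata` module. It is an illustration of the idea, not the analyzer's actual implementation, and the function name `fold_decimal_digits` is hypothetical:

```python
import unicodedata


def fold_decimal_digits(token):
    """Replace every Unicode decimal digit with its ASCII equivalent (0-9)."""
    folded = []
    for ch in token:
        value = unicodedata.decimal(ch, None)  # None for non-digit characters
        folded.append(str(value) if value is not None else ch)
    return "".join(folded)


print(fold_decimal_digits("١٢٣٤٥٦"))  # Arabic-Indic digits become 123456
```
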
## Custom Arabic analyzer

You can create a custom Arabic analyzer using the following command:

```json
PUT /arabic-index
{
  "settings": {
    "analysis": {
      "filter": {
        "arabic_stop": {
          "type": "stop",
          "stopwords": "_arabic_"
        },
        "arabic_stemmer": {
          "type": "stemmer",
          "language": "arabic"
        },
        "arabic_normalization": {
          "type": "arabic_normalization"
        },
        "decimal_digit": {
          "type": "decimal_digit"
        },
        "arabic_keywords": {
          "type": "keyword_marker",
          "keywords": []
        }
      },
      "analyzer": {
        "arabic_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "arabic_normalization",
            "decimal_digit",
            "arabic_stop",
            "arabic_keywords",
            "arabic_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "arabic_analyzer"
      }
    }
  }
}
```
{% include copy-curl.html %}

## Generated tokens

Use the following request to examine the tokens generated using the analyzer:

```json
POST /arabic-index/_analyze
{
  "field": "content",
  "text": "الطلاب يدرسون في الجامعات العربية. أرقامهم ١٢٣٤٥٦."
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
  "tokens": [
    {
      "token": "طلاب",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "يدرس",
      "start_offset": 7,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "جامع",
      "start_offset": 17,
      "end_offset": 25,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "عرب",
      "start_offset": 26,
      "end_offset": 33,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "ارقامهم",
      "start_offset": 35,
      "end_offset": 42,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "123456",
      "start_offset": 43,
      "end_offset": 49,
      "type": "<NUM>",
      "position": 6
    }
  ]
}
```
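
Note that the response above has no token at position 2. That position belongs to the third word of the input, `في`, which was most likely removed by the Arabic stop filter; the remaining tokens keep their original positions. The following Python sketch shows one way to spot such gaps in an `_analyze` response. The function name `missing_positions` is hypothetical, and the token list is reduced to the `position` values copied from the response:

```python
def missing_positions(tokens):
    """Return positions skipped between the first and last returned token."""
    seen = {t["position"] for t in tokens}
    if not seen:
        return []
    return [p for p in range(min(seen), max(seen) + 1) if p not in seen]


# Positions copied from the response above (other fields omitted).
response_tokens = [{"position": p} for p in (0, 1, 3, 4, 5, 6)]
print(missing_positions(response_tokens))  # [2]
```
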
Lines changed: 137 additions & 0 deletions
@@ -0,0 +1,137 @@
---
layout: default
title: Armenian
parent: Language analyzers
grand_parent: Analyzers
nav_order: 20
---

# Armenian analyzer

The built-in `armenian` analyzer can be applied to a text field using the following command:

```json
PUT /armenian-index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "armenian"
      }
    }
  }
}
```
{% include copy-curl.html %}

## Stem exclusion

You can use `stem_exclusion` to keep specific words from being stemmed by this language analyzer. The following command creates an index whose `armenian` analyzer excludes two words from stemming:

```json
PUT index_with_stem_exclusion_armenian_analyzer
{
  "settings": {
    "analysis": {
      "analyzer": {
        "stem_exclusion_armenian_analyzer": {
          "type": "armenian",
          "stem_exclusion": ["բարև", "խաղաղություն"]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

## Armenian analyzer internals

The `armenian` analyzer is built using the following components:

- Tokenizer: `standard`

- Token filters:
  - lowercase
  - stop (Armenian)
  - keyword
  - stemmer (Armenian)

## Custom Armenian analyzer

You can create a custom Armenian analyzer using the following command:

```json
PUT /armenian-index
{
  "settings": {
    "analysis": {
      "filter": {
        "armenian_stop": {
          "type": "stop",
          "stopwords": "_armenian_"
        },
        "armenian_stemmer": {
          "type": "stemmer",
          "language": "armenian"
        },
        "armenian_keywords": {
          "type": "keyword_marker",
          "keywords": []
        }
      },
      "analyzer": {
        "armenian_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "armenian_stop",
            "armenian_keywords",
            "armenian_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "armenian_analyzer"
      }
    }
  }
}
```
{% include copy-curl.html %}

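The custom Armenian analyzer follows a simple pattern: a stop filter, a `keyword_marker` filter, and a stemmer, all named after the language. The following Python sketch generates the same request body from a language name. It is a convenience sketch under stated assumptions: the helper `custom_language_analyzer` is hypothetical, and it assumes that the `_<language>_` stop word list and the stemmer `language` value both match the name passed in, which holds for `armenian` as shown above but should be verified for any other language. Analyzers that need additional filters, such as the Arabic one, are not covered:

```python
import json


def custom_language_analyzer(language):
    """Assemble the stop / keyword_marker / stemmer pattern for one language.

    Hypothetical helper; assumes the stop word list and stemmer are both
    named after `language`, as they are for "armenian".
    """
    stop = f"{language}_stop"
    keywords = f"{language}_keywords"
    stemmer = f"{language}_stemmer"
    analyzer = f"{language}_analyzer"
    return {
        "settings": {
            "analysis": {
                "filter": {
                    stop: {"type": "stop", "stopwords": f"_{language}_"},
                    stemmer: {"type": "stemmer", "language": language},
                    # Add words here to protect them from stemming.
                    keywords: {"type": "keyword_marker", "keywords": []},
                },
                "analyzer": {
                    analyzer: {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", stop, keywords, stemmer],
                    }
                },
            }
        },
        "mappings": {
            "properties": {"content": {"type": "text", "analyzer": analyzer}}
        },
    }


armenian_body = custom_language_analyzer("armenian")
print(json.dumps(armenian_body, indent=2))
```
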
## Generated tokens

Use the following request to examine the tokens generated using the `stem_exclusion_armenian_analyzer` analyzer, which is defined on the `index_with_stem_exclusion_armenian_analyzer` index:

```json
GET index_with_stem_exclusion_armenian_analyzer/_analyze
{
  "analyzer": "stem_exclusion_armenian_analyzer",
  "text": "բարև բոլորին, մենք խաղաղություն ենք ուզում և նոր օր ենք սկսել"
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
  "tokens": [
    {"token": "բարև","start_offset": 0,"end_offset": 4,"type": "<ALPHANUM>","position": 0},
    {"token": "բոլոր","start_offset": 5,"end_offset": 12,"type": "<ALPHANUM>","position": 1},
    {"token": "խաղաղություն","start_offset": 19,"end_offset": 31,"type": "<ALPHANUM>","position": 3},
    {"token": "ուզ","start_offset": 36,"end_offset": 42,"type": "<ALPHANUM>","position": 5},
    {"token": "նոր","start_offset": 45,"end_offset": 48,"type": "<ALPHANUM>","position": 7},
    {"token": "օր","start_offset": 49,"end_offset": 51,"type": "<ALPHANUM>","position": 8},
    {"token": "սկսել","start_offset": 56,"end_offset": 61,"type": "<ALPHANUM>","position": 10}
  ]
}
```
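
To make the interaction between stop word removal, stem exclusion, and stemming more concrete, the following Python sketch imitates the pipeline for this one sentence. Everything in it is a toy stand-in: the stop word set is inferred from the positions missing in the response, the two suffix rules were chosen only so that the output matches the tokens shown above, and neither reflects the real Armenian stop word list or stemmer. The function name `analyze` is hypothetical:

```python
import re

# Toy stand-ins, chosen only so that this sentence matches the response above.
STOPWORDS = {"մենք", "ենք", "և"}   # inferred from the skipped positions
TOY_SUFFIXES = ("ին", "ում")        # NOT the real Armenian stemming rules


def analyze(text, stem_exclusion=()):
    """Tokenize, lowercase, drop stop words, and stem non-excluded tokens."""
    excluded = set(stem_exclusion)
    result = []
    for position, token in enumerate(re.findall(r"\w+", text.lower())):
        if token in STOPWORDS:
            continue  # removed, but later tokens keep their positions
        if token not in excluded:  # excluded tokens bypass the stemmer
            for suffix in TOY_SUFFIXES:
                if token.endswith(suffix):
                    token = token[: -len(suffix)]
                    break
        result.append((token, position))
    return result


text = "բարև բոլորին, մենք խաղաղություն ենք ուզում և նոր օր ենք սկսել"
print(analyze(text, stem_exclusion=["բարև", "խաղաղություն"]))
# Excluding "բոլորին" instead leaves that word unstemmed at position 1.
print(analyze(text, stem_exclusion=["բոլորին"]))
```

The first call yields the same tokens and positions as the response. The second call shows the effect of `stem_exclusion` within the toy rules: a word in the list skips the stemming step entirely.
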
