---
# Copyright Vespa.ai. All rights reserved.
title: "Lucene Linguistics"
---
<p>
Lucene Linguistics is a custom <a href="linguistics.html">linguistics</a> implementation built on top of the
<a href="https://lucene.apache.org">Apache Lucene</a> library.
It lets you provide a Lucene analyzer to handle text processing for a language,
with an optional variation per <a href="https://github.com/vespa-engine/vespa/blob/master/linguistics/src/main/java/com/yahoo/language/process/StemMode.java">stemming mode</a>.
</p>
<p>
Check <a href="https://github.com/vespa-engine/sample-apps/tree/master/examples/lucene-linguistics">sample apps</a> to
get started.
</p>
<h2 id="crash-course-on-lucene-text-analysis">Crash course on Lucene text analysis</h2>
<p>
Lucene <a href="https://lucene.apache.org/core/9_8_0/core/org/apache/lucene/analysis/package-summary.html">text
analysis</a>
is the process of converting text into searchable tokens.
It consists of a series of components applied to the text in order:
</p>
<ul>
<li>
<a href="https://lucene.apache.org/core/9_8_0/core/org/apache/lucene/analysis/CharFilter.html">CharFilters</a>:
transform the text before it is tokenized, while providing corrected character offsets to account for these
modifications.
</li>
<li><a href="https://lucene.apache.org/core/9_8_0/core/org/apache/lucene/analysis/Tokenizer.html">Tokenizers</a>:
responsible for breaking up incoming text into tokens.
</li>
<li>
<a href="https://lucene.apache.org/core/9_8_0/core/org/apache/lucene/analysis/TokenFilter.html">TokenFilters</a>:
responsible for modifying tokens that have been created by the Tokenizer.
</li>
</ul>
<p>
A specific configuration of the above components is wrapped into an
<a href="https://lucene.apache.org/core/9_8_0/core/org/apache/lucene/analysis/Analyzer.html">Analyzer</a> object.
</p>
<p>
The text analysis works as follows:
</p>
<ol>
<li>All char filters are applied in the specified order to the entire text string.</li>
<li>The tokenizer splits the filtered text into tokens.</li>
<li>Token filters are applied in the specified order to each token.</li>
</ol>
<h2 id="defaults-language-analysis">Default language analysis</h2>
<p>
Out of the box, Lucene Linguistics exposes the analysis components provided
by the <a href="https://lucene.apache.org/core/9_8_0/core/index.html">lucene-core</a>
and
<a href="https://lucene.apache.org/core/9_8_0/analysis/common/index.html">lucene-analysis-common</a>
libraries.
Other libraries with Lucene text analysis components
(e.g. <a href="https://lucene.apache.org/core/9_8_0/analysis/kuromoji/index.html">analysis-kuromoji</a>)
can be added to the application package as a Maven dependency.
</p>
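<p>
As a sketch of what such a dependency could look like in the application package's <code>pom.xml</code>,
the fragment below adds the Kuromoji analysis library.
The version is an assumption taken from the Lucene 9.8.0 links on this page;
it should match the Lucene version used by your Vespa release.
</p>

```xml
<!-- Hypothetical pom.xml fragment. The version is an assumption:
     align it with the Lucene version of your Vespa release. -->
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-analysis-kuromoji</artifactId>
  <version>9.8.0</version>
</dependency>
```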
<p>
Lucene Linguistics out-of-the-box provides configured analyzers for 40 languages:
</p>
<ul>
<li>Arabic</li>
<li>Armenian</li>
<li>Basque</li>
<li>Bengali</li>
<li>Bulgarian</li>
<li>Catalan</li>
<li>Chinese</li>
<li>Czech</li>
<li>Danish</li>
<li>Dutch</li>
<li>English</li>
<li>Estonian</li>
<li>Finnish</li>
<li>French</li>
<li>Galician</li>
<li>German</li>
<li>Greek</li>
<li>Hindi</li>
<li>Hungarian</li>
<li>Indonesian</li>
<li>Irish</li>
<li>Italian</li>
<li>Japanese</li>
<li>Korean</li>
<li>Kurdish</li>
<li>Latvian</li>
<li>Lithuanian</li>
<li>Nepali</li>
<li>Norwegian</li>
<li>Persian</li>
<li>Portuguese</li>
<li>Romanian</li>
<li>Russian</li>
<li>Serbian</li>
<li>Spanish</li>
<li>Swedish</li>
<li>Tamil</li>
<li>Telugu</li>
<li>Thai</li>
<li>Turkish</li>
</ul>
<p>
The Lucene
<a href="https://lucene.apache.org/core/9_8_0/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html">StandardAnalyzer</a>
is used for languages that have neither a custom nor a default analyzer.
</p>
<h2 id="linguistics-key">Linguistics key</h2>
<p>
Linguistics keys identify a configuration of text analysis.
A key has 2 parts: a mandatory <a href="https://github.com/vespa-engine/vespa/blob/master/linguistics/src/main/java/com/yahoo/language/Language.java">
language code</a> and an optional stemming mode.
The format is <code>LANGUAGE_CODE[/STEM_MODE]</code>.
There are 5 stemming modes: <code>NONE, DEFAULT, ALL, SHORTEST, BEST</code> (they can be specified in the <a href="reference/schema-reference.html#stemming">field schema</a>).
</p>
<p>Examples of linguistics keys:</p>
<ul>
<li>
<code>en</code>: English language.
</li>
<li>
<code>en/BEST</code>: English language with the <code>BEST</code> stemming mode.
</li>
</ul>
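<p>
A minimal sketch of how two keys for the same language might appear in the <code>analysis</code> map of the
<code>LuceneLinguistics</code> component configuration described under
<a href="#custom-analysis">Customizing text analysis</a>.
The choice of token filters is illustrative only, not a recommendation;
<code>lowercase</code>, <code>englishMinimalStem</code> and <code>porterStem</code> are filters from lucene-analysis-common.
</p>

```xml
<analysis>
  <!-- Key without a stemming mode: English -->
  <item key="en">
    <tokenizer>
      <name>standard</name>
    </tokenizer>
    <tokenFilters>
      <item>
        <name>lowercase</name>
      </item>
      <item>
        <name>englishMinimalStem</name>
      </item>
    </tokenFilters>
  </item>
  <!-- Key with a stemming mode: English with the BEST stemming mode -->
  <item key="en/BEST">
    <tokenizer>
      <name>standard</name>
    </tokenizer>
    <tokenFilters>
      <item>
        <name>lowercase</name>
      </item>
      <item>
        <name>porterStem</name>
      </item>
    </tokenFilters>
  </item>
</analysis>
```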
<h2 id="custom-analysis">Customizing text analysis</h2>
<p>
Lucene Linguistics provides multiple ways to customize the text analysis per language:
</p>
<ul>
<li>
<code>LuceneLinguistics</code> component configuration in the <code>services.xml</code>
</li>
<li>
<code>ComponentsRegistry</code>
</li>
</ul>
<h3 id="lucene-linguistics-configuration">LuceneLinguistics component configuration</h3>
<p>
In <code>services.xml</code>, an analyzer can be constructed from any of the text analysis components
available on the classpath by providing
<a href="https://github.com/vespa-engine/vespa/blob/master/lucene-linguistics/src/main/resources/configdefinitions/lucene-analysis.def">configuration for the</a>
<code>LuceneLinguistics</code> component.
Example for the English language:
</p>
<pre>
<component id="linguistics"
           class="com.yahoo.language.lucene.LuceneLinguistics"
           bundle="your-bundle-name">
  <config name="com.yahoo.language.lucene.lucene-analysis">
    <configDir>lucene-linguistics</configDir>
    <analysis>
      <item key="en">
        <tokenizer>
          <name>standard</name>
        </tokenizer>
        <tokenFilters>
          <item>
            <name>stop</name>
            <conf>
              <item key="words">en/stopwords.txt</item>
              <item key="ignoreCase">true</item>
            </conf>
          </item>
          <item>
            <name>englishMinimalStem</name>
          </item>
        </tokenFilters>
      </item>
    </analysis>
  </config>
</component>
</pre>
<p>Notes:</p>
<ul>
<li>
The <code>key</code> value in <code>item key="en"</code> is a <a href="#linguistics-key">linguistics key</a>.
</li>
<li>
The <code>en/stopwords.txt</code> file must be placed in your application package under
the <code>lucene-linguistics</code> directory.
</li>
<li>
If <code>configDir</code> is not provided, the files must be on the classpath.
</li>
</ul>
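<p>
The example above configures a tokenizer and token filters only.
Assuming the linked config definition also exposes a <code>charFilters</code> array with the same
<code>name</code>/<code>conf</code> shape as <code>tokenFilters</code>,
a char filter could be added to the same <code>item</code> as sketched below.
<code>htmlStrip</code> is the lucene-analysis-common char filter that removes HTML markup before tokenization.
</p>

```xml
<item key="en">
  <!-- Assumption: charFilters mirrors tokenFilters in the config definition -->
  <charFilters>
    <item>
      <name>htmlStrip</name>
    </item>
  </charFilters>
  <tokenizer>
    <name>standard</name>
  </tokenizer>
  <tokenFilters>
    <item>
      <name>lowercase</name>
    </item>
  </tokenFilters>
</item>
```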
<h3 id="components-registry">Components registry</h3>
<p>
The <a href="jdisc/injecting-components.html#depending-on-all-components-of-a-specific-type">ComponentsRegistry</a>
mechanism can be used to set a Lucene Analyzer for a language.
</p>
<p>
Example that registers an analyzer for the English language:
</p>
<pre>
<component
id="en"
class="org.apache.lucene.analysis.core.SimpleAnalyzer"
bundle="your-bundle-name" />
</pre>
<p>
Where:
</p>
<ul>
<li>
<code>id</code> must be a <a href="#linguistics-key">linguistics key</a>;
</li>
<li>
<code>class</code> is the implementation class that extends the <code>Analyzer</code> class;
</li>
<li>
<code>bundle</code> is the name of the application package bundle as specified in the <code>pom.xml</code>
(or any bundle in the <code>components</code> directory of your Vespa application package that contains the class).
</li>
</ul>
<p>
For this to work, the class must provide <b>only</b> a constructor without arguments.
</p>
<p>
If your analyzer class needs initialization, wrap the analyzer in a class
that implements <code>Provider<Analyzer></code>.
</p>
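<p>
A sketch of how such a wrapper might be registered, assuming a hypothetical class
<code>com.example.EnglishAnalyzerProvider</code> in your bundle that implements the provider interface
and returns the initialized analyzer from its <code>get()</code> method.
The <code>id</code> is still the linguistics key.
</p>

```xml
<!-- Hypothetical provider class: com.example.EnglishAnalyzerProvider is not part of Vespa or Lucene -->
<component
  id="en"
  class="com.example.EnglishAnalyzerProvider"
  bundle="your-bundle-name" />
```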
<h3 id="adding-custom-analysis-component">Custom text analysis components</h3>
<p>
The text analysis components are loaded via the Java Service Provider Interface (<a
href="https://www.baeldung.com/java-spi" data-proofer-ignore>SPI</a>).
</p>
<p>
To use an external library that already registers its components this way, add the
library to the application package as a Maven dependency.
</p>
<p>
To create a custom component:
</p>
<ol>
<li>Implement the component in a Java class.</li>
<li>Register the component class in the matching service file on the classpath,
e.g. <code>META-INF/services/org.apache.lucene.analysis.TokenFilterFactory</code> for a custom token filter.
</li>
</ol>
<h2 id="language-detection">Language detection</h2>
<p>
Lucene Linguistics doesn't provide language detection.
This means that for both feeding and searching you should provide a
<a href="reference/query-api-reference.html#model.language">language parameter</a>.
</p>