[[controlling-stemming]]
=== Controlling Stemming

Out-of-the-box stemming solutions are never perfect.((("stemming words", "controlling stemming"))) Algorithmic stemmers,
especially, will blithely apply their rules to any words they encounter,
perhaps conflating words that you would prefer to keep separate. Maybe, for
your use case, it is important to keep `skies` and `skiing` as distinct words
rather than stemming them both down to `ski` (as would happen with the
`english` analyzer).

The {ref}/analysis-keyword-marker-tokenfilter.html[`keyword_marker`] and
{ref}/analysis-stemmer-override-tokenfilter.html[`stemmer_override`] token filters((("stemmer_override token filter")))((("keyword_marker token filter")))
allow us to customize the stemming process.

[[preventing-stemming]]
==== Preventing Stemming

The <<stem-exclusion,`stem_exclusion`>> parameter for language analyzers (see
<<configuring-language-analyzers>>) allowed ((("stemming words", "controlling stemming", "preventing stemming")))us to specify a list of words that
should not be stemmed. Internally, these language analyzers use the
{ref}/analysis-keyword-marker-tokenfilter.html[`keyword_marker` token filter]
to mark the listed words as _keywords_, which prevents subsequent stemming
token filters from touching those words.((("keyword_marker token filter", "preventing stemming of certain words")))

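As a reminder of that earlier approach, a language analyzer configured with `stem_exclusion` might look like the following sketch (the index name and word list here are illustrative, not part of this chapter's running example):

[source,json]
------------------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type":           "english",
          "stem_exclusion": [ "skies", "skiing" ]
        }
      }
    }
  }
}
------------------------------------------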
For instance, we can create a simple custom analyzer that uses the
{ref}/analysis-porterstem-tokenfilter.html[`porter_stem`] token filter,
but prevents the word `skies` from((("porter_stem token filter"))) being stemmed:

[source,json]
------------------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "no_stem": {
          "type": "keyword_marker",
          "keywords": [ "skies" ] <1>
        }
      },
      "analyzer": {
        "my_english": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "no_stem",
            "porter_stem"
          ]
        }
      }
    }
  }
}
------------------------------------------
<1> The `keywords` parameter accepts multiple words.

Testing it with the `analyze` API shows that just the word `skies` has
been excluded from stemming:

[source,json]
------------------------------------------
GET /my_index/_analyze?analyzer=my_english
sky skies skiing skis <1>
------------------------------------------
<1> Returns: `sky`, `skies`, `ski`, `ski`

[[keyword-path]]

[TIP]
==========================================

While the language analyzers allow ((("language analyzers", "stem_exclusion parameter")))us only to specify an array of words in the
`stem_exclusion` parameter, the `keyword_marker` token filter also accepts a
`keywords_path` parameter that allows us to store all of our keywords in a
file. ((("keyword_marker token filter", "keywords_path parameter")))The file should contain one word per line, and must be present on every
node in the cluster. See <<updating-stopwords>> for tips on how to update this
file.

==========================================

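A sketch of the file-based variant described in the tip above (the filename `stemming/keywords.txt` is hypothetical; paths are resolved relative to the Elasticsearch `config` directory):

[source,json]
------------------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "no_stem": {
          "type":          "keyword_marker",
          "keywords_path": "stemming/keywords.txt" <1>
        }
      }
    }
  }
}
------------------------------------------
<1> A plain-text file with one word per line, present on every node.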
[[customizing-stemming]]
==== Customizing Stemming

In the preceding example, we prevented `skies` from being stemmed, but perhaps we
would prefer it to be stemmed to `sky` instead.((("stemming words", "controlling stemming", "customizing stemming"))) The
{ref}/analysis-stemmer-override-tokenfilter.html[`stemmer_override`] token
filter allows us ((("stemmer_override token filter")))to specify our own custom stemming rules. At the same time,
we can handle some irregular forms like stemming `mice` to `mouse` and `feet`
to `foot`:

[source,json]
------------------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "custom_stem": {
          "type": "stemmer_override",
          "rules": [ <1>
            "skies=>sky",
            "mice=>mouse",
            "feet=>foot"
          ]
        }
      },
      "analyzer": {
        "my_english": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "custom_stem", <2>
            "porter_stem"
          ]
        }
      }
    }
  }
}

GET /my_index/_analyze?analyzer=my_english
The mice came down from the skies and ran over my feet <3>
------------------------------------------
<1> Rules take the form `original=>stem`.
<2> The `stemmer_override` filter must be placed before the stemmer.
<3> Returns `the`, `mouse`, `came`, `down`, `from`, `the`, `sky`,
    `and`, `ran`, `over`, `my`, `foot`.

TIP: Just as for the `keyword_marker` token filter, rules can be stored
in a file whose location should be specified with the `rules_path`
parameter.
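The file-based form of the `stemmer_override` rules might be sketched as follows (the filename `stemming/rules.txt` is hypothetical; the file would contain one `original=>stem` rule per line, such as `mice=>mouse`):

[source,json]
------------------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "custom_stem": {
          "type":       "stemmer_override",
          "rules_path": "stemming/rules.txt"
        }
      }
    }
  }
}
------------------------------------------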