@@ -20,53 +20,6 @@ NOTE: If you have considerable memory allocated to your JVM but are receiving ci
[[bucket-categorize-text-agg-syntax]]
==== Parameters

- `field`::
- (Required, string)
- The semi-structured text field to categorize.
-
- `max_unique_tokens`::
- (Optional, integer, default: `50`)
- The maximum number of unique tokens at any position up to `max_matched_tokens`.
- Must be larger than 1. Smaller values use less memory and create fewer categories.
- Larger values will use more memory and create narrower categories.
- Max allowed value is `100`.
-
- `max_matched_tokens`::
- (Optional, integer, default: `5`)
- The maximum number of token positions to match on before attempting to merge categories.
- Larger values will use more memory and create narrower categories.
- Max allowed value is `100`.
-
- Example:
- `max_matched_tokens` of 2 would disallow merging of the categories
- [`foo` `bar` `baz`]
- [`foo` `baz` `bozo`]
- As the first 2 tokens are required to match for the category.
-
- NOTE: Once `max_unique_tokens` is reached at a given position, a new `*` token is
- added and all new tokens at that position are matched by the `*` token.
-
- `similarity_threshold`::
- (Optional, integer, default: `50`)
- The minimum percentage of tokens that must match for text to be added to the
- category bucket.
- Must be between 1 and 100. The larger the value the narrower the categories.
- Larger values will increase memory usage and create narrower categories.
-
- `categorization_filters`::
- (Optional, array of strings)
- This property expects an array of regular expressions. The expressions
- are used to filter out matching sequences from the categorization field values.
- You can use this functionality to fine tune the categorization by excluding
- sequences from consideration when categories are defined. For example, you can
- exclude SQL statements that appear in your log files. This
- property cannot be used at the same time as `categorization_analyzer`. If you
- only want to define simple regular expression filters that are applied prior to
- tokenization, setting this property is the easiest method. If you also want to
- customize the tokenizer or post-tokenization filtering, use the
- `categorization_analyzer` property instead and include the filters as
- `pattern_replace` character filters.
-
`categorization_analyzer`::
(Optional, object or string)
The categorization analyzer specifies how the text is analyzed and tokenized before
@@ -95,14 +48,33 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=tokenizer]
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=filter]
=====

- `shard_size`::
+ `categorization_filters`::
+ (Optional, array of strings)
+ This property expects an array of regular expressions. The expressions
+ are used to filter out matching sequences from the categorization field values.
+ You can use this functionality to fine tune the categorization by excluding
+ sequences from consideration when categories are defined. For example, you can
+ exclude SQL statements that appear in your log files. This
+ property cannot be used at the same time as `categorization_analyzer`. If you
+ only want to define simple regular expression filters that are applied prior to
+ tokenization, setting this property is the easiest method. If you also want to
+ customize the tokenizer or post-tokenization filtering, use the
+ `categorization_analyzer` property instead and include the filters as
+ `pattern_replace` character filters.
+
+ `field`::
+ (Required, string)
+ The semi-structured text field to categorize.
+
+ `max_matched_tokens`::
(Optional, integer)
- The number of categorization buckets to return from each shard before merging
- all the results.
+ This parameter does nothing now, but is permitted for compatibility with the original
+ implementation.

- `size`::
- (Optional, integer, default: `10`)
- The number of buckets to return.
+ `max_unique_tokens`::
+ (Optional, integer)
+ This parameter does nothing now, but is permitted for compatibility with the original
+ implementation.

`min_doc_count`::
(Optional, integer)
@@ -113,8 +85,23 @@ The minimum number of documents for a bucket to be returned to the results.
The minimum number of documents for a bucket to be returned from the shard before
merging.

- ==== Basic use
+ `shard_size`::
+ (Optional, integer)
+ The number of categorization buckets to return from each shard before merging
+ all the results.
+
+ `similarity_threshold`::
+ (Optional, integer, default: `70`)
+ The minimum percentage of token weight that must match for text to be added to the
+ category bucket.
+ Must be between 1 and 100. The larger the value the narrower the categories.
+ Larger values will increase memory usage and create narrower categories.

+ `size`::
+ (Optional, integer, default: `10`)
+ The number of buckets to return.
+
+ ==== Basic use

WARNING: Re-analyzing _large_ result sets will require a lot of time and memory. This aggregation should be
used in conjunction with <<async-search, Async search>>. Additionally, you may consider
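Taken together, the parameters above all sit inside the `categorize_text` object of an ordinary aggregation request. The following minimal sketch runs against the `log-messages` index used by the examples later on this page; the values shown simply restate the documented defaults and are illustrative rather than recommendations:

[source,console]
--------------------------------------------------
POST log-messages/_search?filter_path=aggregations
{
  "aggs": {
    "categories": {
      "categorize_text": {
        "field": "message",          <1>
        "similarity_threshold": 70,  <2>
        "size": 10                   <3>
      }
    }
  }
}
--------------------------------------------------
<1> The semi-structured text field to categorize.
<2> Restates the documented default; larger values produce narrower categories.
<3> Return at most 10 category buckets after merging the shard results.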
@@ -223,11 +210,15 @@ category results
--------------------------------------------------

Here is an example using `categorization_filters`.
- The default analyzer is a whitespace analyzer with a custom token filter
- which filters out tokens that start with any number.
+ The default analyzer uses the `ml_standard` tokenizer which is similar to a whitespace tokenizer
+ but filters out tokens that could be interpreted as hexadecimal numbers. The default analyzer
+ also uses the `first_line_with_letters` character filter, so that only the first meaningful line
+ of multi-line messages is considered.
But, it may be that a token is a known highly-variable token (formatted usernames, emails, etc.). In that case, it is good to supply
- custom `categorization_filters` to filter out those tokens for better categories. These filters will also reduce memory usage as fewer
- tokens are held in memory for the categories.
+ custom `categorization_filters` to filter out those tokens for better categories. These filters may also reduce memory usage as fewer
+ tokens are held in memory for the categories. (If there are sufficient examples of different usernames, emails, etc., then
+ categories will form that naturally discard them as variables, but for small input data where only one example exists this won't
+ happen.)

[source,console]
--------------------------------------------------
@@ -238,8 +229,7 @@ POST log-messages/_search?filter_path=aggregations
      "categorize_text": {
        "field": "message",
        "categorization_filters": ["\\w+\\_\\d{3}"], <1>
-       "max_matched_tokens": 2, <2>
-       "similarity_threshold": 30 <3>
+       "similarity_threshold": 30 <2>
      }
    }
  }
@@ -248,12 +238,12 @@ POST log-messages/_search?filter_path=aggregations
// TEST[setup:categorize_text]
<1> The filters to apply to the analyzed tokens. It filters
out tokens like `bar_123`.
- <2> Require at least 2 tokens before the log categories attempt to merge together
- <3> Require 30% of the tokens to match before expanding a log categories
- to add a new log entry
+ <2> Require 30% of token weight to match before adding a message to an
+ existing category rather than creating a new one.

- The resulting categories are now broad, matching the first token
- and merging the log groups.
+ The resulting categories are now very broad, merging the log groups.
+ (A `similarity_threshold` of 30% is generally too low. Settings over
+ 50% are usually better.)

[source,console-result]
--------------------------------------------------
@@ -263,11 +253,11 @@ and merging the log groups.
      "buckets" : [
        {
          "doc_count" : 4,
-         "key" : "Node * "
+         "key" : "Node"
        },
        {
          "doc_count" : 2,
-         "key" : "User * "
+         "key" : "User"
        }
      ]
    }
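The `categorization_filters` description above notes that the same regular expressions can instead be supplied as `pattern_replace` character filters inside `categorization_analyzer` when the tokenizer or post-tokenization filtering also needs customizing. A rough sketch of that shape, reusing the `\\w+\\_\\d{3}` pattern and the `ml_standard` tokenizer mentioned in this change; the analyzer settings here are an illustrative assumption, not an example taken from the documentation:

[source,console]
--------------------------------------------------
POST log-messages/_search?filter_path=aggregations
{
  "aggs": {
    "categories": {
      "categorize_text": {
        "field": "message",
        "categorization_analyzer": {
          "char_filter": [
            {
              "type": "pattern_replace",
              "pattern": "\\w+\\_\\d{3}",  <1>
              "replacement": ""
            }
          ],
          "tokenizer": "ml_standard"       <2>
        }
      }
    }
  }
}
--------------------------------------------------
<1> The same username-style pattern used with `categorization_filters` above, stripped from the text before tokenization.
<2> Specifying `categorization_analyzer` replaces the default analyzer, so the tokenizer is named explicitly here.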