Commit 010db38

Author: David Roberts
[ML] Replace the implementation of the categorize_text aggregation
This replaces the implementation of the `categorize_text` aggregation with the new algorithm that was added in elastic#80867. The new algorithm works in the same way as the ML C++ code used for categorization jobs. The docs are updated to reflect the workings of the new implementation.
1 parent fede927 commit 010db38

File tree

42 files changed: +367 -3326 lines


docs/reference/aggregations/bucket/categorize-text-aggregation.asciidoc

Lines changed: 57 additions & 67 deletions
@@ -20,53 +20,6 @@ NOTE: If you have considerable memory allocated to your JVM but are receiving ci
 [[bucket-categorize-text-agg-syntax]]
 ==== Parameters
 
-`field`::
-(Required, string)
-The semi-structured text field to categorize.
-
-`max_unique_tokens`::
-(Optional, integer, default: `50`)
-The maximum number of unique tokens at any position up to `max_matched_tokens`.
-Must be larger than 1. Smaller values use less memory and create fewer categories.
-Larger values will use more memory and create narrower categories.
-Max allowed value is `100`.
-
-`max_matched_tokens`::
-(Optional, integer, default: `5`)
-The maximum number of token positions to match on before attempting to merge categories.
-Larger values will use more memory and create narrower categories.
-Max allowed value is `100`.
-
-Example:
-`max_matched_tokens` of 2 would disallow merging of the categories
-[`foo` `bar` `baz`]
-[`foo` `baz` `bozo`]
-As the first 2 tokens are required to match for the category.
-
-NOTE: Once `max_unique_tokens` is reached at a given position, a new `*` token is
-added and all new tokens at that position are matched by the `*` token.
-
-`similarity_threshold`::
-(Optional, integer, default: `50`)
-The minimum percentage of tokens that must match for text to be added to the
-category bucket.
-Must be between 1 and 100. The larger the value the narrower the categories.
-Larger values will increase memory usage and create narrower categories.
-
-`categorization_filters`::
-(Optional, array of strings)
-This property expects an array of regular expressions. The expressions
-are used to filter out matching sequences from the categorization field values.
-You can use this functionality to fine tune the categorization by excluding
-sequences from consideration when categories are defined. For example, you can
-exclude SQL statements that appear in your log files. This
-property cannot be used at the same time as `categorization_analyzer`. If you
-only want to define simple regular expression filters that are applied prior to
-tokenization, setting this property is the easiest method. If you also want to
-customize the tokenizer or post-tokenization filtering, use the
-`categorization_analyzer` property instead and include the filters as
-`pattern_replace` character filters.
-
 `categorization_analyzer`::
 (Optional, object or string)
 The categorization analyzer specifies how the text is analyzed and tokenized before
@@ -95,14 +48,33 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=tokenizer]
 include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=filter]
 =====
 
-`shard_size`::
+`categorization_filters`::
+(Optional, array of strings)
+This property expects an array of regular expressions. The expressions
+are used to filter out matching sequences from the categorization field values.
+You can use this functionality to fine tune the categorization by excluding
+sequences from consideration when categories are defined. For example, you can
+exclude SQL statements that appear in your log files. This
+property cannot be used at the same time as `categorization_analyzer`. If you
+only want to define simple regular expression filters that are applied prior to
+tokenization, setting this property is the easiest method. If you also want to
+customize the tokenizer or post-tokenization filtering, use the
+`categorization_analyzer` property instead and include the filters as
+`pattern_replace` character filters.
+
+`field`::
+(Required, string)
+The semi-structured text field to categorize.
+
+`max_matched_tokens`::
 (Optional, integer)
-The number of categorization buckets to return from each shard before merging
-all the results.
+This parameter does nothing now, but is permitted for compatibility with the original
+implementation.
 
-`size`::
-(Optional, integer, default: `10`)
-The number of buckets to return.
+`max_unique_tokens`::
+(Optional, integer)
+This parameter does nothing now, but is permitted for compatibility with the original
+implementation.
 
 `min_doc_count`::
 (Optional, integer)
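As a side note on the `categorization_filters` parameter documented in this hunk: the filters behave like `pattern_replace` character filters applied before tokenization. The sketch below is illustrative only (the class and method names are hypothetical, not Elasticsearch API), showing the effect of stripping regex matches from a message before it is categorized.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: mimics how categorization_filters strip matching
// sequences from the field value before tokenization, in the manner of
// pattern_replace character filters.
public class CategorizationFiltersSketch {

    static String applyFilters(String text, List<Pattern> filters) {
        for (Pattern filter : filters) {
            // Each matching sequence is removed before the text is tokenized
            text = filter.matcher(text).replaceAll("");
        }
        return text;
    }

    public static void main(String[] args) {
        // Same style of filter as the docs example: strips tokens like "user_123"
        List<Pattern> filters = List.of(Pattern.compile("\\w+_\\d{3}"));
        System.out.println(applyFilters("User user_123 logged in", filters));
    }
}
```

With the variable token removed, messages that differ only in that token collapse into one category.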
@@ -113,8 +85,23 @@ The minimum number of documents for a bucket to be returned to the results.
 The minimum number of documents for a bucket to be returned from the shard before
 merging.
 
-==== Basic use
+`shard_size`::
+(Optional, integer)
+The number of categorization buckets to return from each shard before merging
+all the results.
+
+`similarity_threshold`::
+(Optional, integer, default: `70`)
+The minimum percentage of token weight that must match for text to be added to the
+category bucket.
+Must be between 1 and 100. The larger the value the narrower the categories.
+Larger values will increase memory usage and create narrower categories.
 
+`size`::
+(Optional, integer, default: `10`)
+The number of buckets to return.
+
+==== Basic use
 
 WARNING: Re-analyzing _large_ result sets will require a lot of time and memory. This aggregation should be
 used in conjunction with <<async-search, Async search>>. Additionally, you may consider
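To make the `similarity_threshold` semantics in the hunk above concrete, here is a deliberately simplified sketch. It is not the Elasticsearch implementation: the real algorithm assigns different weights to different tokens (for example, dictionary words versus variable tokens), whereas this sketch gives every token a weight of 1.

```java
import java.util.List;

// Hypothetical sketch: percentage of token weight shared by a category and a
// new message, compared position by position. Every token has weight 1 here;
// the real implementation weights tokens unevenly.
public class SimilarityThresholdSketch {

    static int matchingTokenWeightPercent(List<String> category, List<String> message) {
        int common = Math.min(category.size(), message.size());
        int matched = 0;
        for (int i = 0; i < common; i++) {
            if (category.get(i).equals(message.get(i))) {
                matched++;
            }
        }
        // Fraction of the longer token list that matched, as a percentage
        return 100 * matched / Math.max(category.size(), message.size());
    }

    public static void main(String[] args) {
        // "Node node-1 started" vs "Node node-2 started": 2 of 3 positions match
        int percent = matchingTokenWeightPercent(
            List.of("Node", "node-1", "started"),
            List.of("Node", "node-2", "started")
        );
        System.out.println(percent);
    }
}
```

Under this toy weighting, a `similarity_threshold` of 70 would put the second message in a new category, while a threshold of 50 would let it join the existing one.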
@@ -223,11 +210,15 @@ category results
 --------------------------------------------------
 
 Here is an example using `categorization_filters`.
-The default analyzer is a whitespace analyzer with a custom token filter
-which filters out tokens that start with any number.
+The default analyzer uses the `ml_standard` tokenizer which is similar to a whitespace tokenizer
+but filters out tokens that could be interpreted as hexadecimal numbers. The default analyzer
+also uses the `first_line_with_letters` character filter, so that only the first meaningful line
+of multi-line messages is considered.
 But, it may be that a token is a known highly-variable token (formatted usernames, emails, etc.). In that case, it is good to supply
-custom `categorization_filters` to filter out those tokens for better categories. These filters will also reduce memory usage as fewer
-tokens are held in memory for the categories.
+custom `categorization_filters` to filter out those tokens for better categories. These filters may also reduce memory usage as fewer
+tokens are held in memory for the categories. (If there are sufficient examples of different usernames, emails, etc., then
+categories will form that naturally discard them as variables, but for small input data where only one example exists this won't
+happen.)
 
 [source,console]
 --------------------------------------------------
@@ -238,8 +229,7 @@ POST log-messages/_search?filter_path=aggregations
     "categorize_text": {
       "field": "message",
       "categorization_filters": ["\\w+\\_\\d{3}"], <1>
-      "max_matched_tokens": 2, <2>
-      "similarity_threshold": 30 <3>
+      "similarity_threshold": 30 <2>
     }
   }
 }
@@ -248,12 +238,12 @@ POST log-messages/_search?filter_path=aggregations
 // TEST[setup:categorize_text]
 <1> The filters to apply to the analyzed tokens. It filters
 out tokens like `bar_123`.
-<2> Require at least 2 tokens before the log categories attempt to merge together
-<3> Require 30% of the tokens to match before expanding a log categories
-to add a new log entry
+<2> Require 30% of token weight to match before adding a message to an
+existing category rather than creating a new one.
 
-The resulting categories are now broad, matching the first token
-and merging the log groups.
+The resulting categories are now very broad, merging the log groups.
+(A `similarity_threshold` of 30% is generally too low. Settings over
+50% are usually better.)
 
 [source,console-result]
 --------------------------------------------------
@@ -263,11 +253,11 @@ and merging the log groups.
       "buckets" : [
         {
           "doc_count" : 4,
-          "key" : "Node *"
+          "key" : "Node"
        },
        {
          "doc_count" : 2,
-          "key" : "User *"
+          "key" : "User"
        }
      ]
    }
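One change in the docs above is that the default analyzer now uses the `ml_standard` tokenizer, which drops tokens that could be interpreted as hexadecimal numbers. The sketch below is a rough, hypothetical approximation of that idea only (the regex and length cutoff are invented, not the actual tokenizer rules): hex-looking tokens are usually variable values such as addresses or hashes, so they carry no categorical signal.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical sketch: drop tokens that look like hexadecimal numbers, keeping
// only the structural words of a log message. The real ml_standard tokenizer
// rules differ; this is just the underlying idea.
public class HexTokenSketch {

    private static final Pattern HEX_LIKE = Pattern.compile("(0x)?[0-9a-fA-F]{4,}");

    static List<String> keepStructuralTokens(String message) {
        return Arrays.stream(message.split("\\s+"))
            .filter(token -> HEX_LIKE.matcher(token).matches() == false)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "deadbeef" looks like a hex value and is dropped; the words are kept
        System.out.println(keepStructuralTokens("segfault at address deadbeef"));
    }
}
```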

x-pack/plugin/ml/src/internalClusterTest/java/org/elasticsearch/xpack/ml/integration/CategorizationAggregationIT.java renamed to x-pack/plugin/ml/src/internalClusterTest/java/org/elasticsearch/xpack/ml/integration/CategorizeTextAggregationIT.java

Lines changed: 4 additions & 4 deletions
@@ -27,7 +27,7 @@
 import static org.hamcrest.Matchers.not;
 import static org.hamcrest.Matchers.notANumber;
 
-public class CategorizationAggregationIT extends BaseMlIntegTestCase {
+public class CategorizeTextAggregationIT extends BaseMlIntegTestCase {
 
     private static final String DATA_INDEX = "categorization-agg-data";
 
@@ -77,17 +77,17 @@ public void testAggregationWithBroadCategories() {
             .setSize(0)
             .setTrackTotalHits(false)
             .addAggregation(
+                // Overriding the similarity threshold to just 11% (default is 70%) results in the
+                // "Node started" and "Node stopped" messages being grouped in the same category
                 new CategorizeTextAggregationBuilder("categorize", "msg").setSimilarityThreshold(11)
-                    .setMaxUniqueTokens(2)
-                    .setMaxMatchedTokens(1)
                     .subAggregation(AggregationBuilders.max("max").field("time"))
                     .subAggregation(AggregationBuilders.min("min").field("time"))
             )
             .get();
         InternalCategorizationAggregation agg = response.getAggregations().get("categorize");
         assertThat(agg.getBuckets(), hasSize(2));
 
-        assertCategorizationBucket(agg.getBuckets().get(0), "Node *", 4);
+        assertCategorizationBucket(agg.getBuckets().get(0), "Node", 4);
         assertCategorizationBucket(agg.getBuckets().get(1), "Failed to shutdown error org.aaaa.bbbb.Cccc line caused by foo exception", 2);
     }

x-pack/plugin/ml/src/internalClusterTest/java/org/elasticsearch/xpack/ml/integration/CategorizeTextDistributedIT.java

Lines changed: 2 additions & 2 deletions
@@ -16,8 +16,8 @@
 import org.elasticsearch.cluster.metadata.IndexMetadata;
 import org.elasticsearch.cluster.routing.ShardRouting;
 import org.elasticsearch.common.settings.Settings;
-import org.elasticsearch.xpack.ml.aggs.categorization2.CategorizeTextAggregationBuilder;
-import org.elasticsearch.xpack.ml.aggs.categorization2.InternalCategorizationAggregation;
+import org.elasticsearch.xpack.ml.aggs.categorization.CategorizeTextAggregationBuilder;
+import org.elasticsearch.xpack.ml.aggs.categorization.InternalCategorizationAggregation;
 import org.elasticsearch.xpack.ml.support.BaseMlIntegTestCase;
 
 import java.util.Arrays;

x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/MachineLearning.java

Lines changed: 1 addition & 10 deletions
@@ -1417,16 +1417,7 @@ public List<AggregationSpec> getAggregations() {
                 CategorizeTextAggregationBuilder::new,
                 CategorizeTextAggregationBuilder.PARSER
             ).addResultReader(InternalCategorizationAggregation::new)
-                .setAggregatorRegistrar(s -> s.registerUsage(CategorizeTextAggregationBuilder.NAME)),
-            // TODO: in the long term only keep one or other of these categorization aggregations
-            new AggregationSpec(
-                org.elasticsearch.xpack.ml.aggs.categorization2.CategorizeTextAggregationBuilder.NAME,
-                org.elasticsearch.xpack.ml.aggs.categorization2.CategorizeTextAggregationBuilder::new,
-                org.elasticsearch.xpack.ml.aggs.categorization2.CategorizeTextAggregationBuilder.PARSER
-            ).addResultReader(org.elasticsearch.xpack.ml.aggs.categorization2.InternalCategorizationAggregation::new)
-                .setAggregatorRegistrar(
-                    s -> s.registerUsage(org.elasticsearch.xpack.ml.aggs.categorization2.CategorizeTextAggregationBuilder.NAME)
-                )
+                .setAggregatorRegistrar(s -> s.registerUsage(CategorizeTextAggregationBuilder.NAME))
         );
     }

x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/aggs/categorization/CategorizationBytesRefHash.java

Lines changed: 13 additions & 27 deletions
@@ -15,14 +15,6 @@
 
 class CategorizationBytesRefHash implements Releasable {
 
-    /**
-     * Our special wild card value.
-     */
-    static final BytesRef WILD_CARD_REF = new BytesRef("*");
-    /**
-     * For all WILD_CARD references, the token ID is always -1
-     */
-    static final int WILD_CARD_ID = -1;
     private final BytesRefHash bytesRefHash;
 
     CategorizationBytesRefHash(BytesRefHash bytesRefHash) {
@@ -46,34 +38,28 @@ BytesRef[] getDeeps(int[] ids) {
     }
 
     BytesRef getDeep(long id) {
-        if (id == WILD_CARD_ID) {
-            return WILD_CARD_REF;
-        }
         BytesRef shallow = bytesRefHash.get(id, new BytesRef());
         return BytesRef.deepCopyOf(shallow);
     }
 
     int put(BytesRef bytesRef) {
-        if (WILD_CARD_REF.equals(bytesRef)) {
-            return WILD_CARD_ID;
-        }
         long hash = bytesRefHash.add(bytesRef);
         if (hash < 0) {
+            // BytesRefHash returns -1 - hash if the entry already existed, but we just want to return the hash
             return (int) (-1L - hash);
-        } else {
-            if (hash > Integer.MAX_VALUE) {
-                throw new AggregationExecutionException(
-                    LoggerMessageFormat.format(
-                        "more than [{}] unique terms encountered. "
-                            + "Consider restricting the documents queried or adding [{}] in the {} configuration",
-                        Integer.MAX_VALUE,
-                        CategorizeTextAggregationBuilder.CATEGORIZATION_FILTERS.getPreferredName(),
-                        CategorizeTextAggregationBuilder.NAME
-                    )
-                );
-            }
-            return (int) hash;
         }
+        if (hash > Integer.MAX_VALUE) {
+            throw new AggregationExecutionException(
+                LoggerMessageFormat.format(
+                    "more than [{}] unique terms encountered. "
+                        + "Consider restricting the documents queried or adding [{}] in the {} configuration",
+                    Integer.MAX_VALUE,
+                    CategorizeTextAggregationBuilder.CATEGORIZATION_FILTERS.getPreferredName(),
+                    CategorizeTextAggregationBuilder.NAME
+                )
+            );
+        }
+        return (int) hash;
     }
 
     @Override
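The comment added in this hunk notes that `BytesRefHash.add` returns `-1 - hash` when the entry already existed. That single-return-value convention can be sketched in isolation (the class and method here are hypothetical, for illustration only):

```java
// Hypothetical sketch of the "-1 - hash" convention: an add() that returns the
// new id for a fresh key, and -1 - existingId when the key was already present,
// signals both outcomes in one long without a separate lookup.
public class HashAddConventionSketch {

    // Recover the id regardless of whether the key was newly added or already present.
    static long idOf(long addResult) {
        return addResult < 0 ? -1L - addResult : addResult;
    }

    public static void main(String[] args) {
        System.out.println(idOf(7L));       // newly inserted with id 7
        System.out.println(idOf(-1L - 7L)); // already existed; the same id 7 is recovered
    }
}
```

Because valid ids are non-negative, `-1 - id` is always negative, so the sign of the return value distinguishes "inserted" from "already present" while still carrying the id.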

x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/aggs/categorization2/CategorizationPartOfSpeechDictionary.java renamed to x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/aggs/categorization/CategorizationPartOfSpeechDictionary.java

Lines changed: 2 additions & 2 deletions
@@ -5,7 +5,7 @@
  * 2.0.
  */
 
-package org.elasticsearch.xpack.ml.aggs.categorization2;
+package org.elasticsearch.xpack.ml.aggs.categorization;
 
 import java.io.BufferedReader;
 import java.io.IOException;
@@ -25,7 +25,7 @@
  */
 public class CategorizationPartOfSpeechDictionary {
 
-    static final String DICTIONARY_FILE_PATH = "/org/elasticsearch/xpack/ml/aggs/categorization2/ml-en.dict";
+    static final String DICTIONARY_FILE_PATH = "/org/elasticsearch/xpack/ml/aggs/categorization/ml-en.dict";
 
     static final String PART_OF_SPEECH_SEPARATOR = "@";
