[ML] Truncate categorization fields #89827

edsavage · 2022-09-06T16:04:20Z

Truncate the raw categorization field passed to the backend at 1001 characters. This prevents excessively long messages being sent to the C++ backend that will only be truncated there anyway.

Truncate the raw categorization field passed to the backend at 1001 characters.

elasticsearchmachine · 2022-09-06T16:04:46Z

Pinging @elastic/ml-core (Team:ML)

droberts195 · 2022-09-06T16:16:47Z

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ml/job/config/AnalysisConfig.java

@@ -336,6 +338,13 @@ public List<String> fields() {
        return collectNonNullAndNonEmptyDetectorFields(Detector::getFieldName);
    }

+    public String maybeTruncateCatgeorizationField(String categorizationField) {
+        if (termFields().contains(categorizationFieldName) == false) {


This is really inefficient because termFields() is building a Set from scratch each time it's called, and then that Set becomes garbage after just one lookup.

It needs to be changed to build the set once and cache it. That probably means moving this logic into the AbstractDataToProcessWriter class, which could cache the term fields on construction.

droberts195 · 2022-09-06T16:20:32Z

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ml/job/config/AnalysisConfig.java

@@ -69,6 +69,8 @@ public class AnalysisConfig implements ToXContentObject, Writeable {
    public static final String ML_CATEGORY_FIELD = "mlcategory";
    public static final Set<String> AUTO_CREATED_FIELDS = new HashSet<>(Collections.singletonList(ML_CATEGORY_FIELD));

+    public static final int MAX_CATEGORIZATION_FIELD_LENGTH = 1001;


Please add a comment here saying why this is 1001.

Say that the C++ truncates at length 1000 (which is model::CCategoryExamplesCollector::MAX_EXAMPLE_LENGTH), and adds an ellipsis on truncation, hence we need to send more than that so this logic still works.

And also say that because we do the tokenization on the Java side now the tokens will still be sent correctly (separately) even if they extend beyond the length of a truncated example.

droberts195 · 2022-09-06T16:24:55Z

...n/java/org/elasticsearch/xpack/ml/job/process/autodetect/writer/JsonDataToProcessWriter.java

@@ -160,6 +160,11 @@ private void writeJson(CategorizationAnalyzer categorizationAnalyzer, XContentPa
            if (categorizationAnalyzer != null && categorizationFieldIndex != null) {
                tokenizeForCategorization(categorizationAnalyzer, input[categorizationFieldIndex], record);
            }
+
+            if (categorizationFieldIndex != null) {
+                record[categorizationFieldIndex] = analysisConfig.maybeTruncateCatgeorizationField(record[categorizationFieldIndex]);


This is assuming indexes into the input array are identical to indexes into the output array. I don't think that's always the case, otherwise lines 156-157 are pointless. So I think we'll need to separately remember the index into the output array or else do the truncation at a different point. Since we tokenise the value from the input array we could do the truncation in the loop around lines 156/157 so the output array contains a different string to the input array for this value.

edsavage · 2022-09-09T09:45:33Z

@elasticmachine update branch

droberts195

Looks good now apart from a couple of nits

droberts195 · 2022-09-09T10:03:28Z

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ml/job/config/AnalysisConfig.java

@@ -69,6 +69,12 @@ public class AnalysisConfig implements ToXContentObject, Writeable {
    public static final String ML_CATEGORY_FIELD = "mlcategory";
    public static final Set<String> AUTO_CREATED_FIELDS = new HashSet<>(Collections.singletonList(ML_CATEGORY_FIELD));

+    // Since the C++ backend truncates the categorization field at length 1000, adding an ellipsis on truncation, it makes no sense to


Please also mention it's in model::CCategoryExamplesCollector::MAX_EXAMPLE_LENGTH in case someone needs to tie up the two sides in the future.

droberts195 · 2022-09-09T10:04:47Z

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ml/job/config/AnalysisConfig.java

+    // send potentially very long strings to it. For the backend logic still to work we need to send more than that, hence we truncate
+    // at length 1001.
+    //
+    // Also, because we do the tokenization on the Java side now the tokens will still be sent correctly (separately) to the C++ backend


I think the time has come to delete the constant and the 6 places where it's used:

elasticsearch/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/MachineLearning.java

Line 487 in 9e530e1

public static final boolean CATEGORIZATION_TOKENIZATION_IN_JAVA = true;

Then this comment really will be true always.

…elasticsearch into truncate_categorization_fields

droberts195

LGTM

Truncate the raw categorization field passed to the backend at 1001 characters.

* main: (176 commits) Fix RandomSamplerAggregatorTests testAggregationSamplingNestedAggsScaled test failure (elastic#89958) [Downsampling] Replace document map with SMILE encoded doc (elastic#89495) Remove full cluster state from error logging in MasterService (elastic#89960) [ML] Truncate categorization fields (elastic#89827) [TSDB] Removed `summary` and `histogram` metric types (elastic#89937) Update testNodeSelectorRouting so that it does not depend on iteration order (elastic#89879) Make sure listener is resolved when file queue is cleared (elastic#89929) [Stable plugin api] Extensible annotation (elastic#89903) Fix double sending of response in TransportOpenIdConnectPrepareAuthenticationAction (elastic#89930) Make sure ivy repo directory exists before downloading artifacts Use 'file://' scheme for local repository URL Use DRA artifacts for release build CI jobs Log unsuccessful attempts to get credentials from web identity tokens (elastic#88241) Script: Write Field API path manipulation (elastic#89889) Fetch health info action (elastic#89820) Fix memory leak in TransportDeleteExpiredDataAction (elastic#89935) [ML] Performance improvements for categorization jobs (elastic#89824) [DOCS] Revert changes for ES_JAVA_OPTS (elastic#89931) Fix deadlock bug exposed by a test (elastic#89934) [Downsampling] Remove `FieldValueFetcher` validator (elastic#89497) ...

Truncate the raw categorization field passed to the backend at 1001 characters.

Truncate the raw categorization field passed to the backend at 1001 characters. backports #89827

[ML] Truncate categorization fields

d108f51

Truncate the raw categorization field passed to the backend at 1001 characters.

edsavage added >enhancement :ml Machine learning Team:ML Meta label for the ML team v8.5.0 v7.17.7 v8.4.2 labels Sep 6, 2022

droberts195 reviewed Sep 6, 2022

View reviewed changes

Attend to review comments

4471a36

Merge branch 'main' into truncate_categorization_fields

1b5bc0c

droberts195 reviewed Sep 9, 2022

View reviewed changes

droberts195 added the auto-backport-and-merge label Sep 9, 2022

edsavage added 2 commits September 9, 2022 11:48

Attend to further review comments

6bcb6fa

Merge branch 'truncate_categorization_fields' of github.com:edsavage/…

f35cdcb

…elasticsearch into truncate_categorization_fields

droberts195 approved these changes Sep 9, 2022

View reviewed changes

edsavage merged commit cc4db44 into elastic:main Sep 9, 2022

edsavage added a commit to edsavage/elasticsearch that referenced this pull request Sep 9, 2022

[ML] Truncate categorization fields (elastic#89827)

f1f95b2

Truncate the raw categorization field passed to the backend at 1001 characters.

This was referenced Sep 9, 2022

[8.4][ML] Truncate categorization fields (#89827) #89961

Merged

[7.17][ML] Truncate categorization fields (#89827) #89964

Merged

edsavage added a commit that referenced this pull request Sep 9, 2022

[ML] Truncate categorization fields (#89827) (#89961)

635de5e

Truncate the raw categorization field passed to the backend at 1001 characters.

edsavage deleted the truncate_categorization_fields branch September 9, 2022 15:12

edsavage added a commit that referenced this pull request Sep 9, 2022

[7.17][ML] Truncate categorization fields (#89827) (#89964)

85058a7

Truncate the raw categorization field passed to the backend at 1001 characters. backports #89827

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[ML] Truncate categorization fields #89827

[ML] Truncate categorization fields #89827

Uh oh!

edsavage commented Sep 6, 2022 •

edited

Loading

Uh oh!

elasticsearchmachine commented Sep 6, 2022

Uh oh!

droberts195 Sep 6, 2022

Uh oh!

droberts195 Sep 6, 2022

Uh oh!

droberts195 Sep 6, 2022

Uh oh!

edsavage commented Sep 9, 2022

Uh oh!

droberts195 left a comment

Uh oh!

droberts195 Sep 9, 2022

Uh oh!

droberts195 Sep 9, 2022

Uh oh!

droberts195 left a comment

Uh oh!

Uh oh!

[ML] Truncate categorization fields #89827

[ML] Truncate categorization fields #89827

Uh oh!

Conversation

edsavage commented Sep 6, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

elasticsearchmachine commented Sep 6, 2022

Uh oh!

droberts195 Sep 6, 2022

Choose a reason for hiding this comment

Uh oh!

droberts195 Sep 6, 2022

Choose a reason for hiding this comment

Uh oh!

droberts195 Sep 6, 2022

Choose a reason for hiding this comment

Uh oh!

edsavage commented Sep 9, 2022

Uh oh!

droberts195 left a comment

Choose a reason for hiding this comment

Uh oh!

droberts195 Sep 9, 2022

Choose a reason for hiding this comment

Uh oh!

droberts195 Sep 9, 2022

Choose a reason for hiding this comment

Uh oh!

droberts195 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

edsavage commented Sep 6, 2022 •

edited

Loading