Skip to content

[ML] Truncate categorization fields #89827

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Sep 9, 2022

Conversation

edsavage
Copy link
Contributor

@edsavage edsavage commented Sep 6, 2022

Truncate the raw categorization field passed to the backend at 1001 characters. This prevents excessively long messages being sent to the C++ backend that will only be truncated there anyway.

Truncate the raw categorization field passed to the backend at 1001 characters.
@edsavage edsavage added >enhancement :ml Machine learning Team:ML Meta label for the ML team v8.5.0 v7.17.7 v8.4.2 labels Sep 6, 2022
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

@@ -336,6 +338,13 @@ public List<String> fields() {
return collectNonNullAndNonEmptyDetectorFields(Detector::getFieldName);
}

public String maybeTruncateCatgeorizationField(String categorizationField) {
if (termFields().contains(categorizationFieldName) == false) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really inefficient because termFields() is building a Set from scratch each time it's called, and then that Set becomes garbage after just one lookup.

It needs to be changed to build the set once and cache it. That probably means moving this logic into the AbstractDataToProcessWriter class, which could cache the term fields on construction.

@@ -69,6 +69,8 @@ public class AnalysisConfig implements ToXContentObject, Writeable {
public static final String ML_CATEGORY_FIELD = "mlcategory";
public static final Set<String> AUTO_CREATED_FIELDS = new HashSet<>(Collections.singletonList(ML_CATEGORY_FIELD));

public static final int MAX_CATEGORIZATION_FIELD_LENGTH = 1001;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a comment here saying why this is 1001.

Say that the C++ truncates at length 1000 (which is model::CCategoryExamplesCollector::MAX_EXAMPLE_LENGTH), and adds an ellipsis on truncation, hence we need to send more than that so this logic still works.

And also say that because we do the tokenization on the Java side now the tokens will still be sent correctly (separately) even if they extend beyond the length of a truncated example.

@@ -160,6 +160,11 @@ private void writeJson(CategorizationAnalyzer categorizationAnalyzer, XContentPa
if (categorizationAnalyzer != null && categorizationFieldIndex != null) {
tokenizeForCategorization(categorizationAnalyzer, input[categorizationFieldIndex], record);
}

if (categorizationFieldIndex != null) {
record[categorizationFieldIndex] = analysisConfig.maybeTruncateCatgeorizationField(record[categorizationFieldIndex]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is assuming indexes into the input array are identical to indexes into the output array. I don't think that's always the case, otherwise lines 156-157 are pointless. So I think we'll need to separately remember the index into the output array or else do the truncation at a different point. Since we tokenise the value from the input array we could do the truncation in the loop around lines 156/157 so the output array contains a different string to the input array for this value.

@edsavage
Copy link
Contributor Author

edsavage commented Sep 9, 2022

@elasticmachine update branch

Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good now apart from a couple of nits

@@ -69,6 +69,12 @@ public class AnalysisConfig implements ToXContentObject, Writeable {
public static final String ML_CATEGORY_FIELD = "mlcategory";
public static final Set<String> AUTO_CREATED_FIELDS = new HashSet<>(Collections.singletonList(ML_CATEGORY_FIELD));

// Since the C++ backend truncates the categorization field at length 1000, adding an ellipsis on truncation, it makes no sense to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also mention it's in model::CCategoryExamplesCollector::MAX_EXAMPLE_LENGTH in case someone needs to tie up the two sides in the future.

// send potentially very long strings to it. For the backend logic still to work we need to send more than that, hence we truncate
// at length 1001.
//
// Also, because we do the tokenization on the Java side now the tokens will still be sent correctly (separately) to the C++ backend
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the time has come to delete the constant and the 6 places where it's used:

public static final boolean CATEGORIZATION_TOKENIZATION_IN_JAVA = true;

Then this comment really will be true always.

Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@edsavage edsavage merged commit cc4db44 into elastic:main Sep 9, 2022
edsavage added a commit to edsavage/elasticsearch that referenced this pull request Sep 9, 2022
Truncate the raw categorization field passed to the backend at 1001 characters.
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Sep 9, 2022
* main: (176 commits)
  Fix RandomSamplerAggregatorTests testAggregationSamplingNestedAggsScaled test failure (elastic#89958)
  [Downsampling] Replace document map with SMILE encoded doc (elastic#89495)
  Remove full cluster state from error logging in MasterService (elastic#89960)
  [ML] Truncate categorization fields (elastic#89827)
  [TSDB] Removed `summary` and `histogram` metric types (elastic#89937)
  Update testNodeSelectorRouting so that it does not depend on iteration order (elastic#89879)
  Make sure listener is resolved when file queue is cleared (elastic#89929)
  [Stable plugin api] Extensible annotation (elastic#89903)
  Fix double sending of response in TransportOpenIdConnectPrepareAuthenticationAction (elastic#89930)
  Make sure ivy repo directory exists before downloading artifacts
  Use 'file://' scheme for local repository URL
  Use DRA artifacts for release build CI jobs
  Log unsuccessful attempts to get credentials from web identity tokens (elastic#88241)
  Script: Write Field API path manipulation (elastic#89889)
  Fetch health info action (elastic#89820)
  Fix memory leak in TransportDeleteExpiredDataAction (elastic#89935)
  [ML] Performance improvements for categorization jobs (elastic#89824)
  [DOCS] Revert changes for ES_JAVA_OPTS (elastic#89931)
  Fix deadlock bug exposed by a test (elastic#89934)
  [Downsampling] Remove `FieldValueFetcher` validator (elastic#89497)
  ...
edsavage added a commit that referenced this pull request Sep 9, 2022
Truncate the raw categorization field passed to the backend at 1001 characters.
@edsavage edsavage deleted the truncate_categorization_fields branch September 9, 2022 15:12
edsavage added a commit that referenced this pull request Sep 9, 2022
Truncate the raw categorization field passed to the backend at 1001 characters.

backports #89827
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :ml Machine learning Team:ML Meta label for the ML team v7.17.7 v8.4.2 v8.5.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants