-
Notifications
You must be signed in to change notification settings - Fork 25.4k
[ML] Truncate categorization fields #89827
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ML] Truncate categorization fields #89827
Conversation
Truncate the raw categorization field passed to the backend at 1001 characters.
Pinging @elastic/ml-core (Team:ML) |
@@ -336,6 +338,13 @@ public List<String> fields() { | |||
return collectNonNullAndNonEmptyDetectorFields(Detector::getFieldName); | |||
} | |||
|
|||
public String maybeTruncateCatgeorizationField(String categorizationField) { | |||
if (termFields().contains(categorizationFieldName) == false) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really inefficient because termFields()
is building a Set
from scratch each time it's called, and then that Set
becomes garbage after just one lookup.
It needs to be changed to build the set once and cache it. That probably means moving this logic into the AbstractDataToProcessWriter
class, which could cache the term fields on construction.
@@ -69,6 +69,8 @@ public class AnalysisConfig implements ToXContentObject, Writeable { | |||
public static final String ML_CATEGORY_FIELD = "mlcategory"; | |||
public static final Set<String> AUTO_CREATED_FIELDS = new HashSet<>(Collections.singletonList(ML_CATEGORY_FIELD)); | |||
|
|||
public static final int MAX_CATEGORIZATION_FIELD_LENGTH = 1001; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment here saying why this is 1001.
Say that the C++ truncates at length 1000 (which is model::CCategoryExamplesCollector::MAX_EXAMPLE_LENGTH
), and adds an ellipsis on truncation, hence we need to send more than that so this logic still works.
And also say that because we do the tokenization on the Java side now the tokens will still be sent correctly (separately) even if they extend beyond the length of a truncated example.
@@ -160,6 +160,11 @@ private void writeJson(CategorizationAnalyzer categorizationAnalyzer, XContentPa | |||
if (categorizationAnalyzer != null && categorizationFieldIndex != null) { | |||
tokenizeForCategorization(categorizationAnalyzer, input[categorizationFieldIndex], record); | |||
} | |||
|
|||
if (categorizationFieldIndex != null) { | |||
record[categorizationFieldIndex] = analysisConfig.maybeTruncateCatgeorizationField(record[categorizationFieldIndex]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is assuming indexes into the input array are identical to indexes into the output array. I don't think that's always the case, otherwise lines 156-157 are pointless. So I think we'll need to separately remember the index into the output array or else do the truncation at a different point. Since we tokenise the value from the input array we could do the truncation in the loop around lines 156/157 so the output array contains a different string to the input array for this value.
@elasticmachine update branch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good now apart from a couple of nits
@@ -69,6 +69,12 @@ public class AnalysisConfig implements ToXContentObject, Writeable { | |||
public static final String ML_CATEGORY_FIELD = "mlcategory"; | |||
public static final Set<String> AUTO_CREATED_FIELDS = new HashSet<>(Collections.singletonList(ML_CATEGORY_FIELD)); | |||
|
|||
// Since the C++ backend truncates the categorization field at length 1000, adding an ellipsis on truncation, it makes no sense to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also mention it's in model::CCategoryExamplesCollector::MAX_EXAMPLE_LENGTH
in case someone needs to tie up the two sides in the future.
// send potentially very long strings to it. For the backend logic still to work we need to send more than that, hence we truncate | ||
// at length 1001. | ||
// | ||
// Also, because we do the tokenization on the Java side now the tokens will still be sent correctly (separately) to the C++ backend |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the time has come to delete the constant and the 6 places where it's used:
elasticsearch/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/MachineLearning.java
Line 487 in 9e530e1
public static final boolean CATEGORIZATION_TOKENIZATION_IN_JAVA = true; |
Then this comment really will be true always.
…elasticsearch into truncate_categorization_fields
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Truncate the raw categorization field passed to the backend at 1001 characters.
* main: (176 commits) Fix RandomSamplerAggregatorTests testAggregationSamplingNestedAggsScaled test failure (elastic#89958) [Downsampling] Replace document map with SMILE encoded doc (elastic#89495) Remove full cluster state from error logging in MasterService (elastic#89960) [ML] Truncate categorization fields (elastic#89827) [TSDB] Removed `summary` and `histogram` metric types (elastic#89937) Update testNodeSelectorRouting so that it does not depend on iteration order (elastic#89879) Make sure listener is resolved when file queue is cleared (elastic#89929) [Stable plugin api] Extensible annotation (elastic#89903) Fix double sending of response in TransportOpenIdConnectPrepareAuthenticationAction (elastic#89930) Make sure ivy repo directory exists before downloading artifacts Use 'file://' scheme for local repository URL Use DRA artifacts for release build CI jobs Log unsuccessful attempts to get credentials from web identity tokens (elastic#88241) Script: Write Field API path manipulation (elastic#89889) Fetch health info action (elastic#89820) Fix memory leak in TransportDeleteExpiredDataAction (elastic#89935) [ML] Performance improvements for categorization jobs (elastic#89824) [DOCS] Revert changes for ES_JAVA_OPTS (elastic#89931) Fix deadlock bug exposed by a test (elastic#89934) [Downsampling] Remove `FieldValueFetcher` validator (elastic#89497) ...
Truncate the raw categorization field passed to the backend at 1001 characters. This prevents excessively long messages being sent to the C++ backend that will only be truncated there anyway.