[ML] Warn if ML categorization job is using data that does not categorize well #50749

Closed
@sophiec20

What
If an ML categorization job creates a very large number of categories, the data is probably not worth categorizing. To be defensive, we should audit a warning message for jobs where the number of categories is high. This warning would be visible in the job messages in the UI but would not stop the job from continuing.

It is difficult to define what "high" means because this is data dependent. It could be a ratio of categories to records_processed once a useful learning period has elapsed. Or it could be a hard upper limit on the total number of categories (taking into account multiple partitions if they are configured). Or both.

Ideally this check would be performed in the early stages of the job, after it has had a chance to analyze a useful amount of data. This could be at the end of a lookback (before starting real-time) or, say, after 100 buckets or 1 day (whichever is sooner) for real-time only jobs.
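As a rough illustration of the check described above, here is a minimal Python sketch. It is not an existing Elasticsearch API. The threshold values, the `JobStats` fields other than `records_processed`, and the function names are all hypothetical, chosen only to show how the ratio, the hard limit, and the learning period could combine.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds: the issue deliberately leaves "high" undefined.
MAX_CATEGORY_RATIO = 0.5       # categories per processed record
MAX_TOTAL_CATEGORIES = 1000    # hard cap, summed across partitions
MIN_BUCKETS = 100              # learning period for real-time only jobs
MIN_SECONDS = 24 * 60 * 60     # 1 day, whichever comes sooner


@dataclass
class JobStats:
    """Hypothetical snapshot of a categorization job's counters."""
    records_processed: int
    total_categories: int       # summed across partitions, if configured
    buckets_processed: int
    seconds_analyzed: float
    lookback_finished: bool = False


def learning_period_elapsed(stats: JobStats) -> bool:
    # End of lookback, or 100 buckets, or 1 day: whichever happens first.
    return (stats.lookback_finished
            or stats.buckets_processed >= MIN_BUCKETS
            or stats.seconds_analyzed >= MIN_SECONDS)


def categorization_warning(stats: JobStats) -> Optional[str]:
    """Return a warning message, or None if no warning should be audited."""
    if not learning_period_elapsed(stats) or stats.records_processed == 0:
        return None
    if stats.total_categories > MAX_TOTAL_CATEGORIES:
        return (f"{stats.total_categories} categories exceeds the limit of "
                f"{MAX_TOTAL_CATEGORIES}; the data may not categorize well")
    ratio = stats.total_categories / stats.records_processed
    if ratio > MAX_CATEGORY_RATIO:
        return (f"category-to-record ratio {ratio:.2f} exceeds "
                f"{MAX_CATEGORY_RATIO}; the data may not categorize well")
    return None
```

The function only returns a message and never raises, which matches the intent that the warning should not stop the job.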

Re-assessing this warning during the lifetime of a real-time job would also have some value in cases where the input data changes. However, this could get annoying if done too frequently.

Why
Log categorization groups unstructured log messages into categories. For example, "Fred accessed file bananas.txt" and "Wilma accessed file apples.txt" would be considered the same message category. From here, you can use current anomaly detection to model and identify unusual counts of categories of log message and/or rare log message categories.

Creating an ML categorization job requires a timestamp field and a message field. Categorization works best on machine-written log messages, typically logging written by a developer for the purpose of system troubleshooting. For example, we would get very poor results trying to categorize each sentence in the complete works of Shakespeare, because the sentences are all different and do not share a similar structure. However, we would generally get good results when categorizing application logs with repeated messages (where only certain fields change in each document, e.g. hostname, IP address, username).
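To make the contrast concrete, here is a toy Python sketch. It is explicitly not the ML categorization algorithm. It is a crude stand-in, assumed purely for illustration, that puts two messages in the same category when they have the same number of whitespace-separated tokens and at least half of the token positions match. The function name `categorize` and the `min_match` parameter are hypothetical.

```python
def categorize(messages, min_match=0.5):
    """Greedy toy categorizer; returns a list of categories (lists of messages)."""
    categories = []  # each entry: (template_tokens, member_messages)
    for message in messages:
        tokens = message.split()
        if not tokens:
            continue  # ignore empty messages
        for template, members in categories:
            if len(template) != len(tokens):
                continue
            matches = sum(a == b for a, b in zip(template, tokens))
            if matches / len(tokens) >= min_match:
                members.append(message)
                break
        else:
            # No existing category was close enough: start a new one.
            categories.append((tokens, [message]))
    return [members for _, members in categories]


logs = [
    "Fred accessed file bananas.txt",
    "Wilma accessed file apples.txt",
    "Barney accessed file pears.txt",
]
prose = [
    "To be, or not to be, that is the question",
    "All the world's a stage",
    "The lady doth protest too much, methinks",
]

print(len(categorize(logs)), "category for", len(logs), "log lines")
print(len(categorize(prose)), "categories for", len(prose), "sentences")
```

With this toy rule the three log lines collapse into one category, while each sentence of prose ends up in its own category. That one-category-per-record pattern is the kind of data a warning would aim to flag.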

Consequently, an ML categorization job is only worth using if the data it is analyzing is suitable for categorization. This is not necessarily obvious to all potential users of the system, so we should attempt to warn users if the job is not categorizing well.

When
Log categorization has been part of ML anomaly detection for a long time, but has been a bit of a hidden feature. This is now changing.

In 7.6 (tbc) we are working on a new ML UI wizard (elastic/kibana#53009) which will make it easier to create categorization jobs. The Logs UI Observability team is also working on integrating with ML (elastic/kibana#53004).

With the categorization feature becoming more visible, we should look at how we can enhance its usability so that users get a better experience of the functionality.
