Conversation

@droberts195

In #1903 we changed dictionary weighting in categorization to give
higher weighting when there were 3 or more adjacent dictionary
words. This was the first time the same token could have a
different weight in different messages. Unfortunately, this
interacted badly with the requirement that weights be equal when
checking for common tokens, meaning tokens could be bizarrely
removed from categories. For example, we'd put the following two
messages in the same category but say that "started" was not a
common token:

  • Service abcd was started
  • Service reaper was started

This happens because "abcd" is not a dictionary word but "reaper"
is, so "started" has weight 6 in the first message but weight 31
in the second. Considering "started" NOT to be a common token in
this case is extremely bad, both intuitively and for the accuracy
of drilldown searches.

Therefore this PR changes the categorization code to consider
tokens equal if their token IDs are equal, even when their weights
differ. Weights are now used only to compute the distance between
different tokens.
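
To make the new rule concrete, here is a minimal sketch of the old
versus new common-token comparison; the struct and function names are
hypothetical, not the actual ml-cpp types:

```cpp
#include <cstddef>

// Hypothetical token representation (illustrative only).
struct TokenAndWeight {
    std::size_t tokenId;
    std::size_t weight;
};

// Old behaviour: "started" with weight 6 and "started" with weight 31
// compared unequal, so it was dropped as a common token.
bool oldEquals(const TokenAndWeight& a, const TokenAndWeight& b) {
    return a.tokenId == b.tokenId && a.weight == b.weight;
}

// New behaviour: tokens are equal whenever their IDs match; weight now
// matters only when computing the distance between different tokens.
bool newEquals(const TokenAndWeight& a, const TokenAndWeight& b) {
    return a.tokenId == b.tokenId;
}
```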

This necessitates a second change. It is no longer as simple as it
used to be to calculate the highest and lowest possible total
weight of a message that might be considered similar to the
current message. This calculation now needs to take account of
possible adjacency weighting, either in the current message or in
the messages being considered as matches. (This also has the side
effect that we'll do more of the expensive Levenshtein distance
calculations, as fewer potential matches will be discarded early
by the simple weight check.)
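
As a rough illustration of why the pre-check gets looser, here is a
sketch under assumed parameters; the similarity threshold and
worst-case boost factor are made up for the example:

```cpp
// Assumed values for illustration only.
constexpr double SIMILARITY_THRESHOLD = 0.7; // fraction of weight that must match
constexpr double MAX_ADJACENCY_BOOST = 5.0;  // hypothetical worst-case boost factor

// A stored category survives the cheap pre-check only if its total
// token weight could plausibly be similar to the current message's.
// Adjacency boosting can inflate either total, so both bounds must be
// widened by the worst-case boost, which lets more candidates through
// to the expensive Levenshtein comparison.
bool couldPossiblyMatch(double currentTotalWeight, double storedTotalWeight) {
    double lowest = SIMILARITY_THRESHOLD * currentTotalWeight / MAX_ADJACENCY_BOOST;
    double highest = currentTotalWeight / SIMILARITY_THRESHOLD * MAX_ADJACENCY_BOOST;
    return storedTotalWeight >= lowest && storedTotalWeight <= highest;
}
```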

droberts195 pushed a commit to droberts195/elasticsearch that referenced this pull request May 23, 2022
More changes to sync with elastic/ml-cpp#2277
droberts195 requested a review from edsavage May 23, 2022 15:37
Comment on lines -179 to 187
      500));
- BOOST_REQUIRE_EQUAL(ml::model::CLocalCategoryId{5},
+ BOOST_REQUIRE_EQUAL(ml::model::CLocalCategoryId{4},
      categorizer.computeCategory(false,
-         " [1111529792] INFO session <45409105041220090733@192.168.251.123> - ----------------- PROXY "
-         "Session DESTROYED --------------------",
+         " [1111529792] INFO session <45409105041220090733@192.168.251.123> - ----------------- "
+         "PROXY Session DESTROYED --------------------",
          500));
- BOOST_REQUIRE_EQUAL(ml::model::CLocalCategoryId{6},
+ BOOST_REQUIRE_EQUAL(ml::model::CLocalCategoryId{4},
      categorizer.computeCategory(false,
          " [1094662464] INFO session <ch6z1bho8xeprb3z4ty604iktl6c@dave.proxy.uk> - ----------------- "
          "PROXY Session DESTROYED --------------------",
@droberts195 (Author) May 23, 2022

This is an example of an improvement from these changes. These two messages are very similar and intuitively should go in the same category. Previously they didn't, because "PROXY" and "Session" were weighted differently in the two messages.

Comment on lines +342 to +350
+ BOOST_REQUIRE_EQUAL(ml::model::CLocalCategoryId{1},
+     categorizer.computeCategory(false, "combo ftpd[7045]: connection from 84.232.2.50 () at Mon Jan 9 23:44:50 2006",
+         76));
+
+ BOOST_REQUIRE_EQUAL(ml::model::CLocalCategoryId{1},
+     categorizer.computeCategory(false,
+         "combo ftpd[6527]: connection from 60.45.101.89 "
+         "(p15025-ipadfx01yosida.nagano.ocn.ne.jp) at Mon Jan 9 17:39:05 2006",
+         115));
@droberts195 (Author)

This is the example I saw on the Java side (while debugging elastic/elasticsearch#85872) that alerted me to the problem. Intuitively these messages are in the same category, but the old code would put them in different categories due to different weighting on the word "at".

Comment on lines 623 to 628
  BOOST_REQUIRE_EQUAL(ml::model::CLocalCategoryId{2},
      categorizer.computeCategory(false, "<ml13-4608.1.p2ps: Info: > Source ML_SERVICE2 on 13122:867 has started.",
          500));
- BOOST_REQUIRE_EQUAL(ml::model::CLocalCategoryId{3},
+ BOOST_REQUIRE_EQUAL(ml::model::CLocalCategoryId{2},
      categorizer.computeCategory(false, "<ml00-4201.1.p2ps: Info: > Service CUBE_CHIX, id of 132, has started.",
          500));
@droberts195 (Author) May 23, 2022

This is a case that comes out worse after these changes. Previously it was nice that the extra weight on "Service" and "Source" meant that we differentiated services starting from sources starting. After the adjacency weighting changes we only carried on doing that in this unit test because the token "has" was considered not to match between the two messages (it got the higher adjacency weighting in the second message but not in the first). Now that "has" is considered a match, that plus the very heavily weighted verb "started" puts the two messages in the same category.

It's not ideal, but I think the changes in this PR are more justifiable than what we had from #1903.

In the long term I think we should try to do the following:

  1. Have the tokenizer do the weighting rather than the categorizer - this will facilitate 2 and 3
  2. Only give higher weighting to adjacent dictionary words if they were separated by whitespace in the original message, not discarded tokens
  3. Having decided we have a sufficiently long run of adjacent dictionary words, give higher weighting to all of them, not just the 3rd onwards - this will be fairer when one of the first two words is important (like "Source" vs "Service" in this case); a rough sketch of 2 and 3 follows below
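
A rough sketch of what proposals 2 and 3 might look like; the types,
names and boost factor are hypothetical, and this is not the current
code:

```cpp
#include <cstddef>
#include <vector>

struct Token {
    bool isDictionaryWord;
    bool precededOnlyByWhitespace; // proposal 2: adjacency in the raw message
    std::size_t weight;
};

// Proposal 3: boost every token in each run of 3+ adjacent dictionary
// words, not just the 3rd word onwards.
void boostAdjacentRuns(std::vector<Token>& tokens, std::size_t boost) {
    std::size_t i{0};
    while (i < tokens.size()) {
        if (!tokens[i].isDictionaryWord) {
            ++i;
            continue;
        }
        std::size_t runStart{i++};
        // Extend the run while the next token is a dictionary word that
        // was separated only by whitespace in the original message.
        while (i < tokens.size() && tokens[i].isDictionaryWord &&
               tokens[i].precededOnlyByWhitespace) {
            ++i;
        }
        if (i - runStart >= 3) {
            for (std::size_t j = runStart; j < i; ++j) {
                tokens[j].weight *= boost;
            }
        }
    }
}
```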

@edsavage (Contributor) left a comment

LGTM

droberts195 pushed a commit to elastic/elasticsearch that referenced this pull request May 23, 2022
…85872)

This replaces the implementation of the categorize_text aggregation
with the new algorithm that was added in #80867. The new algorithm
works in the same way as the ML C++ code used for categorization jobs
(and now includes the fixes of elastic/ml-cpp#2277).

The docs are updated to reflect the workings of the new implementation.

Windows.h can get included via Boost headers, so undefining
min and max in our header isn't always enough.

Luckily it turns out there's a NOMINMAX macro that can be
defined globally to tame Windows.h.
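
For reference, the usual pattern is sketched below; in a real build
NOMINMAX would normally be defined by the build system (e.g. a
/DNOMINMAX compiler flag) rather than in source:

```cpp
// Define NOMINMAX before anything that might include Windows.h, so its
// min/max macros are never defined in the first place.
#define NOMINMAX
#include <windows.h>

#include <algorithm>

int smaller(int a, int b) {
    return std::min(a, b); // safe: no min macro clashes with std::min
}
```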
droberts195 merged commit b08e8cc into elastic:main May 23, 2022
droberts195 deleted the adjacent_word_weighting_fixes branch May 23, 2022 23:00
droberts195 pushed a commit to droberts195/ml-cpp that referenced this pull request May 24, 2022
Backport of elastic#2277
droberts195 pushed a commit to droberts195/ml-cpp that referenced this pull request May 24, 2022
Backport of elastic#2277
droberts195 pushed a commit that referenced this pull request May 24, 2022
Backport of #2277
droberts195 pushed a commit that referenced this pull request May 24, 2022
Backport of #2277