-
Notifications
You must be signed in to change notification settings - Fork 25.4k
[ML] Add field stats to log structure finder #33351
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The log structure endpoint will return these in addition to pure structure information so that it can be used to drive pre-import data visualizer functionality. The statistics for every field are count, cardinality (distinct count) and top hits (most common values). Extra statistics are calculated if the field is numeric: min, max, mean and median.
Pinging @elastic/ml-core |
As with the other log structure functionality so far, this PR is marked |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Left a few comments, the biggest one being a design thought only to be adopted if it sounds appealing.
*/ | ||
public class FieldStatsCalculator { | ||
|
||
private int count; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the count be a long
throughout? It seems like it could overflow. Perhaps cardinality as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cardinality cannot overflow an int
because it's being calculated as a collection size and Collection.size()
is an int
.
If we were going to be dealing with huge counts or cardinalities then a set of all distinct values would be the wrong data structure - we'd want some sort of sketch. But I had in my mind that this class would only be used for the log structure finder where the count is relatively small.
I'll make count a long
but add a comment to the class to make clear that it's not designed for huge cardinalities.
fieldValues.stream().flatMap(LogStructureUtils::flatten).collect(Collectors.toList())); | ||
} | ||
|
||
return guessScalarMapping(explanation, fieldName, fieldValues.stream().map(Object::toString).collect(Collectors.toList())); | ||
Collection<String> fieldValuesAsStrings = fieldValues.stream().map(Object::toString).collect(Collectors.toList()); | ||
return new Tuple<>(guessScalarMapping(explanation, fieldName, fieldValuesAsStrings), calculateFieldStats(fieldValuesAsStrings)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we have the mapping here, I wonder if we should split the field stats calculator to a string one and a numeric one and instantiate the right one based on the mapping. That way the field stats calculator won't have to keep track of counts for the field both as string and numeric and in theory it opens the design for calculating stats on different types as well in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I considered this but then thought it would be nice if the class could be used in other situations where we don't know the mapping in advance. I guess we can revisit this based on future experiences.
|
||
for (Map.Entry<T, Integer> entry : sortedByCount) { | ||
|
||
Map<String, Object> topHit = new LinkedHashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Set capacity to 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll make it 3
because otherwise the hash map will reallocate its hash table once it has 2 items in it as it doesn't want to be 100% loaded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
The log structure endpoint will return these in addition to pure structure information so that it can be used to drive pre-import data visualizer functionality. The statistics for every field are count, cardinality (distinct count) and top hits (most common values). Extra statistics are calculated if the field is numeric: min, max, mean and median.
* master: Fix deprecated setting specializations (elastic#33412) HLRC: split cluster request converters (elastic#33400) HLRC: Add ML get influencers API (elastic#33389) Add conditional token filter to elasticsearch (elastic#31958) Build: Merge xpack checkstyle config into core (elastic#33399) Disable IndexRecoveryIT.testRerouteRecovery. INGEST: Implement Drop Processor (elastic#32278) [ML] Add field stats to log structure finder (elastic#33351) Add interval response parameter to AutoDateInterval histogram (elastic#33254) MINOR+CORE: Remove Dead Methods ClusterService (elastic#33346)
The log structure endpoint will return these in addition to
pure structure information so that it can be used to drive
pre-import data visualizer functionality.
The statistics for every field are count, cardinality
(distinct count) and top hits (most common values). Extra
statistics are calculated if the field is numeric: min, max,
mean and median.