Skip to content

[ML] Add field stats to log structure finder #33351

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Sep 5, 2018

Conversation

droberts195
Copy link
Contributor

The log structure endpoint will return these in addition to
pure structure information so that it can be used to drive
pre-import data visualizer functionality.

The statistics for every field are count, cardinality
(distinct count) and top hits (most common values). Extra
statistics are calculated if the field is numeric: min, max,
mean and median.

The log structure endpoint will return these in addition to
pure structure information so that it can be used to drive
pre-import data visualizer functionality.

The statistics for every field are count, cardinality
(distinct count) and top hits (most common values).  Extra
statistics are calculated if the field is numeric: min, max,
mean and median.
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core

@droberts195
Copy link
Contributor Author

As with the other log structure functionality so far, this PR is marked >non-issue as it's currently dead code. It will be documented when the endpoint is added.

Copy link
Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Left a few comments, the biggest one being a design thought only to be adopted if it sounds appealing.

*/
public class FieldStatsCalculator {

private int count;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the count be a long throughout? It seems like it could overflow. Perhaps cardinality as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cardinality cannot overflow an int because it's being calculated as a collection size and Collection.size() is an int.

If we were going to be dealing with huge counts or cardinalities then a set of all distinct values would be the wrong data structure - we'd want some sort of sketch. But I had in my mind that this class would only be used for the log structure finder where the count is relatively small.

I'll make count a long but add a comment to the class to make clear that it's not designed for huge cardinalities.

fieldValues.stream().flatMap(LogStructureUtils::flatten).collect(Collectors.toList()));
}

return guessScalarMapping(explanation, fieldName, fieldValues.stream().map(Object::toString).collect(Collectors.toList()));
Collection<String> fieldValuesAsStrings = fieldValues.stream().map(Object::toString).collect(Collectors.toList());
return new Tuple<>(guessScalarMapping(explanation, fieldName, fieldValuesAsStrings), calculateFieldStats(fieldValuesAsStrings));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we have the mapping here, I wonder if we should split the field stats calculator to a string one and a numeric one and instantiate the right one based on the mapping. That way the field stats calculator won't have to keep track of counts for the field both as string and numeric and in theory it opens the design for calculating stats on different types as well in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered this but then thought it would be nice if the class could be used in other situations where we don't know the mapping in advance. I guess we can revisit this based on future experiences.


for (Map.Entry<T, Integer> entry : sortedByCount) {

Map<String, Object> topHit = new LinkedHashMap<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Set capacity to 2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make it 3 because otherwise the hash map will reallocate its hash table once it has 2 items in it as it doesn't want to be 100% loaded.

Copy link
Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@droberts195 droberts195 merged commit a296829 into elastic:master Sep 5, 2018
@droberts195 droberts195 deleted the add_field_stats branch September 5, 2018 11:57
droberts195 added a commit that referenced this pull request Sep 5, 2018
The log structure endpoint will return these in addition to
pure structure information so that it can be used to drive
pre-import data visualizer functionality.

The statistics for every field are count, cardinality
(distinct count) and top hits (most common values).  Extra
statistics are calculated if the field is numeric: min, max,
mean and median.
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Sep 5, 2018
* master:
  Fix deprecated setting specializations (elastic#33412)
  HLRC: split cluster request converters (elastic#33400)
  HLRC: Add ML get influencers API (elastic#33389)
  Add conditional token filter to elasticsearch (elastic#31958)
  Build: Merge xpack checkstyle config into core (elastic#33399)
  Disable IndexRecoveryIT.testRerouteRecovery.
  INGEST: Implement Drop Processor (elastic#32278)
  [ML] Add field stats to log structure finder (elastic#33351)
  Add interval response parameter to AutoDateInterval histogram (elastic#33254)
  MINOR+CORE: Remove Dead Methods ClusterService (elastic#33346)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants