[ML] Adjust structure finder for Joda to Java time migration #37306

Merged
64 changes: 33 additions & 31 deletions docs/reference/ml/apis/find-file-structure.asciidoc
Original file line number Diff line number Diff line change
@@ -164,37 +164,40 @@ format corresponds to the primary timestamp, but you do not want to specify the
full `grok_pattern`.

If this parameter is not specified, the structure finder chooses the best format from
the formats it knows, which are these Joda formats and their Java time equivalents:

* `dd/MMM/YYYY:HH:mm:ss Z`
* `EEE MMM dd HH:mm zzz YYYY`
* `EEE MMM dd HH:mm:ss YYYY`
* `EEE MMM dd HH:mm:ss zzz YYYY`
* `EEE MMM dd YYYY HH:mm zzz`
* `EEE MMM dd YYYY HH:mm:ss zzz`
* `EEE, dd MMM YYYY HH:mm Z`
* `EEE, dd MMM YYYY HH:mm ZZ`
* `EEE, dd MMM YYYY HH:mm:ss Z`
* `EEE, dd MMM YYYY HH:mm:ss ZZ`
the formats it knows, which are these Java time formats and their Joda equivalents:

* `dd/MMM/yyyy:HH:mm:ss XX`
* `EEE MMM dd HH:mm zzz yyyy`
* `EEE MMM dd HH:mm:ss yyyy`
* `EEE MMM dd HH:mm:ss zzz yyyy`
* `EEE MMM dd yyyy HH:mm zzz`
* `EEE MMM dd yyyy HH:mm:ss zzz`
* `EEE, dd MMM yyyy HH:mm XX`
* `EEE, dd MMM yyyy HH:mm XXX`
* `EEE, dd MMM yyyy HH:mm:ss XX`
* `EEE, dd MMM yyyy HH:mm:ss XXX`
* `ISO8601`
* `MMM d HH:mm:ss`
* `MMM d HH:mm:ss,SSS`
* `MMM d YYYY HH:mm:ss`
* `MMM d yyyy HH:mm:ss`
* `MMM dd HH:mm:ss`
* `MMM dd HH:mm:ss,SSS`
* `MMM dd YYYY HH:mm:ss`
* `MMM dd, YYYY h:mm:ss a`
* `MMM dd yyyy HH:mm:ss`
* `MMM dd, yyyy h:mm:ss a`
* `TAI64N`
* `UNIX`
* `UNIX_MS`
* `YYYY-MM-dd HH:mm:ss`
* `YYYY-MM-dd HH:mm:ss,SSS`
* `YYYY-MM-dd HH:mm:ss,SSS Z`
* `YYYY-MM-dd HH:mm:ss,SSSZ`
* `YYYY-MM-dd HH:mm:ss,SSSZZ`
* `YYYY-MM-dd HH:mm:ssZ`
* `YYYY-MM-dd HH:mm:ssZZ`
* `YYYYMMddHHmmss`
* `yyyy-MM-dd HH:mm:ss`
* `yyyy-MM-dd HH:mm:ss,SSS`
* `yyyy-MM-dd HH:mm:ss,SSS XX`
* `yyyy-MM-dd HH:mm:ss,SSSXX`
* `yyyy-MM-dd HH:mm:ss,SSSXXX`
* `yyyy-MM-dd HH:mm:ssXX`
* `yyyy-MM-dd HH:mm:ssXXX`
* `yyyy-MM-dd'T'HH:mm:ss,SSS`
* `yyyy-MM-dd'T'HH:mm:ss,SSSXX`
* `yyyy-MM-dd'T'HH:mm:ss,SSSXXX`
* `yyyyMMddHHmmss`

--
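The pattern changes in this list follow from how the two libraries read pattern letters. In Joda-Time `YYYY` is the year of era and `Z`/`ZZ` are offsets without and with a colon; in Java time `YYYY` is the week-based year, so the equivalents use `yyyy`, `XX` and `XXX`. The following is a minimal sketch using `java.time.format.DateTimeFormatter`; the class name `PatternLetterSketch`, its method names and the sample values are assumptions made for illustration and are not part of this PR.

```java
import java.time.LocalDate;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper class, not part of the PR.
final class PatternLetterSketch {

    // Parses a timestamp with one of the Java time patterns from the list above.
    // "XX" accepts an offset without a colon, such as +0100.
    static OffsetDateTime parseWithOffset(String text) {
        DateTimeFormatter formatter =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSSXX", Locale.ROOT);
        return OffsetDateTime.parse(text, formatter);
    }

    // Formats a date using only the given year pattern letters ("yyyy" or "YYYY").
    static String yearOf(LocalDate date, String pattern) {
        return DateTimeFormatter.ofPattern(pattern, Locale.ROOT).format(date);
    }

    public static void main(String[] args) {
        System.out.println(parseWithOffset("2018-05-24 17:28:31,735+0100"));

        // Around New Year the week-based year can differ from the calendar year.
        LocalDate lastDayOf2018 = LocalDate.of(2018, 12, 31);
        System.out.println("yyyy: " + yearOf(lastDayOf2018, "yyyy"));
        System.out.println("YYYY: " + yearOf(lastDayOf2018, "YYYY"));
    }
}
```

For 31 December 2018, `yyyy` formats the calendar year, while `YYYY` formats the week-based year, which already belongs to 2019 under the usual week definitions. This is why a Joda pattern cannot simply be reused as a Java time pattern.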

@@ -603,11 +606,11 @@ If the request does not encounter errors, you receive the following result:
},
"tpep_dropoff_datetime" : {
"type" : "date",
"format" : "YYYY-MM-dd HH:mm:ss"
"format" : "8yyyy-MM-dd HH:mm:ss"
},
"tpep_pickup_datetime" : {
"type" : "date",
"format" : "YYYY-MM-dd HH:mm:ss"
"format" : "8yyyy-MM-dd HH:mm:ss"
},
"trip_distance" : {
"type" : "double"
@@ -621,7 +624,7 @@ If the request does not encounter errors, you receive the following result:
"field" : "tpep_pickup_datetime",
"timezone" : "{{ beat.timezone }}",
"formats" : [
"YYYY-MM-dd HH:mm:ss"
"8yyyy-MM-dd HH:mm:ss"
]
}
}
@@ -1287,10 +1290,9 @@ If the request does not encounter errors, you receive the following result:
was chosen because it comes first in the column order. If you prefer
`tpep_dropoff_datetime` then force it to be chosen using the
`timestamp_field` query parameter.
<8> `joda_timestamp_formats` are used to tell Logstash and Ingest pipeline how
to parse timestamps.
<8> `joda_timestamp_formats` are used to tell Logstash how to parse timestamps.
<9> `java_timestamp_formats` are the Java time formats recognized in the time
fields. In future Ingest pipeline will switch to use this format.
fields. Elasticsearch mappings and Ingest pipeline use this format.
<10> The timestamp format in this sample doesn't specify a timezone, so to
accurately convert them to UTC timestamps to store in Elasticsearch it's
necessary to supply the timezone they relate to. `need_client_timezone`
@@ -1396,7 +1398,7 @@ this:
"field" : "timestamp",
"timezone" : "{{ beat.timezone }}",
"formats" : [
"ISO8601"
"8yyyy-MM-dd'T'HH:mm:ss,SSS"
]
}
},
@@ -1556,7 +1558,7 @@ this:
"field" : "timestamp",
"timezone" : "{{ beat.timezone }}",
"formats" : [
"ISO8601"
"8yyyy-MM-dd'T'HH:mm:ss,SSS"
]
}
},
Original file line number Diff line number Diff line change
@@ -149,7 +149,7 @@ static DelimitedFileStructureFinder makeDelimitedFileStructureFinder(List<String
.setJavaTimestampFormats(timeField.v2().javaTimestampFormats)
.setNeedClientTimezone(needClientTimeZone)
.setIngestPipeline(FileStructureUtils.makeIngestPipelineDefinition(null, timeField.v1(),
timeField.v2().jodaTimestampFormats, needClientTimeZone))
timeField.v2().javaTimestampFormats, needClientTimeZone))
.setMultilineStartPattern(timeLineRegex);
}

Original file line number Diff line number Diff line change
@@ -353,7 +353,7 @@ public static Map<String, Object> makeIngestPipelineDefinition(String grokPatter
if (needClientTimezone) {
dateProcessorSettings.put("timezone", "{{ " + BEAT_TIMEZONE_FIELD + " }}");
}
dateProcessorSettings.put("formats", timestampFormats);
dateProcessorSettings.put("formats", jodaBwcJavaTimestampFormatsForIngestPipeline(timestampFormats));
processors.add(Collections.singletonMap("date", dateProcessorSettings));
}

@@ -365,4 +365,19 @@ public static Map<String, Object> makeIngestPipelineDefinition(String grokPatter
pipeline.put(Pipeline.PROCESSORS_KEY, processors);
return pipeline;
}

// TODO: remove this method when Java time formats are the default
static List<String> jodaBwcJavaTimestampFormatsForIngestPipeline(List<String> javaTimestampFormats) {
return javaTimestampFormats.stream().map(format -> {
switch (format) {
case "ISO8601":
case "UNIX_MS":
case "UNIX":
case "TAI64N":
return format;
default:
return "8" + format;
}
}).collect(Collectors.toList());
}
}
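To try the prefixing rule outside Elasticsearch, the following is a self-contained restatement of the new method. The wrapper class `IngestFormatPrefixSketch` and the method name `withJavaTimePrefix` are hypothetical; the logic mirrors the switch in the diff, where the four named formats pass through unchanged and every other format gets an `8` prefix. Judging by the TODO comment, the prefix is a transitional marker that selects Java time parsing while Joda is still the default.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-alone wrapper, not part of the PR.
final class IngestFormatPrefixSketch {

    // Mirrors the switch in the diff: named formats are returned as-is,
    // pattern-style formats are prefixed with "8".
    static List<String> withJavaTimePrefix(List<String> javaTimestampFormats) {
        return javaTimestampFormats.stream().map(format -> {
            switch (format) {
                case "ISO8601":
                case "UNIX_MS":
                case "UNIX":
                case "TAI64N":
                    return format;
                default:
                    return "8" + format;
            }
        }).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> formats = Arrays.asList("ISO8601", "yyyy-MM-dd HH:mm:ss");
        System.out.println(withJavaTimePrefix(formats));
    }
}
```

The result for a pattern-style format has the same shape as the `"8yyyy-MM-dd HH:mm:ss"` entries in the updated documentation examples.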
Original file line number Diff line number Diff line change
@@ -63,7 +63,7 @@ static NdJsonFileStructureFinder makeNdJsonFileStructureFinder(List<String> expl
.setJavaTimestampFormats(timeField.v2().javaTimestampFormats)
.setNeedClientTimezone(needClientTimeZone)
.setIngestPipeline(FileStructureUtils.makeIngestPipelineDefinition(null, timeField.v1(),
timeField.v2().jodaTimestampFormats, needClientTimeZone));
timeField.v2().javaTimestampFormats, needClientTimeZone));
}

Tuple<SortedMap<String, Object>, SortedMap<String, FieldStats>> mappingsAndFieldStats =
Original file line number Diff line number Diff line change
@@ -123,7 +123,7 @@ static TextLogFileStructureFinder makeTextLogFileStructureFinder(List<String> ex
.setNeedClientTimezone(needClientTimeZone)
.setGrokPattern(grokPattern)
.setIngestPipeline(FileStructureUtils.makeIngestPipelineDefinition(grokPattern, interimTimestampField,
bestTimestamp.v1().jodaTimestampFormats, needClientTimeZone))
bestTimestamp.v1().javaTimestampFormats, needClientTimeZone))
.setMappings(mappings)
.setFieldStats(fieldStats)
.setExplanation(explanation)
Original file line number Diff line number Diff line change
@@ -457,13 +457,13 @@ public boolean hasTimezoneDependentParsing() {
* and possibly also a "format" setting.
*/
public Map<String, String> getEsDateMappingTypeWithFormat() {
if (jodaTimestampFormats.contains("TAI64N")) {
if (javaTimestampFormats.contains("TAI64N")) {
// There's no format for TAI64N in the timestamp formats used in mappings
return Collections.singletonMap(FileStructureUtils.MAPPING_TYPE_SETTING, "keyword");
}
Map<String, String> mapping = new LinkedHashMap<>();
mapping.put(FileStructureUtils.MAPPING_TYPE_SETTING, "date");
String formats = jodaTimestampFormats.stream().flatMap(format -> {
String formats = javaTimestampFormats.stream().flatMap(format -> {
switch (format) {
case "ISO8601":
return Stream.empty();
@@ -472,7 +472,8 @@ public Map<String, String> getEsDateMappingTypeWithFormat() {
case "UNIX":
return Stream.of("epoch_second");
default:
return Stream.of(format);
// TODO: remove the "8" prefix when Java time formats are the default
return Stream.of("8" + format);
}
}).collect(Collectors.joining("||"));
if (formats.isEmpty() == false) {
Expand Down
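The visible parts of this hunk can be sketched as a stand-alone method. This is only an approximation under stated assumptions: the class `DateMappingSketch` and method `dateMapping` are hypothetical, the keys `"type"` and `"format"` are taken from the documentation example rather than from the constants in `FileStructureUtils`, the body of the final `if` is assumed to store the joined formats, and the lines elided between the two hunks are not reproduced, so the sketch only covers `TAI64N`, `ISO8601`, `UNIX` and pattern-style formats.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical stand-alone sketch, not the method from the PR.
final class DateMappingSketch {

    static Map<String, String> dateMapping(List<String> javaTimestampFormats) {
        if (javaTimestampFormats.contains("TAI64N")) {
            // No mapping date format exists for TAI64N, so fall back to keyword.
            return Collections.singletonMap("type", "keyword");
        }
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("type", "date");
        String formats = javaTimestampFormats.stream().flatMap(format -> {
            switch (format) {
                case "ISO8601":
                    // Contributes no explicit format to the mapping.
                    return Stream.<String>empty();
                case "UNIX":
                    return Stream.of("epoch_second");
                default:
                    // Transitional "8" prefix, as in the diff.
                    return Stream.of("8" + format);
            }
        }).collect(Collectors.joining("||"));
        if (formats.isEmpty() == false) {
            // Assumption: the elided body stores the joined formats.
            mapping.put("format", formats);
        }
        return mapping;
    }

    public static void main(String[] args) {
        System.out.println(dateMapping(Arrays.asList("MMM dd yyyy HH:mm:ss", "MMM d yyyy HH:mm:ss")));
    }
}
```

When several formats match, they are joined with `||`, which is how alternative formats are expressed in an Elasticsearch date mapping.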
Original file line number Diff line number Diff line change
@@ -101,7 +101,7 @@ static XmlFileStructureFinder makeXmlFileStructureFinder(List<String> explanatio
.setJavaTimestampFormats(timeField.v2().javaTimestampFormats)
.setNeedClientTimezone(needClientTimeZone)
.setIngestPipeline(FileStructureUtils.makeIngestPipelineDefinition(null, topLevelTag + "." + timeField.v1(),
timeField.v2().jodaTimestampFormats, needClientTimeZone));
timeField.v2().javaTimestampFormats, needClientTimeZone));
}

Tuple<SortedMap<String, Object>, SortedMap<String, FieldStats>> mappingsAndFieldStats =
Original file line number Diff line number Diff line change
@@ -39,7 +39,7 @@ public void testGuessTimestampGivenSingleSampleSingleField() {
EMPTY_OVERRIDES, NOOP_TIMEOUT_CHECKER);
assertNotNull(match);
assertEquals("field1", match.v1());
assertThat(match.v2().jodaTimestampFormats, contains("ISO8601"));
assertThat(match.v2().javaTimestampFormats, contains("yyyy-MM-dd'T'HH:mm:ss,SSS"));
assertEquals("TIMESTAMP_ISO8601", match.v2().grokPatternName);
}

@@ -52,7 +52,7 @@ public void testGuessTimestampGivenSingleSampleSingleFieldAndConsistentTimeField
overrides, NOOP_TIMEOUT_CHECKER);
assertNotNull(match);
assertEquals("field1", match.v1());
assertThat(match.v2().jodaTimestampFormats, contains("ISO8601"));
assertThat(match.v2().javaTimestampFormats, contains("yyyy-MM-dd'T'HH:mm:ss,SSS"));
assertEquals("TIMESTAMP_ISO8601", match.v2().grokPatternName);
}

@@ -77,20 +77,20 @@ public void testGuessTimestampGivenSingleSampleSingleFieldAndConsistentTimeForma
overrides, NOOP_TIMEOUT_CHECKER);
assertNotNull(match);
assertEquals("field1", match.v1());
assertThat(match.v2().jodaTimestampFormats, contains("ISO8601"));
assertThat(match.v2().javaTimestampFormats, contains("yyyy-MM-dd'T'HH:mm:ss,SSS"));
assertEquals("TIMESTAMP_ISO8601", match.v2().grokPatternName);
}

public void testGuessTimestampGivenSingleSampleSingleFieldAndImpossibleTimeFormatOverride() {

FileStructureOverrides overrides = FileStructureOverrides.builder().setTimestampFormat("EEE MMM dd HH:mm:ss YYYY").build();
FileStructureOverrides overrides = FileStructureOverrides.builder().setTimestampFormat("EEE MMM dd HH:mm:ss yyyy").build();

Map<String, String> sample = Collections.singletonMap("field1", "2018-05-24T17:28:31,735");
IllegalArgumentException e = expectThrows(IllegalArgumentException.class,
() -> FileStructureUtils.guessTimestampField(explanation, Collections.singletonList(sample), overrides,
NOOP_TIMEOUT_CHECKER));

assertEquals("Specified timestamp format [EEE MMM dd HH:mm:ss YYYY] does not match for record [{field1=2018-05-24T17:28:31,735}]",
assertEquals("Specified timestamp format [EEE MMM dd HH:mm:ss yyyy] does not match for record [{field1=2018-05-24T17:28:31,735}]",
e.getMessage());
}

@@ -101,7 +101,7 @@ public void testGuessTimestampGivenSamplesWithSameSingleTimeField() {
EMPTY_OVERRIDES, NOOP_TIMEOUT_CHECKER);
assertNotNull(match);
assertEquals("field1", match.v1());
assertThat(match.v2().jodaTimestampFormats, contains("ISO8601"));
assertThat(match.v2().javaTimestampFormats, contains("yyyy-MM-dd'T'HH:mm:ss,SSS"));
assertEquals("TIMESTAMP_ISO8601", match.v2().grokPatternName);
}

@@ -130,7 +130,7 @@ public void testGuessTimestampGivenSingleSampleManyFieldsOneTimeFormat() {
EMPTY_OVERRIDES, NOOP_TIMEOUT_CHECKER);
assertNotNull(match);
assertEquals("time", match.v1());
assertThat(match.v2().jodaTimestampFormats, contains("YYYY-MM-dd HH:mm:ss,SSS"));
assertThat(match.v2().javaTimestampFormats, contains("yyyy-MM-dd HH:mm:ss,SSS"));
assertEquals("TIMESTAMP_ISO8601", match.v2().grokPatternName);
}

@@ -147,7 +147,7 @@ public void testGuessTimestampGivenSamplesWithManyFieldsSameSingleTimeFormat() {
EMPTY_OVERRIDES, NOOP_TIMEOUT_CHECKER);
assertNotNull(match);
assertEquals("time", match.v1());
assertThat(match.v2().jodaTimestampFormats, contains("YYYY-MM-dd HH:mm:ss,SSS"));
assertThat(match.v2().javaTimestampFormats, contains("yyyy-MM-dd HH:mm:ss,SSS"));
assertEquals("TIMESTAMP_ISO8601", match.v2().grokPatternName);
}

@@ -178,7 +178,7 @@ public void testGuessTimestampGivenSamplesWithManyFieldsSameSingleTimeFormatDist
EMPTY_OVERRIDES, NOOP_TIMEOUT_CHECKER);
assertNotNull(match);
assertEquals("time", match.v1());
assertThat(match.v2().jodaTimestampFormats, contains("YYYY-MM-dd HH:mm:ss,SSS"));
assertThat(match.v2().javaTimestampFormats, contains("yyyy-MM-dd HH:mm:ss,SSS"));
assertEquals("TIMESTAMP_ISO8601", match.v2().grokPatternName);
}

@@ -195,7 +195,7 @@ public void testGuessTimestampGivenSamplesWithManyFieldsSameSingleTimeFormatDist
EMPTY_OVERRIDES, NOOP_TIMEOUT_CHECKER);
assertNotNull(match);
assertEquals("time", match.v1());
assertThat(match.v2().jodaTimestampFormats, contains("MMM dd YYYY HH:mm:ss", "MMM d YYYY HH:mm:ss"));
assertThat(match.v2().javaTimestampFormats, contains("MMM dd yyyy HH:mm:ss", "MMM d yyyy HH:mm:ss"));
assertEquals("CISCOTIMESTAMP", match.v2().grokPatternName);
}

@@ -228,7 +228,7 @@ public void testGuessTimestampGivenSamplesWithManyFieldsInconsistentAndConsisten
EMPTY_OVERRIDES, NOOP_TIMEOUT_CHECKER);
assertNotNull(match);
assertEquals("time2", match.v1());
assertThat(match.v2().jodaTimestampFormats, contains("MMM dd YYYY HH:mm:ss", "MMM d YYYY HH:mm:ss"));
assertThat(match.v2().javaTimestampFormats, contains("MMM dd yyyy HH:mm:ss", "MMM d yyyy HH:mm:ss"));
assertEquals("CISCOTIMESTAMP", match.v2().grokPatternName);
}

@@ -331,7 +331,8 @@ public void testGuessMappingsAndCalculateFieldStats() {
assertEquals(Collections.singletonMap(FileStructureUtils.MAPPING_TYPE_SETTING, "keyword"), mappings.get("foo"));
Map<String, String> expectedTimeMapping = new HashMap<>();
expectedTimeMapping.put(FileStructureUtils.MAPPING_TYPE_SETTING, "date");
expectedTimeMapping.put(FileStructureUtils.MAPPING_FORMAT_SETTING, "YYYY-MM-dd HH:mm:ss,SSS");
// TODO: remove the "8" prefix when Java time formats are the default
expectedTimeMapping.put(FileStructureUtils.MAPPING_FORMAT_SETTING, "8" + "yyyy-MM-dd HH:mm:ss,SSS");
assertEquals(expectedTimeMapping, mappings.get("time"));
assertEquals(Collections.singletonMap(FileStructureUtils.MAPPING_TYPE_SETTING, "long"), mappings.get("bar"));
assertNull(mappings.get("nothing"));
@@ -354,7 +355,7 @@ public void testMakeIngestPipelineDefinitionGivenStructuredWithoutTimestamp() {
public void testMakeIngestPipelineDefinitionGivenStructuredWithTimestamp() {

String timestampField = randomAlphaOfLength(10);
List<String> timestampFormats = randomFrom(TimestampFormatFinder.ORDERED_CANDIDATE_FORMATS).jodaTimestampFormats;
List<String> timestampFormats = randomFrom(TimestampFormatFinder.ORDERED_CANDIDATE_FORMATS).javaTimestampFormats;
boolean needClientTimezone = randomBoolean();

Map<String, Object> pipeline =
@@ -371,7 +372,8 @@ public void testMakeIngestPipelineDefinitionGivenStructuredWithTimestamp() {
assertNotNull(dateProcessor);
assertEquals(timestampField, dateProcessor.get("field"));
assertEquals(needClientTimezone, dateProcessor.containsKey("timezone"));
assertEquals(timestampFormats, dateProcessor.get("formats"));
// TODO: remove the call to jodaBwcJavaTimestampFormatsForIngestPipeline() when Java time formats are the default
assertEquals(FileStructureUtils.jodaBwcJavaTimestampFormatsForIngestPipeline(timestampFormats), dateProcessor.get("formats"));

// After removing the two expected fields there should be nothing left in the pipeline
assertEquals(Collections.emptyMap(), pipeline);
@@ -382,7 +384,7 @@ public void testMakeIngestPipelineDefinitionGivenSemiStructured() {

String grokPattern = randomAlphaOfLength(100);
String timestampField = randomAlphaOfLength(10);
List<String> timestampFormats = randomFrom(TimestampFormatFinder.ORDERED_CANDIDATE_FORMATS).jodaTimestampFormats;
List<String> timestampFormats = randomFrom(TimestampFormatFinder.ORDERED_CANDIDATE_FORMATS).javaTimestampFormats;
boolean needClientTimezone = randomBoolean();

Map<String, Object> pipeline =
@@ -404,7 +406,8 @@ public void testMakeIngestPipelineDefinitionGivenSemiStructured() {
assertNotNull(dateProcessor);
assertEquals(timestampField, dateProcessor.get("field"));
assertEquals(needClientTimezone, dateProcessor.containsKey("timezone"));
assertEquals(timestampFormats, dateProcessor.get("formats"));
// TODO: remove the call to jodaBwcJavaTimestampFormatsForIngestPipeline() when Java time formats are the default
assertEquals(FileStructureUtils.jodaBwcJavaTimestampFormatsForIngestPipeline(timestampFormats), dateProcessor.get("formats"));

Map<String, Object> removeProcessor = (Map<String, Object>) processors.get(2).get("remove");
assertNotNull(removeProcessor);
Original file line number Diff line number Diff line change
@@ -357,7 +357,7 @@ public void testMostLikelyTimestampGivenExceptionTrace() {

public void testMostLikelyTimestampGivenExceptionTraceAndTimestampFormatOverride() {

FileStructureOverrides overrides = FileStructureOverrides.builder().setTimestampFormat("YYYY-MM-dd HH:mm:ss").build();
FileStructureOverrides overrides = FileStructureOverrides.builder().setTimestampFormat("yyyy-MM-dd HH:mm:ss").build();

Tuple<TimestampMatch, Set<String>> mostLikelyMatch =
TextLogFileStructureFinder.mostLikelyTimestamp(EXCEPTION_TRACE_SAMPLE.split("\n"), overrides, NOOP_TIMEOUT_CHECKER);