Infer and cache date field format instead of re-parsing it for every document #4558
Description
Parsing date fields with the default format uses a lot of CPU. A large share of date-handling time (close to 7.12% of CPU time in profiles) goes into parsing, mostly when the date format marks certain segments as optional. Customers rarely set an explicit date format and instead rely on the unoptimized default. When I changed the date format to a strict one for the same data set, indexing throughput increased by 8%.
For logs, the date format does not change from one log line to the next, so it is inefficient to work out the format again for every single document. For such users, we could infer and set a stricter date format after parsing a few documents.
Additionally, 7% CPU seems too high just for date parsing. Maybe the Java formatter has improved since I ran these tests. The CPU profile shows that most of the time goes into parsing the optional segments of the date.
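To make "optional segments" concrete, here is a small `java.time` sketch that contrasts a pattern with nested optional sections against a strict one. The class name and both patterns are made up for illustration and are not the actual default formatter; the point is only that the optional pattern has to decide at each `[` whether the section is present, while the strict pattern has a single path.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.TemporalAccessor;

// Hypothetical illustration class, not part of the code base.
class OptionalVsStrict {
    // Lenient shape: the time, the seconds and the millis are each optional.
    static final DateTimeFormatter OPTIONAL =
        DateTimeFormatter.ofPattern("uuuu-MM-dd['T'HH:mm[:ss[.SSS]]]");

    // Strict shape: every segment is required, so there is only one path to try.
    static final DateTimeFormatter STRICT =
        DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mm:ss.SSS");

    // With optional sections the caller does not know up front which
    // fields were present, so it asks for the richest type that fits.
    static TemporalAccessor parseLenient(String text) {
        return OPTIONAL.parseBest(text, LocalDateTime::from, LocalDate::from);
    }

    // With a strict pattern the result type is known in advance.
    static LocalDateTime parseStrict(String text) {
        return LocalDateTime.parse(text, STRICT);
    }
}
```

Both formatters accept a fully specified timestamp and produce the same value, which is why swapping in the strict one is safe once the data is known to be uniform.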
Solutions?
- We should definitely improve our documentation to clearly call out that the date mapping should be set to a stricter format if known well in advance.
- Infer the date parsing format when it is set to an optional one and reuse it across requests?
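
A rough sketch of what the second option could look like, using plain `java.time` rather than the real mapper code. Everything here is an assumption for illustration: the class name `CachingDateParser`, the list of candidate patterns, and the choice to cache after the first successful parse instead of after a few documents. If the cached formatter stops matching, it falls back to trying all candidates again.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.List;

// Hypothetical helper, not an existing class in the code base.
class CachingDateParser {
    // Strict candidate patterns (assumed for this sketch), tried in order.
    private static final List<DateTimeFormatter> CANDIDATES = List.of(
        DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mm:ss.SSS"),
        DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mm:ss"),
        DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss")
    );

    // Last formatter that matched; volatile so other threads see updates.
    private volatile DateTimeFormatter cached;

    LocalDateTime parse(String text) {
        DateTimeFormatter fast = cached;
        if (fast != null) {
            try {
                // Fast path: reuse the format inferred from earlier documents.
                return LocalDateTime.parse(text, fast);
            } catch (DateTimeParseException e) {
                // Format changed; fall through and infer it again.
            }
        }
        for (DateTimeFormatter candidate : CANDIDATES) {
            try {
                LocalDateTime result = LocalDateTime.parse(text, candidate);
                cached = candidate; // remember the match for later documents
                return result;
            } catch (DateTimeParseException e) {
                // Not this pattern; try the next candidate.
            }
        }
        throw new IllegalArgumentException("Unparseable date: " + text);
    }

    // Exposed so callers (and tests) can see what was inferred; null until a parse succeeds.
    DateTimeFormatter cachedFormatter() {
        return cached;
    }
}
```

The slow path relies on exceptions for control flow, which is acceptable only because it should run rarely when the log format is stable. A real implementation would also need to decide where the cache lives (per field, per index, per shard) and how it interacts with a user-supplied format.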