Json formatted datetime parsing #1301

sum12 · 2022-02-11T18:31:03Z

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

This PR reuses the CSV parser's date64 implementation (which parses date64 values using a format string) for the JSON reader. However, the JSON reader takes its format strings from each field's metadata. This differs from the CSV parser's approach and is more flexible, since each field can have its own format string. With the need to have additional attr added to decoder/reader.

Are there any user-facing changes?

There is no change to the existing parsing flow. However, a key in each Field's metadata (format_string) is being hijacked for internal purposes. Not sure where to document this.

codecov-commenter · 2022-02-12T04:30:17Z

Codecov Report

Merging #1301 (2d47602) into master (8f7c56e) will increase coverage by 0.03%.
The diff coverage is 70.21%.

@@            Coverage Diff             @@
##           master    #1301      +/-   ##
==========================================
+ Coverage   83.01%   83.05%   +0.03%     
==========================================
  Files         180      181       +1     
  Lines       52731    52873     +142     
==========================================
+ Hits        43775    43912     +137     
- Misses       8956     8961       +5

Impacted Files	Coverage Δ
arrow/src/csv/reader.rs	`89.42% <ø> (+1.30%)`	⬆️
arrow/src/util/reader_parser.rs	`57.77% <57.77%> (ø)`
arrow/src/json/reader.rs	`83.46% <81.63%> (+0.07%)`	⬆️
arrow/src/datatypes/datatype.rs	`66.40% <0.00%> (-0.40%)`	⬇️
parquet/src/arrow/converter.rs	`63.96% <0.00%> (-0.39%)`	⬇️
parquet_derive/src/parquet_field.rs	`65.98% <0.00%> (-0.23%)`	⬇️
arrow/src/array/transform/mod.rs	`84.51% <0.00%> (-0.13%)`	⬇️
arrow/src/array/array_union.rs	`90.71% <0.00%> (-0.05%)`	⬇️
parquet/src/arrow/arrow_writer.rs	`97.56% <0.00%> (-0.02%)`	⬇️
arrow/src/array/builder.rs	`86.73% <0.00%> (-0.01%)`	⬇️
... and 5 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

arrow/src/json/reader.rs

alamb

Thanks @sum12

arrow/src/json/reader.rs

Dandandan · 2022-02-13T21:14:58Z

arrow/src/util/reader_parser.rs

+        }
+    }
+
+    fn parse_formatted(string: &str, format: &str) -> Option<i32> {


For performance, it probably makes sense to reuse the formatter between values, by e.g. using the format function: https://docs.rs/chrono/latest/chrono/naive/struct.NaiveDate.html#method.format

Storing intermediate results is definitely a good idea. However, I am not sure how to do such things in Rust. Each DataType would have a different type for its intermediate result, but parse_formatted (or anything new) would expect a unified interface. All these types could be unified under an enum; is that correct? I am not sure that is what should be done. Would it also be a breaking API change?

Ah, I think there is a misunderstanding here.

The method is not printing the date but parsing the string using the given format. I don't think https://docs.rs/chrono/latest/chrono/naive/struct.NaiveDate.html#method.format can be used here.

Perhaps what @Dandandan was alluding to was to try to avoid the cost of parsing the format string for each element (e.g. try to reuse StrftimeItems somehow rather than recreating them all the time in https://docs.rs/chrono/latest/src/chrono/datetime.rs.html#386-390)

However, it is probably worth noting that master already does the "parse the format string each time", so it is probably ok to leave that behavior in this PR (we could file a follow-on issue for someone to look at if they are interested)
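
To make the reuse idea concrete, here is a minimal sketch of parsing a format string once and then applying the result to many values. It uses only the standard library and is not chrono's API: Piece, compile, take_digits and parse_with are hypothetical names, and the toy only understands %Y, %m, %d, %% and literal characters. With chrono, the analogous step would presumably be keeping the parsed format items around instead of rebuilding them per value.

```rust
/// One piece of a compiled format string (toy subset, not chrono's `Item`).
#[derive(Debug, Clone, PartialEq)]
enum Piece {
    Year,          // %Y: exactly four digits
    Month,         // %m: exactly two digits
    Day,           // %d: exactly two digits
    Literal(char), // any other character, matched exactly
}

/// Parse the format string ONCE into a reusable list of pieces.
/// Returns `None` for specifiers outside the toy subset.
fn compile(format: &str) -> Option<Vec<Piece>> {
    let mut pieces = Vec::new();
    let mut chars = format.chars();
    while let Some(c) = chars.next() {
        if c != '%' {
            pieces.push(Piece::Literal(c));
            continue;
        }
        pieces.push(match chars.next()? {
            'Y' => Piece::Year,
            'm' => Piece::Month,
            'd' => Piece::Day,
            '%' => Piece::Literal('%'),
            _ => return None,
        });
    }
    Some(pieces)
}

/// Take exactly `n` ASCII digits from the front of `input` and advance it.
fn take_digits(input: &mut &str, n: usize) -> Option<u32> {
    let s = *input; // copy the inner &str so the borrow is not tied to `input`
    let head = s.get(..n)?;
    if !head.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    *input = &s[n..];
    head.parse().ok()
}

/// Apply already-compiled pieces to one value, returning (year, month, day).
/// No calendar validation is done here; this only shows the reuse pattern.
fn parse_with(pieces: &[Piece], mut input: &str) -> Option<(u32, u32, u32)> {
    let (mut y, mut m, mut d) = (None, None, None);
    for piece in pieces {
        match piece {
            Piece::Year => y = Some(take_digits(&mut input, 4)?),
            Piece::Month => m = Some(take_digits(&mut input, 2)?),
            Piece::Day => d = Some(take_digits(&mut input, 2)?),
            Piece::Literal(c) => input = input.strip_prefix(*c)?,
        }
    }
    if !input.is_empty() {
        return None; // trailing characters the format did not account for
    }
    Some((y?, m?, d?))
}

fn main() {
    // The format is compiled once, outside the per-value loop.
    let pieces = compile("%d/%m/%Y").expect("format within the toy subset");
    for value in ["11/02/2022", "28/02/2022", "not a date"] {
        println!("{} -> {:?}", value, parse_with(&pieces, value));
    }
}
```

The point of the split is that compile runs once per column while parse_with runs once per value, which is the cost the comment above suggests avoiding.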

sum12 · 2022-02-28T08:41:27Z

Bump :-)

alamb · 2022-02-28T14:34:30Z

Sorry @sum12 -- I have been away for this last week. I'll put this PR on my queue to review more carefully shortly

alamb

Sorry for the delay @sum12

I think the core of this PR is reasonable.

I am concerned about using an undocumented "format_string" in the schema metadata to control parsing of dates, especially since
https://docs.rs/arrow/9.1.0/arrow/csv/reader/struct.Reader.html#method.new already has a

datetime_format: Option<String>

parameter so this PR would make the Json and CSV reader interfaces inconsistent.

Have you considered extending JsonReader::new() to take a map of field name to format string instead?
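
As a rough illustration of this suggestion, the sketch below shows a map from field name to format string carried on a reader-options value instead of in the schema metadata. ReaderOptionsSketch, with_format_strings and format_for are hypothetical names chosen for this example; they are not the arrow API, and the real constructor signature is not shown in this thread.

```rust
use std::collections::HashMap;

/// Hypothetical options struct; the name and field are illustrative only.
#[derive(Debug, Default, Clone)]
struct ReaderOptionsSketch {
    /// Column name -> format string used when that column holds date strings.
    format_strings: Option<HashMap<String, String>>,
}

impl ReaderOptionsSketch {
    /// Builder-style setter, so existing callers that pass no map are unaffected.
    fn with_format_strings(mut self, map: HashMap<String, String>) -> Self {
        self.format_strings = Some(map);
        self
    }

    /// Look up the format for one column; `None` means "use default parsing".
    fn format_for(&self, column: &str) -> Option<&str> {
        self.format_strings.as_ref()?.get(column).map(String::as_str)
    }
}

fn main() {
    let mut map = HashMap::new();
    map.insert("created".to_string(), "%d/%m/%Y".to_string());
    let opts = ReaderOptionsSketch::default().with_format_strings(map);

    // A column with an entry gets its own format; others fall back to default.
    println!("{:?}", opts.format_for("created"));
    println!("{:?}", opts.format_for("id"));
}
```

Keeping the map on the reader side leaves the schema metadata untouched, which is the inconsistency concern raised above.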

arrow/src/csv/reader.rs

arrow/src/json/reader.rs

alamb · 2022-02-28T22:15:05Z

arrow/src/json/reader.rs

        ))
    }

+    #[allow(clippy::unnecessary_wraps)]


I know this is just copy/pasted from build_primitive_array, but I think we could follow this clippy lint rather than ignore it (perhaps as a follow-on PR)
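
For readers unfamiliar with the lint, here is a small sketch of what following clippy::unnecessary_wraps means. The two functions are hypothetical stand-ins, not the actual build_primitive_array code: the first always returns Ok, which is the shape the lint flags, and the second returns the value directly.

```rust
/// Shape flagged by `clippy::unnecessary_wraps`: the function can never
/// return `Err`, yet every caller still has to handle a `Result`.
#[allow(clippy::unnecessary_wraps)]
fn build_wrapped(values: &[Option<i64>]) -> Result<Vec<i64>, String> {
    Ok(values.iter().map(|v| v.unwrap_or(0)).collect())
}

/// Following the lint instead: return the value directly.
fn build_plain(values: &[Option<i64>]) -> Vec<i64> {
    values.iter().map(|v| v.unwrap_or(0)).collect()
}

fn main() {
    let input = [Some(1), None, Some(3)];
    // Both produce the same data; only the signature differs.
    assert_eq!(build_wrapped(&input).unwrap(), build_plain(&input));
    println!("{:?}", build_plain(&input));
}
```

Whether the real function can drop its Result depends on how its callers use it, which is presumably why a follow-on PR is suggested.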

alamb · 2022-02-28T22:20:52Z

arrow/src/util/reader_parser.rs

+        }
+    }
+
+    fn parse_formatted(string: &str, format: &str) -> Option<i32> {


Perhaps what @Dandandan was alluding to was to try to avoid the cost of parsing the format string for each element (e.g. try to reuse StrftimeItems somehow rather than recreating them all the time in https://docs.rs/chrono/latest/src/chrono/datetime.rs.html#386-390)

However, it is probably worth noting that master already does the "parse the format string each time", so it is probably ok to leave that behavior in this PR (we could file a follow-on issue for someone to look at if they are interested)

arrow/src/json/reader.rs

alamb · 2022-02-28T22:25:27Z

arrow/src/util/reader_parser.rs

+                use chrono::format::Fixed;
+                use chrono::format::StrftimeItems;
+                let fmt = StrftimeItems::new(format);
+                let has_zone = fmt.into_iter().any(|item| match item {


I realize this PR just moves this code around, but I wonder if we could reuse string_to_timestamp_nanos in a follow-on PR

https://sourcegraph.com/github.com/apache/arrow-rs/-/blob/arrow/src/compute/kernels/cast_utils.rs?L69&subtree=true
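
For context, the quoted snippet checks whether the format string contains a time-zone item. The sketch below shows the same idea with the standard library only, scanning the raw format text rather than chrono's parsed items. The function name has_zone and the list of specifiers treated as zone-bearing (%z, %:z, %#z, %Z, %+) are assumptions made for this example, not the code in the PR.

```rust
/// Returns true if the strftime-style format contains a zone specifier.
/// Toy scan over the raw text; assumed specifier list: %z, %:z, %#z, %Z, %+.
fn has_zone(format: &str) -> bool {
    let mut chars = format.chars().peekable();
    while let Some(c) = chars.next() {
        if c != '%' {
            continue;
        }
        // Skip the modifiers that may sit between `%` and `z` (`:` or `#`).
        while matches!(chars.peek().copied(), Some(':') | Some('#')) {
            chars.next();
        }
        match chars.next() {
            Some('z') | Some('Z') | Some('+') => return true,
            _ => {} // `%%` and every other specifier carry no zone
        }
    }
    false
}

fn main() {
    for fmt in ["%Y-%m-%d", "%Y-%m-%dT%H:%M:%S%z", "%d/%m/%Y %H:%M %:z"] {
        println!("{:?} has zone: {}", fmt, has_zone(fmt));
    }
}
```

A check like this would let a parser choose between zone-aware and naive parsing before touching any values.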

arrow/src/util/reader_parser.rs

sum12 · 2022-03-02T16:54:45Z

Awesome, thanks for the idea of extending json::Reader::new(). That way metadata remains free of pollution. I wonder if there is a better way to handle breaking API changes.

sum12 · 2022-03-02T18:07:49Z

Moved the code-shuffling part to a different PR. Hope it helps.

sum12 · 2022-03-15T08:05:29Z

Updated the PR with a format_strings map parameter.

The format_strings map's key is the column name. The value is used to parse date64/date32 values from JSON when the value read is a string. Also added tests for the formatted parser for the date{32,64} types in the JSON reader.

sum12 · 2022-03-16T12:48:30Z

@alamb, in hopes of making it lighter to review, I created a new PR. It is probably cleaner than this one.

alamb · 2022-03-29T18:34:13Z

Superseded by #1451, so closing this one

github-actions bot added the arrow Changes to the arrow crate label Feb 11, 2022

sum12 force-pushed the json-formatted-datetime-parsing branch from 7e5ab0e to 6624bca Compare February 12, 2022 01:14

Dandandan reviewed Feb 13, 2022

alamb reviewed Feb 13, 2022

Dandandan reviewed Feb 13, 2022

alamb reviewed Feb 28, 2022

sum12 mentioned this pull request Mar 2, 2022

Move csv Parser trait and its implementations to utils module #1385

Merged

sum12 force-pushed the json-formatted-datetime-parsing branch 3 times, most recently from 642519c to 2795da9 Compare March 14, 2022 21:06

added format strings (hashmap) to json reader

d39e76c

the format_string map's key is column name. The value will be used to parse the date64/date32 types from json if the read value is of string type

add tests for formatted parser for date{32,64}type for json readers

sum12 force-pushed the json-formatted-datetime-parsing branch from 2795da9 to d39e76c Compare March 16, 2022 08:52

sum12 mentioned this pull request Mar 16, 2022

Add Json DecoderOptions and support custom format_string for each field #1451

Merged

alamb closed this Mar 29, 2022
