-
Notifications
You must be signed in to change notification settings - Fork 1k
Json formatted datetime parsing #1301
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
7e5ab0e to
6624bca
Compare
Codecov Report
@@ Coverage Diff @@
## master #1301 +/- ##
==========================================
+ Coverage 83.01% 83.05% +0.03%
==========================================
Files 180 181 +1
Lines 52731 52873 +142
==========================================
+ Hits 43775 43912 +137
- Misses 8956 8961 +5
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @sum12
arrow/src/util/reader_parser.rs
Outdated
| } | ||
| } | ||
|
|
||
| fn parse_formatted(string: &str, format: &str) -> Option<i32> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For performance, it probably makes sense to reuse the formatter between values, by e.g. using the format function: https://docs.rs/chrono/latest/chrono/naive/struct.NaiveDate.html#method.format
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Storing intermediate results is definitely a good idea. However not sure how to do such things in rust. Each DataType would have a different type for intermediate result. But parse_formatted (or anything new) would expect a unified interface. All these types could be unified under a enum, is that correct ? Not sure if that is what should be done, also it would be a breaking API change ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, i think there is a mis-understanding here.
The method is not printing the date but parsing the string using the given format. I dont think https://docs.rs/chrono/latest/chrono/naive/struct.NaiveDate.html#method.format can be used here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps what @Dandandan was alluding to was to try and avoid the cost of parsing the format string for each element (e.g. try and reuse StrftimeItems somehow rather than recreating them all the time in https://docs.rs/chrono/latest/src/chrono/datetime.rs.html#386-390)
However, it is probably worth nothing that master already does the "parse the format string each time" so it is probably ok to leave that behavior in this PR (we could file a follow on issue for someone to look at if they are interested)
|
Bump :-) |
|
Sorry @sum12 -- I have been away for this last week. I'll put this PR on my queue to review more carefully shortly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay @sum12
I think the core of this PR is reasonable.
I am concerned about using an undocumented "format_string" in the schema metadata to control parsing of dates, especially since
https://docs.rs/arrow/9.1.0/arrow/csv/reader/struct.Reader.html#method.new already has a
datetime_format: Option<String>
parameter so this PR would make the Json and CSV reader interfaces inconsistent.
Have you considered extending JsonReader::new() to take a map of field name to format string instead?
arrow/src/json/reader.rs
Outdated
| )) | ||
| } | ||
|
|
||
| #[allow(clippy::unnecessary_wraps)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is just copy/pasted from build_primitive_array, but I think we could follow this clippy lint rather than ignore it (perhaps as a follow on PR)
arrow/src/util/reader_parser.rs
Outdated
| } | ||
| } | ||
|
|
||
| fn parse_formatted(string: &str, format: &str) -> Option<i32> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps what @Dandandan was alluding to was to try and avoid the cost of parsing the format string for each element (e.g. try and reuse StrftimeItems somehow rather than recreating them all the time in https://docs.rs/chrono/latest/src/chrono/datetime.rs.html#386-390)
However, it is probably worth nothing that master already does the "parse the format string each time" so it is probably ok to leave that behavior in this PR (we could file a follow on issue for someone to look at if they are interested)
arrow/src/util/reader_parser.rs
Outdated
| use chrono::format::Fixed; | ||
| use chrono::format::StrftimeItems; | ||
| let fmt = StrftimeItems::new(format); | ||
| let has_zone = fmt.into_iter().any(|item| match item { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize this PR just moves this code around, but I wonder if we could reuse string_to_timestamp_nanos as a follow on PR
|
Awesome, thanks for the idea of extending |
|
Moved the code shuffling part to a different PR. Hope it helps |
642519c to
2795da9
Compare
|
Updated the PR with |
the format_string map's key is column name.
The value will be used to parse the date64/date32 types from json
if the read value is of string type
add tests for formatted parser for date{32,64}type for json readers
2795da9 to
d39e76c
Compare
|
@alamb, In hopes on making it ligher to review, created a new PR. It probably is cleaner than this one. |
|
Superceded by #1451 so closing this one |
Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
The PR reuses the csv parser implementation for date64 (which implements date64 parsing using a format string) for json reader. However the json reader derives its values from the fields metadata. This is a different approach as compared to the csv parser and is more flexible as it gives the ability to have different format string for each field. With the need to have additional attr added to decoder/reader.
Are there any user-facing changes?
There is no change in the existing parsing flow, however Fields' metadata's key (
format_string) is being hijacked for internal purposes. Not sure where to document this.