Skip to content

Conversation

@sum12
Copy link
Contributor

@sum12 sum12 commented Feb 11, 2022

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

The PR reuses the csv parser implementation for date64 (which implements date64 parsing using a format string) for json reader. However the json reader derives its values from the fields metadata. This is a different approach as compared to the csv parser and is more flexible as it gives the ability to have different format string for each field. With the need to have additional attr added to decoder/reader.

Are there any user-facing changes?

There is no change in the existing parsing flow, however Fields' metadata's key (format_string) is being hijacked for internal purposes. Not sure where to document this.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Feb 11, 2022
@sum12 sum12 force-pushed the json-formatted-datetime-parsing branch from 7e5ab0e to 6624bca Compare February 12, 2022 01:14
@codecov-commenter
Copy link

codecov-commenter commented Feb 12, 2022

Codecov Report

Merging #1301 (2d47602) into master (8f7c56e) will increase coverage by 0.03%.
The diff coverage is 70.21%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1301      +/-   ##
==========================================
+ Coverage   83.01%   83.05%   +0.03%     
==========================================
  Files         180      181       +1     
  Lines       52731    52873     +142     
==========================================
+ Hits        43775    43912     +137     
- Misses       8956     8961       +5     
Impacted Files Coverage Δ
arrow/src/csv/reader.rs 89.42% <ø> (+1.30%) ⬆️
arrow/src/util/reader_parser.rs 57.77% <57.77%> (ø)
arrow/src/json/reader.rs 83.46% <81.63%> (+0.07%) ⬆️
arrow/src/datatypes/datatype.rs 66.40% <0.00%> (-0.40%) ⬇️
parquet/src/arrow/converter.rs 63.96% <0.00%> (-0.39%) ⬇️
parquet_derive/src/parquet_field.rs 65.98% <0.00%> (-0.23%) ⬇️
arrow/src/array/transform/mod.rs 84.51% <0.00%> (-0.13%) ⬇️
arrow/src/array/array_union.rs 90.71% <0.00%> (-0.05%) ⬇️
parquet/src/arrow/arrow_writer.rs 97.56% <0.00%> (-0.02%) ⬇️
arrow/src/array/builder.rs 86.73% <0.00%> (-0.01%) ⬇️
... and 5 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8f7c56e...2d47602. Read the comment docs.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sum12

}
}

fn parse_formatted(string: &str, format: &str) -> Option<i32> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For performance, it probably makes sense to reuse the formatter between values, by e.g. using the format function: https://docs.rs/chrono/latest/chrono/naive/struct.NaiveDate.html#method.format

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Storing intermediate results is definitely a good idea. However not sure how to do such things in rust. Each DataType would have a different type for intermediate result. But parse_formatted (or anything new) would expect a unified interface. All these types could be unified under a enum, is that correct ? Not sure if that is what should be done, also it would be a breaking API change ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, i think there is a mis-understanding here.

The method is not printing the date but parsing the string using the given format. I dont think https://docs.rs/chrono/latest/chrono/naive/struct.NaiveDate.html#method.format can be used here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps what @Dandandan was alluding to was to try and avoid the cost of parsing the format string for each element (e.g. try and reuse StrftimeItems somehow rather than recreating them all the time in https://docs.rs/chrono/latest/src/chrono/datetime.rs.html#386-390)

However, it is probably worth nothing that master already does the "parse the format string each time" so it is probably ok to leave that behavior in this PR (we could file a follow on issue for someone to look at if they are interested)

@sum12
Copy link
Contributor Author

sum12 commented Feb 28, 2022

Bump :-)

@alamb
Copy link
Contributor

alamb commented Feb 28, 2022

Sorry @sum12 -- I have been away for this last week. I'll put this PR on my queue to review more carefully shortly

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay @sum12

I think the core of this PR is reasonable.

I am concerned about using an undocumented "format_string" in the schema metadata to control parsing of dates, especially since
https://docs.rs/arrow/9.1.0/arrow/csv/reader/struct.Reader.html#method.new already has a

datetime_format: Option<String>

parameter so this PR would make the Json and CSV reader interfaces inconsistent.

Have you considered extending JsonReader::new() to take a map of field name to format string instead?

))
}

#[allow(clippy::unnecessary_wraps)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is just copy/pasted from build_primitive_array, but I think we could follow this clippy lint rather than ignore it (perhaps as a follow on PR)

}
}

fn parse_formatted(string: &str, format: &str) -> Option<i32> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps what @Dandandan was alluding to was to try and avoid the cost of parsing the format string for each element (e.g. try and reuse StrftimeItems somehow rather than recreating them all the time in https://docs.rs/chrono/latest/src/chrono/datetime.rs.html#386-390)

However, it is probably worth nothing that master already does the "parse the format string each time" so it is probably ok to leave that behavior in this PR (we could file a follow on issue for someone to look at if they are interested)

use chrono::format::Fixed;
use chrono::format::StrftimeItems;
let fmt = StrftimeItems::new(format);
let has_zone = fmt.into_iter().any(|item| match item {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize this PR just moves this code around, but I wonder if we could reuse string_to_timestamp_nanos as a follow on PR

https://sourcegraph.com/github.com/apache/arrow-rs/-/blob/arrow/src/compute/kernels/cast_utils.rs?L69&subtree=true

@sum12
Copy link
Contributor Author

sum12 commented Mar 2, 2022

Awesome, thanks for the idea of extending json::Reader::new(). That way metadata remains free of pollution. I wonder if there there a better way to handle breaking API changes

@sum12
Copy link
Contributor Author

sum12 commented Mar 2, 2022

Moved the code shuffling part to a different PR. Hope it helps

@sum12 sum12 force-pushed the json-formatted-datetime-parsing branch 3 times, most recently from 642519c to 2795da9 Compare March 14, 2022 21:06
@sum12
Copy link
Contributor Author

sum12 commented Mar 15, 2022

Updated the PR with format_strings map parameter

the format_string map's key is column name.
The value will be used to parse the date64/date32 types from json
if the read value is of string type

add tests for formatted parser for date{32,64}type for json readers
@sum12
Copy link
Contributor Author

sum12 commented Mar 16, 2022

@alamb, In hopes on making it ligher to review, created a new PR. It probably is cleaner than this one.

@alamb
Copy link
Contributor

alamb commented Mar 29, 2022

Superceded by #1451 so closing this one

@alamb alamb closed this Mar 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants