Skip to content

Conversation

@sum12
Copy link
Contributor

@sum12 sum12 commented Mar 16, 2022

Which issue does this PR close?

add implementation for remaining basic types as needed by the json::decoder

Rationale for this change

This approach is seem cleaner than the on e initially proposed in #1301

What changes are included in this PR?

Non-Formatted parsing for timestamp types is included, but formatted parsing is still missing and can be done as a followup PR.
When serde thinks that the value is a string then these new parsers will kick in and parse the string using provided format or some predefined set of format strings (for timestamp types)
for time32/time64 based types no format is specified

Are there any user-facing changes?

A new format_strings parameter is added to the reader and decoder

@github-actions github-actions bot added the arrow Changes to the arrow crate label Mar 16, 2022
@alamb
Copy link
Contributor

alamb commented Mar 22, 2022

Will try and check this out tomorrow

@alamb alamb added the api-change Changes to the arrow API label Mar 29, 2022
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sum12 -- sorry for the delayed review

//! let file = File::open("test/data/basic.json").unwrap();
//!
//! let mut json = json::Reader::new(BufReader::new(file), Arc::new(schema), 1024, None);
//! let mut json = json::Reader::new(BufReader::new(file), Arc::new(schema), 1024, None, None);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 it is probably time to make some sort of JsonReaderOptions struct so we can add new options without changing the public API

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a DecoderOptions as decoder is like to get those new options in the future than the reader.

schema: SchemaRef,
batch_size: usize,
projection: Option<Vec<String>>,
format_strings: Option<HashMap<String, String>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't help but wonder if we would be better off allowing users to provide their own parsers to support the most general case. For example, what if this was

Suggested change
format_strings: Option<HashMap<String, String>>,
formaters: Option<HashMap<String, Arc<dyn Parser>>,

And then the user could pass whatever custom parser they wanted to the JSON reader?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I often wondered too. Wouldn't it allow the parser to diverge from schema ? We would need a runtime validation of what schema says and formaters actually return.

@codecov-commenter
Copy link

codecov-commenter commented Apr 2, 2022

Codecov Report

Merging #1451 (8427ce6) into master (08f386a) will increase coverage by 0.00%.
The diff coverage is 80.59%.

@@           Coverage Diff           @@
##           master    #1451   +/-   ##
=======================================
  Coverage   82.77%   82.78%           
=======================================
  Files         190      190           
  Lines       54827    54862   +35     
=======================================
+ Hits        45385    45416   +31     
- Misses       9442     9446    +4     
Impacted Files Coverage Δ
arrow/src/util/reader_parser.rs 67.39% <33.33%> (+2.17%) ⬆️
arrow/src/json/reader.rs 83.65% <87.93%> (+0.11%) ⬆️
arrow/src/datatypes/datatype.rs 66.40% <0.00%> (-0.40%) ⬇️
arrow/src/array/transform/mod.rs 86.35% <0.00%> (-0.12%) ⬇️
parquet_derive/src/parquet_field.rs 66.21% <0.00%> (+0.22%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 08f386a...8427ce6. Read the comment docs.

sum12 added 2 commits April 2, 2022 14:49
the format_string map's key is column name.
The value will be used to parse the date64/date32 types from json
if the read value is of string type

add tests for formatted parser for date{32,64}type for json readers

all-parsers start

fixup! added format strings (hashmap) to json reader
that way later extensions to the decoder can be added to this struct
without breaking API.
@sum12
Copy link
Contributor Author

sum12 commented Apr 10, 2022

@alamb re-review please :-)

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am sorry for the delay in reviewing @sum12

I think this one is almost ready to go. Thank you for pulling out out the options

I merged to master and pushed a commit to fix the comments

pub struct Decoder {
/// Explicit schema for the JSON file
schema: SchemaRef,
/// Optional projection for which columns to load (case-sensitive names)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you may have deleted the wrong comment here the /// Batch size .. comment should remain and the /// Optional projection... should be removed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed it in 8427ce6

}
}

impl Parser for Time64NanosecondType {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do this, doesn't it mean that integers will be parsed as time64s? Is that what you want to allow? If so can you please explain / write a test that covers the behavior (so we don't accidentally break it in the future?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for pointing that out. Have added tests to parse string from json files to these types. iiuc this latest commit could be also cherry picked to your branch alamb:alamb/batch_size_too ? and this one can be closed.

@alamb alamb changed the title Allow json reader/decoder to work with format_strings for each field and use Parser trait correspondingly Allow json reader/decoder to work with format_strings for each field Apr 12, 2022
@alamb
Copy link
Contributor

alamb commented Apr 12, 2022

Looks good -- thanks @sum12

@alamb alamb merged commit 68038f5 into apache:master Apr 12, 2022
@alamb alamb changed the title Allow json reader/decoder to work with format_strings for each field Add Json DecoderOptions and allow custom format_string for each field Apr 15, 2022
@alamb alamb changed the title Add Json DecoderOptions and allow custom format_string for each field Add Json DecoderOptions and support custom format_string for each field Apr 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api-change Changes to the arrow API arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants