-
Notifications
You must be signed in to change notification settings - Fork 1k
Add Json DecoderOptions and support custom format_string for each field
#1451
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Will try and check this out tomorrow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @sum12 -- sorry for the delayed review
arrow/src/json/reader.rs
Outdated
| //! let file = File::open("test/data/basic.json").unwrap(); | ||
| //! | ||
| //! let mut json = json::Reader::new(BufReader::new(file), Arc::new(schema), 1024, None); | ||
| //! let mut json = json::Reader::new(BufReader::new(file), Arc::new(schema), 1024, None, None); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 it is probably time to make some sort of JsonReaderOptions struct so we can add new options without changing the public API
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a DecoderOptions as decoder is like to get those new options in the future than the reader.
arrow/src/json/reader.rs
Outdated
| schema: SchemaRef, | ||
| batch_size: usize, | ||
| projection: Option<Vec<String>>, | ||
| format_strings: Option<HashMap<String, String>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't help but wonder if we would be better off allowing users to provide their own parsers to support the most general case. For example, what if this was
| format_strings: Option<HashMap<String, String>>, | |
| formaters: Option<HashMap<String, Arc<dyn Parser>>, |
And then the user could pass whatever custom parser they wanted to the JSON reader?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I often wondered too. Wouldn't it allow the parser to diverge from schema ? We would need a runtime validation of what schema says and formaters actually return.
Codecov Report
@@ Coverage Diff @@
## master #1451 +/- ##
=======================================
Coverage 82.77% 82.78%
=======================================
Files 190 190
Lines 54827 54862 +35
=======================================
+ Hits 45385 45416 +31
- Misses 9442 9446 +4
Continue to review full report at Codecov.
|
the format_string map's key is column name.
The value will be used to parse the date64/date32 types from json
if the read value is of string type
add tests for formatted parser for date{32,64}type for json readers
all-parsers start
fixup! added format strings (hashmap) to json reader
that way later extensions to the decoder can be added to this struct without breaking API.
|
@alamb re-review please :-) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am sorry for the delay in reviewing @sum12
I think this one is almost ready to go. Thank you for pulling out out the options
I merged to master and pushed a commit to fix the comments
arrow/src/json/reader.rs
Outdated
| pub struct Decoder { | ||
| /// Explicit schema for the JSON file | ||
| schema: SchemaRef, | ||
| /// Optional projection for which columns to load (case-sensitive names) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you may have deleted the wrong comment here the /// Batch size .. comment should remain and the /// Optional projection... should be removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fixed it in 8427ce6
| } | ||
| } | ||
|
|
||
| impl Parser for Time64NanosecondType {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do this, doesn't it mean that integers will be parsed as time64s? Is that what you want to allow? If so can you please explain / write a test that covers the behavior (so we don't accidentally break it in the future?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for pointing that out. Have added tests to parse string from json files to these types. iiuc this latest commit could be also cherry picked to your branch alamb:alamb/batch_size_too ? and this one can be closed.
|
Looks good -- thanks @sum12 |
DecoderOptions and allow custom format_string for each field
DecoderOptions and allow custom format_string for each field DecoderOptions and support custom format_string for each field
Which issue does this PR close?
add implementation for remaining basic types as needed by the json::decoder
Rationale for this change
This approach is seem cleaner than the on e initially proposed in #1301
What changes are included in this PR?
Non-Formatted parsing for timestamp types is included, but formatted parsing is still missing and can be done as a followup PR.
When
serdethinks that thevalueis astringthen these new parsers will kick in and parse the string using provided format or some predefined set of format strings (for timestamp types)for time32/time64 based types no format is specified
Are there any user-facing changes?
A new
format_stringsparameter is added to the reader and decoder