This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Add SchemaInferenceOptions options to infer_schema and option to configure int96 inference #1533

Merged

jaychia (Contributor) commented Aug 9, 2023

This PR addresses part 2 of #1527

It makes arrow2's Parquet schema inference configurable, so that the Timestamp type inferred from Parquet Int96 fields can be chosen by the user.

  1. Adds a new SchemaInferenceOptions struct that makes schema inference on Parquet files configurable
  2. Adds an int96_coerce_to_timeunit flag to configure the time unit of the arrow Timestamps inferred from Parquet int96 fields
  3. Adds *_with_options variants of the infer_schema and parquet_to_arrow_schema APIs that take in the options
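
To illustrate why the target time unit matters, here is a minimal, self-contained sketch. It does not use arrow2 and is not the code from this PR: the `TimeUnit` enum, the struct definition, the nanosecond default, and the helpers `resolve_int96_unit` and `int96_to_timestamp` are all assumptions made for illustration. It relies on the usual Int96 timestamp convention (nanoseconds within the day plus a Julian day number).

```rust
/// Hypothetical stand-in for arrow's time unit (not the arrow2 type itself).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TimeUnit {
    Second,
    Millisecond,
    Microsecond,
    Nanosecond,
}

/// Hypothetical mirror of the options struct described in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SchemaInferenceOptions {
    pub int96_coerce_to_timeunit: TimeUnit,
}

impl Default for SchemaInferenceOptions {
    fn default() -> Self {
        // Assumption: nanoseconds as the default, matching Int96's native resolution.
        Self { int96_coerce_to_timeunit: TimeUnit::Nanosecond }
    }
}

/// Pick the unit an Int96 column would be inferred as (illustrative helper).
pub fn resolve_int96_unit(options: Option<SchemaInferenceOptions>) -> TimeUnit {
    options.unwrap_or_default().int96_coerce_to_timeunit
}

/// Julian day number of the Unix epoch, 1970-01-01.
const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588;
const SECONDS_PER_DAY: i64 = 86_400;

/// Convert an Int96 value (nanoseconds within the day + Julian day) into a
/// Unix timestamp in `unit`. Returns `None` if the result overflows an i64.
pub fn int96_to_timestamp(nanos_of_day: i64, julian_day: i32, unit: TimeUnit) -> Option<i64> {
    let days = i64::from(julian_day) - JULIAN_DAY_OF_EPOCH;
    let seconds = days.checked_mul(SECONDS_PER_DAY)?;
    match unit {
        TimeUnit::Second => seconds.checked_add(nanos_of_day / 1_000_000_000),
        TimeUnit::Millisecond => seconds
            .checked_mul(1_000)?
            .checked_add(nanos_of_day / 1_000_000),
        TimeUnit::Microsecond => seconds
            .checked_mul(1_000_000)?
            .checked_add(nanos_of_day / 1_000),
        TimeUnit::Nanosecond => seconds
            .checked_mul(1_000_000_000)?
            .checked_add(nanos_of_day),
    }
}
```

In this sketch, a value for 3000-01-01 (Julian day 2816788) overflows an i64 when expressed in nanoseconds but fits when expressed in microseconds, which is the kind of case a configurable time unit is meant to handle.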

jaychia changed the title from "Add options to infer_schema" to "Add SchemaInferenceOptions options to infer_schema and option to configure int96 inference" on Aug 9, 2023
codecov bot commented Sep 5, 2023

Codecov Report

Patch coverage: 95.23% and project coverage change: +0.02% 🎉

Comparison is base (87ab844) 83.02% compared to head (4e4279f) 83.05%.

@@            Coverage Diff             @@
##             main    #1533      +/-   ##
==========================================
+ Coverage   83.02%   83.05%   +0.02%     
==========================================
  Files         391      391              
  Lines       42786    42866      +80     
==========================================
+ Hits        35523    35602      +79     
- Misses       7263     7264       +1     
Files Changed                            Coverage Δ
src/io/parquet/read/schema/convert.rs    94.68% <94.62%> (+0.47%) ⬆️
src/io/parquet/read/schema/mod.rs        100.00% <100.00%> (ø)

... and 5 files with indirect coverage changes


jaychia (Contributor, Author) commented Sep 5, 2023

Hi @ritchie46 and @sundy-li

Here's a follow-up PR to #1532

(see PR description for more details)

sundy-li (Collaborator) commented Sep 6, 2023

BTW, int96 seems to be deprecated in parquet, it's not a stable feature. https://issues.apache.org/jira/browse/PARQUET-323

jaychia (Contributor, Author) commented Sep 6, 2023

> BTW, int96 seems to be deprecated in parquet, it's not a stable feature. https://issues.apache.org/jira/browse/PARQUET-323

Indeed, but it is still widely used and supported by many systems for backwards-compatibility reasons.

Unfortunately, because Parquet is a long-lived format and many enterprises use old versions of data frameworks, these deprecated features tend to live on long after their deprecation :)

sundy-li merged commit fb7b5fe into jorgecarleitao:main on Sep 7, 2023
25 checks passed