Support parquet canonical extension type roundtrip #8409
| _ => {} | ||
| } | ||
| } | ||
| if !meta.is_empty() { |
This is the actual fix for #7063
The existing code has a very subtle bug -- by calling ret.set_metadata here, it wipes out any metadata that try_with_extension_type had already attached to ret.
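To make the failure mode concrete, here is a minimal sketch using a toy `ToyField` type rather than the real arrow-rs `Field`. The names `ToyField`, `with_extension_name`, `attach_buggy`, and `attach_merged` are hypothetical, and the metadata keys are only illustrative. It assumes, as the comment above describes, that `set_metadata` replaces the whole map; the merge shown is one possible way to avoid the overwrite, not necessarily what this PR does.

```rust
use std::collections::HashMap;

/// Toy stand-in for an Arrow field (hypothetical, not the arrow-rs type).
#[derive(Debug, Default)]
struct ToyField {
    metadata: HashMap<String, String>,
}

impl ToyField {
    /// Replaces the whole metadata map, discarding any existing entries.
    fn set_metadata(&mut self, metadata: HashMap<String, String>) {
        self.metadata = metadata;
    }

    /// Records an extension type name in the field metadata.
    fn with_extension_name(mut self, name: &str) -> Self {
        self.metadata
            .insert("ARROW:extension:name".to_string(), name.to_string());
        self
    }
}

/// Buggy pattern: replacing the map drops the extension entry.
fn attach_buggy(mut field: ToyField, meta: HashMap<String, String>) -> ToyField {
    if !meta.is_empty() {
        field.set_metadata(meta);
    }
    field
}

/// One possible fix: merge the new entries into the existing map.
fn attach_merged(mut field: ToyField, meta: HashMap<String, String>) -> ToyField {
    field.metadata.extend(meta);
    field
}
```

With a field that already carries an extension name, `attach_buggy` returns a field whose metadata contains only the new entries, while `attach_merged` keeps both.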
965315a to c305f08
| match parquet_logical_type { | ||
| #[cfg(feature = "variant_experimental")] | ||
| LogicalType::Variant => { |
The build flags here are definitely the shortest path to the roundtrip you're aiming for... you could also consider an injection approach like:
pub trait ParquetArrowExtension {
    fn try_from_logical_type(&self, arrow_field: Field, logical_type: &LogicalType) -> Result<Option<Field>>;
    fn try_to_logical_type(&self, field: &Field) -> Result<Option<LogicalType>>;
}
...and maintain a registry of those in the reader/writer options. Then you don't need compile time flags to support the extensions (something like DataFusion or a derivative could wire it all together at runtime).
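A rough sketch of how such a registry might look, to show the shape of the idea only. Everything here is hypothetical: `ToyField`, `ToyLogicalType`, `JsonExtension`, and `ExtensionRegistry` are toy stand-ins, not types from the parquet or arrow crates, and the error type is simplified to `String`.

```rust
/// Toy stand-ins for the parquet logical type and Arrow field (hypothetical).
#[derive(Debug, Clone, PartialEq)]
enum ToyLogicalType {
    Json,
    Uuid,
}

#[derive(Debug, Clone, PartialEq)]
struct ToyField {
    name: String,
    extension_name: Option<String>,
}

/// Same shape as the trait suggested above, over the toy types.
trait ParquetArrowExtension {
    fn try_from_logical_type(
        &self,
        field: ToyField,
        logical_type: &ToyLogicalType,
    ) -> Result<Option<ToyField>, String>;
    fn try_to_logical_type(&self, field: &ToyField) -> Result<Option<ToyLogicalType>, String>;
}

/// Example extension that only handles JSON.
struct JsonExtension;

impl ParquetArrowExtension for JsonExtension {
    fn try_from_logical_type(
        &self,
        mut field: ToyField,
        logical_type: &ToyLogicalType,
    ) -> Result<Option<ToyField>, String> {
        if *logical_type == ToyLogicalType::Json {
            field.extension_name = Some("arrow.json".to_string());
            Ok(Some(field))
        } else {
            Ok(None)
        }
    }

    fn try_to_logical_type(&self, field: &ToyField) -> Result<Option<ToyLogicalType>, String> {
        if field.extension_name.as_deref() == Some("arrow.json") {
            Ok(Some(ToyLogicalType::Json))
        } else {
            Ok(None)
        }
    }
}

/// Runtime registry that could live in reader/writer options.
#[derive(Default)]
struct ExtensionRegistry {
    extensions: Vec<Box<dyn ParquetArrowExtension>>,
}

impl ExtensionRegistry {
    fn register(&mut self, ext: Box<dyn ParquetArrowExtension>) {
        self.extensions.push(ext);
    }

    /// Read path: the first extension that recognizes the logical type wins;
    /// otherwise the field is returned unchanged.
    fn annotate(&self, field: ToyField, lt: &ToyLogicalType) -> Result<ToyField, String> {
        for ext in &self.extensions {
            if let Some(annotated) = ext.try_from_logical_type(field.clone(), lt)? {
                return Ok(annotated);
            }
        }
        Ok(field)
    }

    /// Write path: ask each extension whether it maps the field to a logical type.
    fn logical_type_for(&self, field: &ToyField) -> Result<Option<ToyLogicalType>, String> {
        for ext in &self.extensions {
            if let Some(lt) = ext.try_to_logical_type(field)? {
                return Ok(Some(lt));
            }
        }
        Ok(None)
    }
}
```

In this sketch, a caller registers `JsonExtension` once and the reader and writer consult the registry instead of `#[cfg(feature = ...)]` branches; unrecognized logical types simply fall through.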
That is a good idea -- I think @scovich was discussing a registry type approach as well recently. I'll file a ticket to discuss the idea further
I filed a ticket to track this idea:
In my mind while the build flag approach in this PR is not ideal, it is no worse than what is on main today, though other people may disagree
53b86bf to c1b5e7a
| } | ||
| // TODO add other LogicalTypes here | ||
| _ => arrow_field, | ||
| #[cfg(feature = "arrow_canonical_extension_types")] |
The core change of this PR is moving the code that handles extension types from the schema module to the new extension module and putting it behind named functions.
| #[cfg(not(feature = "arrow_canonical_extension_types"))] | ||
| None, | ||
| ) | ||
| .with_logical_type(logical_type_for_fixed_size_binary(field)) |
this code is just moved into the extension module
| // arrow_schema.field(0).try_extension_type::<Json>()?, | ||
| // Json::default() | ||
| // ); | ||
| let arrow_schema = parquet_to_arrow_schema(&parquet_schema, None)?; |
it works!
| Field::new("string", DataType::Utf8, true), | ||
| Field::new("string_2", DataType::Utf8, true), | ||
| Field::new("json", DataType::Utf8, true), | ||
| json_field(), |
this now actually fails when the canonical extension types are enabled, because a JSON parquet field is now (correctly) annotated with the extension type field
Ok, I think this PR is ready for review!
mbrobbel left a comment
Thanks @alamb, this is great.
paleolimbot left a comment
Exciting!
Thanks @mbrobbel and @paleolimbot -- I am excited to have this stuff keep moving!
Which issue does this PR close?
VariantArray to parquet with Variant LogicalType #8408
Rationale for this change
I was trying to consolidate the parquet extension type code after #8408, and in so doing I believe I actually found (and fixed) the root cause of #7063 (I will point it out inline)
What changes are included in this PR?
Are these changes tested?
Yes
Are there any user-facing changes?
When reading parquet that is annotated with Json or UUID logical types, the resulting Arrow fields will also have the corresponding canonical extension types attached.
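As a rough illustration of what "attached" means here: Arrow records an extension type in field metadata under the `ARROW:extension:name` key, and the canonical names for these two types are `arrow.json` and `arrow.uuid`. The sketch below uses a toy `ToyLogicalType` enum and hypothetical helper functions, not the parquet crate's own types or this PR's code.

```rust
use std::collections::HashMap;

/// Metadata key Arrow uses to record an extension type name.
const EXTENSION_NAME_KEY: &str = "ARROW:extension:name";

/// Toy stand-in for a parquet logical type (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ToyLogicalType {
    Json,
    Uuid,
    Text,
}

/// Canonical extension name for a logical type, if it has one in this sketch.
fn canonical_extension_name(logical_type: ToyLogicalType) -> Option<&'static str> {
    match logical_type {
        ToyLogicalType::Json => Some("arrow.json"),
        ToyLogicalType::Uuid => Some("arrow.uuid"),
        ToyLogicalType::Text => None,
    }
}

/// Field metadata a reader might attach for the given logical type.
fn extension_metadata(logical_type: ToyLogicalType) -> HashMap<String, String> {
    let mut metadata = HashMap::new();
    if let Some(name) = canonical_extension_name(logical_type) {
        metadata.insert(EXTENSION_NAME_KEY.to_string(), name.to_string());
    }
    metadata
}
```

Code that compares schemas field by field (such as tests built with plain `Field::new`) may therefore see a mismatch for Json or UUID columns once the feature is enabled, since the metadata is no longer empty.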