Skip to content

Conversation

@alamb
Copy link
Contributor

@alamb alamb commented Sep 22, 2025

Which issue does this PR close?

Rationale for this change

I was trying to consolidate the parquet extension type code after #8408, and in so doing I believe I actually found (and fixed) the root cause of #7063 (I will point it out inline)

What changes are included in this PR?

  1. Consolidate parquet<-->arrow extension type metadata mapping in one module
  2. Enable tests

Are these changes tested?

Yes

Are there any user-facing changes?

When reading parquet that is annotated with Json or UUID logical types, the resulting Arrow arrays will also have the canonical types attached.

@github-actions github-actions bot added parquet Changes to the parquet crate parquet-variant parquet-variant* crates labels Sep 22, 2025
_ => {}
}
}
if !meta.is_empty() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the actual fix for #7063

The existing code has a very subtle bug -- by calling ret.set_metadata here it wipes out any metadata that was attached to ret that was added by try_with_extension_type

@alamb alamb force-pushed the alamb/consolidate_extension_types branch from 965315a to c305f08 Compare September 22, 2025 20:47
Comment on lines 45 to 47
match parquet_logical_type {
#[cfg(feature = "variant_experimental")]
LogicalType::Variant => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The build flags here are definitely the shortest path to the roundtrip you're aiming for...you could also consider an injection approach like:

pub trait ParquetArrowExtension {
    fn try_from_logical_type(&self, mut arrow_field: Field, logical_type: &LogicalType) -> Result<Option<Field>>;
    fn try_to_logical_type(&self, &Field) -> Result<Option<LogicalType>>;
}

...and maintain a registry of those in the reader/writer options. Then you don't need compile time flags to support the extensions (something like DataFusion or a derivative could wire it all together at runtime).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good idea -- I think @scovich was discussing a registry type approach as well recently. I'll file a ticket to discuss the idea further

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I filed a ticket to track this idea:

In my mind while the build flag approach in this PR is not ideal, it is no worse than what is on main today, though other people may disagree

@alamb alamb force-pushed the alamb/consolidate_extension_types branch from 53b86bf to c1b5e7a Compare September 26, 2025 16:50
@github-actions github-actions bot removed the parquet-variant parquet-variant* crates label Sep 26, 2025
Comment on lines 45 to 47
match parquet_logical_type {
#[cfg(feature = "variant_experimental")]
LogicalType::Variant => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I filed a ticket to track this idea:

In my mind while the build flag approach in this PR is not ideal, it is no worse than what is on main today, though other people may disagree

}
// TODO add other LogicalTypes here
_ => arrow_field,
#[cfg(feature = "arrow_canonical_extension_types")]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The core change of this PR is moving the code that handles extension types from the schema module to the new extension module and put them behind some named functions.

#[cfg(not(feature = "arrow_canonical_extension_types"))]
None,
)
.with_logical_type(logical_type_for_fixed_size_binary(field))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code is just moved into the extension module

// arrow_schema.field(0).try_extension_type::<Json>()?,
// Json::default()
// );
let arrow_schema = parquet_to_arrow_schema(&parquet_schema, None)?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it works!

Field::new("string", DataType::Utf8, true),
Field::new("string_2", DataType::Utf8, true),
Field::new("json", DataType::Utf8, true),
json_field(),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this now actually fails when the canonical extension types are enabled, because a JSON parquet field is now (correctly) annotated with the extension type field

@alamb alamb marked this pull request as ready for review September 26, 2025 17:54
@alamb
Copy link
Contributor Author

alamb commented Sep 26, 2025

Ok, I think this PR is ready for review!

@mbrobbel mbrobbel added this to the 57.0.0 milestone Sep 26, 2025
Copy link
Member

@mbrobbel mbrobbel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alamb, this is great.

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exciting!

@alamb
Copy link
Contributor Author

alamb commented Sep 29, 2025

Thanks @mbrobbel and @paleolimbot -- I am exciting to have this stuff keep moving!

@alamb alamb merged commit 234cc15 into apache:main Sep 29, 2025
16 checks passed
@alamb alamb deleted the alamb/consolidate_extension_types branch September 29, 2025 19:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support mapping canonical extension types to Parquet LogicalTypes

3 participants