Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP Upgrade to arrow-rs/parquet 54.0.0 #13663

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Dec 5, 2024

Which issue does this PR close?

Related to

Rationale for this change

Prepare DataFusion for the next arrow-rs release

Also, help test to ensure that the next arrow-rs release has no regressions by testing with DataFusion before we release it

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added common Related to common crate core Core DataFusion crate labels Dec 5, 2024
@alamb alamb changed the title WIP Upgrade to next arrow-rs release WIP Upgrade to arrow-rs 54.0.0 Dec 5, 2024
@alamb alamb changed the title WIP Upgrade to arrow-rs 54.0.0 WIP Upgrade to arrow-rs/parquet 54.0.0 Dec 6, 2024
@alamb alamb force-pushed the alamb/upgrade_to_arrow branch from bd6517c to 44bb39d Compare December 6, 2024 21:33
@alamb alamb changed the title WIP Upgrade to arrow-rs/parquet 54.0.0 WIP Upgrade to arrow-rs/parquet 54.0.0 Dec 6, 2024
@alamb alamb force-pushed the alamb/upgrade_to_arrow branch from 3c5f100 to 51fdc1e Compare December 16, 2024 19:06
@github-actions github-actions bot added physical-expr Physical Expressions proto Related to proto crate labels Dec 16, 2024
@@ -1777,6 +1777,8 @@ fn round_trip_datatype() {
}
}

// TODO file a ticket about handling deprecated dict_id attributes
#[allow(deprecated)]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what to do here:

  1. Should we remove the field_id from DataFusion serialization code (as it is planned for removal eventually in DataFusion)?
  2. Should we un-deprecate field_id in arrow-rs?
  3. Something else?

@avantgardnerio, @thinkharderdev and @Dandandan , this test was added in

However, @brancz's upstream PR in arrow-rs has deprecated dict_id on Field and all related APIs with the idea it will be eventually removed (and become a detail of the IPC transport rather than visible in the Schema)

@thinkharderdev mentioned apache/arrow-rs#6873 (comment)

If people are using the dictionary ID it's probably for something that should properly be done with kv metadata instead.

So one possibility could be to provide some more documentation / writeup as an upgrade guide for arrow explaining how to use field metadata instead ?

Please advise

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If people are using the dictionary ID it's probably for something that should properly be done with kv metadata instead

I mostly meant that if you give people 32 bits of space someone, somewhere will use it as metadata if they don't need it for anything else. I don't really have any particular use case in mind.

I don't have any objection to removing the field from datafusion serde. When we added that test we were still fighting to try and ensure unique dict IDs but we've long since moved on and found better ways around it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe on second thought we should keep it around until it is fully removed from upstream?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that sounds like a reasonable approach. I'll update the comments to match if no one else has a usecase

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Also, #8457 (comment) suggests it was never important to set the dict id directly, just that it wasn't sent correctly on the wire and already called out what to do once we remove the dict id. There were in fact a number of dict fixes that were necessary for us to correctly communicate the dicts correctly over the wire as well.

@alamb alamb force-pushed the alamb/upgrade_to_arrow branch from fb080a2 to 2fe785b Compare December 21, 2024 02:18
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Dec 21, 2024
@@ -459,10 +459,6 @@ config_namespace! {
/// default parquet writer setting
pub statistics_enabled: Option<String>, transform = str::to_lowercase, default = Some("page".into())

/// (writing) Sets max statistics size for any column. If NULL, uses
/// default parquet writer setting
pub max_statistics_size: Option<usize>, default = Some(4096)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think simply removing the config option is pretty rough as you end up with errors liek this

External error: query failed: DataFusion error: Invalid or Unsupported Configuration: Config value "max_statistics_size" not found on ParquetOptions

The best thing to do would be to mark the config setting deprecated, but there wasn't an obvious way to do this with the existing macros. I will futz with the macros and see if I can make it deprecated somehow

@@ -321,6 +321,8 @@ impl TryFrom<&protobuf::Field> for Field {
fn try_from(field: &protobuf::Field) -> Result<Self, Self::Error> {
let datatype = field.arrow_type.as_deref().required("arrow_type")?;
let field = if field.dict_id != 0 {
// TODO file a ticket about handling deprecated dict_id attributes
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to update per apache/arrow-rs#6885 I think

@emilk
Copy link

emilk commented Jan 9, 2025

I would appreciate this PR getting in before the next datafusion release 🙏

@alamb alamb mentioned this pull request Jan 10, 2025

/// Sets max statistics size for the column path. If NULL, uses
/// default parquet options
pub max_statistics_size: Option<usize>, default = None
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is a mistake to remove this field directly -- we should put it back in and figure out some way to mark it deprecated

@alamb
Copy link
Contributor Author

alamb commented Jan 12, 2025

I would appreciate this PR getting in before the next datafusion release 🙏

Yes I hope to -- if anyone has time to help get it in shape / make a new PR it would be most appreciated

I think a good first step would be to upgrade DataFusion to use the new arrow / parquet versions and sprinkle some #allow(deprecated) when necessary

Then we can file follow on tickets to sort out exactly how to deal with the deprecated APIs as a follow on PR

I'll try to work on this over the next week if no one beats me to it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate core Core DataFusion crate physical-expr Physical Expressions proto Related to proto crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants