Skip to content

Conversation

masonh22
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

This lets users detect when a Time32 or Time64 has the adjusted_to_u_t_c property set in the parquet file. Previously this only worked if the file was written by arrow-rs and the reader chooses not to skip the arrow metadata.

What changes are included in this PR?

This adds documented constant for "adjusted_to_utc" so that users are aware of this functionality. This key gets added when making a Field from a parquet type if the logical type as the adjusted_to_u_t_c field set.

Are these changes tested?

I updated test_time_utc_roundtrip() to skip the arrow metadata to simulate reading a file that doesn't have the metadata we expect or if a user chooses to skip the arrow metadata when reading.

Are there any user-facing changes?

Yes: this adds a new constant "ADJUSTED_TO_UTC_KEY". It may be helpful to add documentation for this near where DataType::Time32/DataType::Time64 are defined.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 30, 2025
Comment on lines +583 to +584
ret.metadata_mut()
.insert(ADJUSTED_TO_UTC_KEY.to_string(), String::new());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that we're not adding this if an arrow hint was found. We probably want to add it in both cases, right?

Comment on lines +228 to +230
/// [`Field::metadata`]: arrow_schema::Field::metadata
/// [`DataType::Time32`]: arrow_schema::DataType::Time32
/// [`DataType::Time64`]: arrow_schema::DataType::Time64
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did I do these links right? How can I verify this? I've never done this before...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can run cargo doc to test them locally -- the CI checks will do so as well, so if the CI is green, this will be good

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @masonh22

The code makes sense to me, but I am a little unclear about your usecase. If you need to find out which columns from a parquet file had the adjusted utc flag, perhaps you could just look at the paquet schema for exmaple

https://docs.rs/parquet/latest/parquet/schema/types/struct.SchemaDescriptor.html

And then trace down to the logical type and https://docs.rs/parquet/latest/parquet/basic/enum.LogicalType.html#variant.Time.field.is_adjusted_to_u_t_c 🤔

Comment on lines +228 to +230
/// [`Field::metadata`]: arrow_schema::Field::metadata
/// [`DataType::Time32`]: arrow_schema::DataType::Time32
/// [`DataType::Time64`]: arrow_schema::DataType::Time64
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can run cargo doc to test them locally -- the CI checks will do so as well, so if the CI is green, this will be good

@masonh22
Copy link
Contributor Author

masonh22 commented Aug 4, 2025

@alamb good point! I will try that out.

@masonh22
Copy link
Contributor Author

masonh22 commented Aug 5, 2025

@alamb I tried this out and it is definitely possible, but it adds some complexity that I'd want to avoid. Specifically, when iterating over both arrow Fields and parquet TypePtrs, you need to deal with the fact that list types can be represented multiple ways in parquet (see this section on the backward-compatibility rules for lists if you're not familiar). This would be easier if something like parquet::schema::complex::Visitor was public and more generic.

Still, I think the simplest solution for what I'm trying to do is putting the "adjusted_to_utc" key in the arrow field metadata. This goes along with how arrow-rs handles parquet field IDs, which is another piece of parquet-specific metadata that could be handled by the methods you describe.

@alamb
Copy link
Contributor

alamb commented Aug 6, 2025

Still, I think the simplest solution for what I'm trying to do is putting the "adjusted_to_utc" key in the arrow field metadata. This goes along with how arrow-rs handles parquet field IDs, which is another piece of parquet-specific metadata that could be handled by the methods you describe.

I think I am concerned that we will be adding additional overhead to all readers of schemas even when the feature is not widely used

that being said, one more metadata key is not likely a huge deal 🤔

@masonh22
Copy link
Contributor Author

masonh22 commented Aug 6, 2025

I think I am concerned that we will be adding additional overhead to all readers of schemas even when the feature is not widely used

I agree, I wish there was a better way to do this. I'm guessing that adding a "zoned" boolean field to Time32 and Time64 is out of the question?

Anyways, aside from the check for whether a field is a zoned Time type, this only adds overhead if the parquet file is using zoned timestamps. For "pure" arrow-rs users who aren't reading files from other writers, this will never come up.

I'm fine with sitting on this for a bit if you or others want to take more time to consider this. This isn't a huge priority for me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Better support for reading and writing UTC adjusted time types in parquet
2 participants