-
Notifications
You must be signed in to change notification settings - Fork 996
Add ADJUSTED_TO_UTC_KEY constant and add it to field metadata when reading TIME logical types #8028
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
ret.metadata_mut() | ||
.insert(ADJUSTED_TO_UTC_KEY.to_string(), String::new()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that we're not adding this if an arrow hint was found. We probably want to add it in both cases, right?
/// [`Field::metadata`]: arrow_schema::Field::metadata | ||
/// [`DataType::Time32`]: arrow_schema::DataType::Time32 | ||
/// [`DataType::Time64`]: arrow_schema::DataType::Time64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did I do these links right? How can I verify this? I've never done this before...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can run cargo doc
to test them locally -- the CI checks will do so as well, so if the CI is green, this will be good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @masonh22
The code makes sense to me, but I am a little unclear about your usecase. If you need to find out which columns from a parquet file had the adjusted utc flag, perhaps you could just look at the paquet schema for exmaple
https://docs.rs/parquet/latest/parquet/schema/types/struct.SchemaDescriptor.html
And then trace down to the logical type and https://docs.rs/parquet/latest/parquet/basic/enum.LogicalType.html#variant.Time.field.is_adjusted_to_u_t_c 🤔
/// [`Field::metadata`]: arrow_schema::Field::metadata | ||
/// [`DataType::Time32`]: arrow_schema::DataType::Time32 | ||
/// [`DataType::Time64`]: arrow_schema::DataType::Time64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can run cargo doc
to test them locally -- the CI checks will do so as well, so if the CI is green, this will be good
@alamb good point! I will try that out. |
@alamb I tried this out and it is definitely possible, but it adds some complexity that I'd want to avoid. Specifically, when iterating over both arrow Still, I think the simplest solution for what I'm trying to do is putting the "adjusted_to_utc" key in the arrow field metadata. This goes along with how arrow-rs handles parquet field IDs, which is another piece of parquet-specific metadata that could be handled by the methods you describe. |
I think I am concerned that we will be adding additional overhead to all readers of schemas even when the feature is not widely used that being said, one more metadata key is not likely a huge deal 🤔 |
I agree, I wish there was a better way to do this. I'm guessing that adding a "zoned" boolean field to Time32 and Time64 is out of the question? Anyways, aside from the check for whether a field is a zoned Time type, this only adds overhead if the parquet file is using zoned timestamps. For "pure" arrow-rs users who aren't reading files from other writers, this will never come up. I'm fine with sitting on this for a bit if you or others want to take more time to consider this. This isn't a huge priority for me. |
Which issue does this PR close?
Rationale for this change
This lets users detect when a Time32 or Time64 has the adjusted_to_u_t_c property set in the parquet file. Previously this only worked if the file was written by arrow-rs and the reader chooses not to skip the arrow metadata.
What changes are included in this PR?
This adds documented constant for "adjusted_to_utc" so that users are aware of this functionality. This key gets added when making a
Field
from a parquet type if the logical type as the adjusted_to_u_t_c field set.Are these changes tested?
I updated
test_time_utc_roundtrip()
to skip the arrow metadata to simulate reading a file that doesn't have the metadata we expect or if a user chooses to skip the arrow metadata when reading.Are there any user-facing changes?
Yes: this adds a new constant "ADJUSTED_TO_UTC_KEY". It may be helpful to add documentation for this near where
DataType::Time32
/DataType::Time64
are defined.