-
Notifications
You must be signed in to change notification settings - Fork 1k
[Variant] Implement read support for remaining primitive types #7644
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Variant] Implement read support for remaining primitive types #7644
Conversation
("primitive_string", Variant::String("This string is longer than 64 bytes and therefore does not fit in a short_string and it also includes several non ascii characters such as 🐢, 💖, ♥\u{fe0f}, 🎣 and 🤦!!")), | ||
//("primitive_timestamp", Variant::Null), | ||
//("primitive_timestampntz", Variant::Null), | ||
("primitive_timestamp", Variant::TimestampMicros(NaiveDate::from_ymd_opt(2025, 4, 16).unwrap().and_hms_milli_opt(16, 34, 56, 780).unwrap().and_utc())), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
regen.py
has this as 2025-04-16T12:34:56.78
whereas it appears to be 2025-04-16T16:34:56.78
in the test case primitive_timestamp.value
(which is the value that this test checks). I believe this is because the timezone is treated as local to the computer that ran the regen script.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran the regen script in the Americas/New_York timezone -- does that line up correctly?
If so, then I think we (I) can make a PR to parquet-testing to clarify that in the repo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those seem to be the same two values? Did I misread something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this lines up perfectly: the Eastern timezone is 4 hours behind UTC and the naive input timestamp is being shifted to UTC.
(The two timestamps are 4 hours apart.)
parquet-variant/src/variant.rs
Outdated
pub fn try_new(metadata: &'m VariantMetadata, value: &'v [u8]) -> Result<Self, ArrowError> { | ||
let header = *first_byte_from_slice(value)?; | ||
let new_self = match get_basic_type(header)? { | ||
VariantBasicType::Primitive => match get_primitive_type(header)? { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the decoders manually skip the first byte(header) in their implementation. It could makes sense to just pass in the value with this first byte removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had wondered the same thing... but there's also an argument to be made that the header is indeed part of the variant value. A missing first byte would be a problem if we ever needed to retrieve the full value.
Maybe something to track as the implementation and usage of this API matures? I don't think we have enough information to make a confident decision quite yet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was able to make this change by splitting value: &[u8]
into value_metadata: u8
and value_data: &[u8]
. Keeping track separately enabled simplifying the match.
} | ||
} | ||
|
||
pub fn as_int16(&self) -> Option<i16> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These as_*
methods could benefit from code coverage. Happy to add unit tests for these either in this PR or a future one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think it would be great to add tests
What I recommend doing is adding doc examples that will:
- Make it easier to explain what is happening
- Run as part of CI
Something like
let variant_i32 = Variant::from(1337i32);
// you can read an int32 value as int16
assert_eq!(variant_i32.as_i16(), Some(Variant::from(1337i16));
// but not as an int8
assert_eq(variant_i32.as_i8(), None)
If you have time to do it in this PR that would be great, otherwise a follow on would be wonderful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doctests are a great idea. I added them for all the methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @superserious-dev and @scovich -- I think this is a very nice step forward and contribution. It would be nice to clean up the left over TODO
s and add the docs and tests for the as_*
methods, but I also think this PR could be merged as is and that done as a follow on
Just let me know what you prefer.
cc @mkarbo
parquet-variant/src/decoder.rs
Outdated
Decimal4 = 8, | ||
Decimal8 = 9, | ||
Decimal16 = 10, | ||
// TODO: Add types for the rest of primitives, once API is agreed upon |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can remove this TODO as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's still missing some primitive types (like Double
, UUID
) from the spec. I'll admit to not following the variant development closely, but it seems the TODO is still relevant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added Double. The other primitives don't appear to be ready to implement yet based on this comment.
("primitive_string", Variant::String("This string is longer than 64 bytes and therefore does not fit in a short_string and it also includes several non ascii characters such as 🐢, 💖, ♥\u{fe0f}, 🎣 and 🤦!!")), | ||
//("primitive_timestamp", Variant::Null), | ||
//("primitive_timestampntz", Variant::Null), | ||
("primitive_timestamp", Variant::TimestampMicros(NaiveDate::from_ymd_opt(2025, 4, 16).unwrap().and_hms_milli_opt(16, 34, 56, 780).unwrap().and_utc())), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran the regen script in the Americas/New_York timezone -- does that line up correctly?
If so, then I think we (I) can make a PR to parquet-testing to clarify that in the repo
parquet-variant/src/decoder.rs
Outdated
|
||
/// Decodes an Int32 from the value section of a variant. | ||
pub(crate) fn decode_int32(value: &[u8]) -> Result<i32, ArrowError> { | ||
let value = i32::from_le_bytes(array_from_slice(value, 1)?); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to double check the reason for this constant 1
a few times. I found it pretty confusing that it is an offset.
We can always find a better way to structure this code in subsequent PRs, but it might make sense to slice the input value
so this code always looks like the following (no offset) -- and eventually remove the offset parameter all together
let value = i32::from_le_bytes(array_from_slice(value, 0)?);
That being said since it is all crate private, I think this is totally good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made the change to shift the base offset to 0.
} | ||
} | ||
|
||
pub fn as_int16(&self) -> Option<i16> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think it would be great to add tests
What I recommend doing is adding doc examples that will:
- Make it easier to explain what is happening
- Run as part of CI
Something like
let variant_i32 = Variant::from(1337i32);
// you can read an int32 value as int16
assert_eq!(variant_i32.as_i16(), Some(Variant::from(1337i16));
// but not as an int8
assert_eq(variant_i32.as_i8(), None)
If you have time to do it in this PR that would be great, otherwise a follow on would be wonderful
I'll push up another revision to address the comments in the near future. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Three main questions:
- How permissive/aggressive should the
Variant::as_XXX
functions be in the casting they perform? - Should we define helper structs for decimal variant types, so we don't have to pass (value, scale) pairs manually everywhere we use them?
- Does it actually make sense to provide
From
for the various primitive subtypes?
See comments for details.
Decimal4 { integer: i32, scale: u8 }, | ||
Decimal8 { integer: i64, scale: u8 }, | ||
Decimal16 { integer: i128, scale: u8 }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's hard to pass around the content of tuple/struct enum variants with multiple members -- you have to unpack them and pass scalar values around, which is bulky and error-prone (**). Should we consider creating a struct or family of structs for representing scalar decimal values?
(**)
It would be lovely if Variant::Decimal4
were an actual type in rust. It actually is a type under the hood, but the language doesn't expose it as such, because the enum discriminant has to be stored somewhere and rust usually embeds it in the variant itself to save space. See https://rust-lang.github.io/rfcs/2195-really-tagged-unions.html#guide-level-explanation for gory details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely something to consider. Perhaps the naming of the fields and potential new struct(s) could be refined in a follow-up PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will file a ticket
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parquet-variant/src/variant.rs
Outdated
} | ||
} | ||
|
||
impl<'m, 'v> From<i32> for Variant<'m, 'v> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how useful these From
would be in practice, especially once a variant builder/encoder API lands?
What use case(s) would they serve?
Are they designed to make unit tests easier to write?
Do we anticipate the variant builder/encoder would use them internally?
Also:
Variant::String
andVariant::ShortString
are problematic because there's no way to e.g. embed one in aVariant::Object
later on.- We anyway couldn't create
Variant::Object
orVariant::Array
(they would need a mutableVariantMetadataBuilder
to store field names in, etc.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think they are useful for writing tests / examples, FWIW
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simple ones, at least. We'll need actual builders in order to construct the complex types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indeed -- see the builder API from @PinkCrow007:
parquet-variant/src/variant.rs
Outdated
} | ||
} | ||
|
||
impl<'m, 'v> From<i64> for Variant<'m, 'v> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Newer versions of clippy suggest:
impl<'m, 'v> From<i64> for Variant<'m, 'v> { | |
impl From<i64> for Variant<'_, '_> { |
(several of these cases)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. Was able to see these lints by using cargo clippy -p parquet-variant -- -Wclippy::pedantic
Date(NaiveDate), | ||
TimestampMicros(DateTime<Utc>), | ||
TimestampNTZMicros(NaiveDateTime), | ||
Decimal4 { integer: i32, scale: u8 }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The spec calls the integer form of a decimal an "unscaled value."
Should we try to capture that in our naming here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like a good thing to do -- let's do it as a follow on ticket. I will file one
("primitive_string", Variant::String("This string is longer than 64 bytes and therefore does not fit in a short_string and it also includes several non ascii characters such as 🐢, 💖, ♥\u{fe0f}, 🎣 and 🤦!!")), | ||
//("primitive_timestamp", Variant::Null), | ||
//("primitive_timestampntz", Variant::Null), | ||
("primitive_timestamp", Variant::TimestampMicros(NaiveDate::from_ymd_opt(2025, 4, 16).unwrap().and_hms_milli_opt(16, 34, 56, 780).unwrap().and_utc())), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those seem to be the same two values? Did I misread something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @superserious-dev and @scovich -- I think this PR is in good shape and I'll merge it and file some additional follow on tasks / PRs
Date(NaiveDate), | ||
TimestampMicros(DateTime<Utc>), | ||
TimestampNTZMicros(NaiveDateTime), | ||
Decimal4 { integer: i32, scale: u8 }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like a good thing to do -- let's do it as a follow on ticket. I will file one
Decimal4 { integer: i32, scale: u8 }, | ||
Decimal8 { integer: i64, scale: u8 }, | ||
Decimal16 { integer: i128, scale: u8 }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will file a ticket
Variant::Int16(i) => i.try_into().ok(), | ||
Variant::Int32(i) => i.try_into().ok(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion, these included APIs are just a convenience
Since the rust API directly exposes the Variant enum, I think it is flexible enough that people can implement whatever semantics they want on top of it using match
match variant {
Variant::Int32(val) => { .. do system specific casting .. }
}
I'll try and contribute an example shortly
} | ||
} | ||
|
||
pub fn as_decimal_int32(&self) -> Option<(i32, u8)> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep this one for now
|
# Which issue does this PR close? We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. - Follow on to #7644 - Part of #6736 # Rationale for this change Using the parquet APIs came up in #7644 (comment) so I wanted to help contribute some additional documentation / tests # What changes are included in this PR? Add documentation and tests about `Variant`, specifically some examples of how to create `Variant` values # Are there any user-facing changes? More docs --------- Co-authored-by: Ryan Johnson <scovich@users.noreply.github.com>
Which issue does this PR close?
Closes #7630.
What changes are included in this PR?
This PR implements support for the following primitive variant types:
The following types are not yet implemented(see here for details):
Are there any user-facing changes?
Users who opt-in to the Variant feature can use these primitives.