Skip to content

Conversation

superserious-dev
Copy link
Contributor

@superserious-dev superserious-dev commented Jun 11, 2025

Which issue does this PR close?

Closes #7630.

What changes are included in this PR?

This PR implements support for the following primitive variant types:

  • Binary
  • Date
  • TimestampMicros
  • TimestampNtzMicros
  • Int16
  • Int32
  • Int64
  • Decimal4
  • Decimal8
  • Decimal16
  • Float
  • Double

The following types are not yet implemented(see here for details):

  • TimeNTZ
  • TimestampNanos
  • TimestampNtzNanos
  • UUID

Are there any user-facing changes?

Users who opt-in to the Variant feature can use these primitives.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jun 11, 2025
("primitive_string", Variant::String("This string is longer than 64 bytes and therefore does not fit in a short_string and it also includes several non ascii characters such as 🐢, 💖, ♥\u{fe0f}, 🎣 and 🤦!!")),
//("primitive_timestamp", Variant::Null),
//("primitive_timestampntz", Variant::Null),
("primitive_timestamp", Variant::TimestampMicros(NaiveDate::from_ymd_opt(2025, 4, 16).unwrap().and_hms_milli_opt(16, 34, 56, 780).unwrap().and_utc())),
Copy link
Contributor Author

@superserious-dev superserious-dev Jun 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regen.py has this as 2025-04-16T12:34:56.78 whereas it appears to be 2025-04-16T16:34:56.78 in the test case primitive_timestamp.value(which is the value that this test checks). I believe this is because the timezone is treated as local to the computer that ran the regen script.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran the regen script in the Americas/New_York timezone -- does that line up correctly?

If so, then I think we (I) can make a PR to parquet-testing to clarify that in the repo

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those seem to be the same two values? Did I misread something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this lines up perfectly: the Eastern timezone is 4 hours behind UTC and the naive input timestamp is being shifted to UTC.
(The two timestamps are 4 hours apart.)

pub fn try_new(metadata: &'m VariantMetadata, value: &'v [u8]) -> Result<Self, ArrowError> {
let header = *first_byte_from_slice(value)?;
let new_self = match get_basic_type(header)? {
VariantBasicType::Primitive => match get_primitive_type(header)? {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the decoders manually skip the first byte(header) in their implementation. It could makes sense to just pass in the value with this first byte removed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had wondered the same thing... but there's also an argument to be made that the header is indeed part of the variant value. A missing first byte would be a problem if we ever needed to retrieve the full value.

Maybe something to track as the implementation and usage of this API matures? I don't think we have enough information to make a confident decision quite yet?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was able to make this change by splitting value: &[u8]into value_metadata: u8 and value_data: &[u8]. Keeping track separately enabled simplifying the match.

}
}

pub fn as_int16(&self) -> Option<i16> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These as_* methods could benefit from code coverage. Happy to add unit tests for these either in this PR or a future one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it would be great to add tests

What I recommend doing is adding doc examples that will:

  1. Make it easier to explain what is happening
  2. Run as part of CI

Something like

let variant_i32 = Variant::from(1337i32);
// you can read an int32 value as int16
assert_eq!(variant_i32.as_i16(), Some(Variant::from(1337i16));
// but not as an int8
assert_eq(variant_i32.as_i8(), None)

If you have time to do it in this PR that would be great, otherwise a follow on would be wonderful

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doctests are a great idea. I added them for all the methods.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @superserious-dev and @scovich -- I think this is a very nice step forward and contribution. It would be nice to clean up the left over TODOs and add the docs and tests for the as_* methods, but I also think this PR could be merged as is and that done as a follow on

Just let me know what you prefer.

cc @mkarbo

Decimal4 = 8,
Decimal8 = 9,
Decimal16 = 10,
// TODO: Add types for the rest of primitives, once API is agreed upon
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove this TODO as well

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still missing some primitive types (like Double, UUID) from the spec. I'll admit to not following the variant development closely, but it seems the TODO is still relevant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added Double. The other primitives don't appear to be ready to implement yet based on this comment.

("primitive_string", Variant::String("This string is longer than 64 bytes and therefore does not fit in a short_string and it also includes several non ascii characters such as 🐢, 💖, ♥\u{fe0f}, 🎣 and 🤦!!")),
//("primitive_timestamp", Variant::Null),
//("primitive_timestampntz", Variant::Null),
("primitive_timestamp", Variant::TimestampMicros(NaiveDate::from_ymd_opt(2025, 4, 16).unwrap().and_hms_milli_opt(16, 34, 56, 780).unwrap().and_utc())),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran the regen script in the Americas/New_York timezone -- does that line up correctly?

If so, then I think we (I) can make a PR to parquet-testing to clarify that in the repo


/// Decodes an Int32 from the value section of a variant.
pub(crate) fn decode_int32(value: &[u8]) -> Result<i32, ArrowError> {
let value = i32::from_le_bytes(array_from_slice(value, 1)?);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to double check the reason for this constant 1 a few times. I found it pretty confusing that it is an offset.

We can always find a better way to structure this code in subsequent PRs, but it might make sense to slice the input value so this code always looks like the following (no offset) -- and eventually remove the offset parameter all together

    let value = i32::from_le_bytes(array_from_slice(value, 0)?);

That being said since it is all crate private, I think this is totally good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made the change to shift the base offset to 0.

}
}

pub fn as_int16(&self) -> Option<i16> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it would be great to add tests

What I recommend doing is adding doc examples that will:

  1. Make it easier to explain what is happening
  2. Run as part of CI

Something like

let variant_i32 = Variant::from(1337i32);
// you can read an int32 value as int16
assert_eq!(variant_i32.as_i16(), Some(Variant::from(1337i16));
// but not as an int8
assert_eq(variant_i32.as_i8(), None)

If you have time to do it in this PR that would be great, otherwise a follow on would be wonderful

@superserious-dev
Copy link
Contributor Author

superserious-dev commented Jun 12, 2025

Thanks @superserious-dev and @scovich -- I think this is a very nice step forward and contribution. It would be nice to clean up the left over TODOs and add the docs and tests for the as_* methods, but I also think this PR could be merged as is and that done as a follow on

Just let me know what you prefer.

cc @mkarbo

I'll push up another revision to address the comments in the near future.

Copy link
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three main questions:

  • How permissive/aggressive should the Variant::as_XXX functions be in the casting they perform?
  • Should we define helper structs for decimal variant types, so we don't have to pass (value, scale) pairs manually everywhere we use them?
  • Does it actually make sense to provide From for the various primitive subtypes?

See comments for details.

Comment on lines +412 to +414
Decimal4 { integer: i32, scale: u8 },
Decimal8 { integer: i64, scale: u8 },
Decimal16 { integer: i128, scale: u8 },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's hard to pass around the content of tuple/struct enum variants with multiple members -- you have to unpack them and pass scalar values around, which is bulky and error-prone (**). Should we consider creating a struct or family of structs for representing scalar decimal values?

(**)

It would be lovely if Variant::Decimal4 were an actual type in rust. It actually is a type under the hood, but the language doesn't expose it as such, because the enum discriminant has to be stored somewhere and rust usually embeds it in the variant itself to save space. See https://rust-lang.github.io/rfcs/2195-really-tagged-unions.html#guide-level-explanation for gory details.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely something to consider. Perhaps the naming of the fields and potential new struct(s) could be refined in a follow-up PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will file a ticket

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}
}

impl<'m, 'v> From<i32> for Variant<'m, 'v> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how useful these From would be in practice, especially once a variant builder/encoder API lands?

What use case(s) would they serve?
Are they designed to make unit tests easier to write?
Do we anticipate the variant builder/encoder would use them internally?

Also:

  • Variant::String and Variant::ShortString are problematic because there's no way to e.g. embed one in a Variant::Object later on.
  • We anyway couldn't create Variant::Object or Variant::Array (they would need a mutable VariantMetadataBuilder to store field names in, etc.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they are useful for writing tests / examples, FWIW

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simple ones, at least. We'll need actual builders in order to construct the complex types.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}
}

impl<'m, 'v> From<i64> for Variant<'m, 'v> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Newer versions of clippy suggest:

Suggested change
impl<'m, 'v> From<i64> for Variant<'m, 'v> {
impl From<i64> for Variant<'_, '_> {

(several of these cases)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. Was able to see these lints by using cargo clippy -p parquet-variant -- -Wclippy::pedantic

Date(NaiveDate),
TimestampMicros(DateTime<Utc>),
TimestampNTZMicros(NaiveDateTime),
Decimal4 { integer: i32, scale: u8 },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spec calls the integer form of a decimal an "unscaled value."
Should we try to capture that in our naming here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a good thing to do -- let's do it as a follow on ticket. I will file one

("primitive_string", Variant::String("This string is longer than 64 bytes and therefore does not fit in a short_string and it also includes several non ascii characters such as 🐢, 💖, ♥\u{fe0f}, 🎣 and 🤦!!")),
//("primitive_timestamp", Variant::Null),
//("primitive_timestampntz", Variant::Null),
("primitive_timestamp", Variant::TimestampMicros(NaiveDate::from_ymd_opt(2025, 4, 16).unwrap().and_hms_milli_opt(16, 34, 56, 780).unwrap().and_utc())),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those seem to be the same two values? Did I misread something?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @superserious-dev and @scovich -- I think this PR is in good shape and I'll merge it and file some additional follow on tasks / PRs

Date(NaiveDate),
TimestampMicros(DateTime<Utc>),
TimestampNTZMicros(NaiveDateTime),
Decimal4 { integer: i32, scale: u8 },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a good thing to do -- let's do it as a follow on ticket. I will file one

Comment on lines +412 to +414
Decimal4 { integer: i32, scale: u8 },
Decimal8 { integer: i64, scale: u8 },
Decimal16 { integer: i128, scale: u8 },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will file a ticket

Comment on lines +530 to +531
Variant::Int16(i) => i.try_into().ok(),
Variant::Int32(i) => i.try_into().ok(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, these included APIs are just a convenience

Since the rust API directly exposes the Variant enum, I think it is flexible enough that people can implement whatever semantics they want on top of it using match

match variant {
  Variant::Int32(val) => { .. do system specific casting .. }
}

I'll try and contribute an example shortly

}
}

pub fn as_decimal_int32(&self) -> Option<(i32, u8)> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep this one for now

@alamb alamb merged commit 71ee9d9 into apache:main Jun 13, 2025
14 checks passed
@alamb
Copy link
Contributor

alamb commented Jun 13, 2025

I'll try and contribute an example shortly

@superserious-dev superserious-dev deleted the primitive-type-read-support branch June 13, 2025 16:14
alamb added a commit that referenced this pull request Jun 19, 2025
# Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.

- Follow on to #7644
- Part of #6736

# Rationale for this change

Using the parquet APIs came up in
#7644 (comment) so I
wanted to help contribute some additional documentation / tests

# What changes are included in this PR?

Add documentation and tests about `Variant`, specifically some examples
of how to create `Variant` values

# Are there any user-facing changes?

More docs

---------

Co-authored-by: Ryan Johnson <scovich@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Variant] Implement read support for remaining primitive types
4 participants