Skip to content

Conversation

scovich
Copy link
Contributor

@scovich scovich commented Sep 23, 2025

Which issue does this PR close?

Rationale for this change

Today, Variant must always refer to borrowed bytes, for objects and arrays (primitive values are parsed out and have no byte references at all). Unfortunately, when VariantArray::value unshreds a value, the resulting bytes are owned, not borrowed. Which makes it impossible for that method to meaningfully support arrays and objects.

What changes are included in this PR?

Define a new VariantArrayValue wrapper, which allows returning variant values based on owned or borrowed bytes. It hides most of the complexity from call sites by defining many similar methods as Variant and also by implementing a strategic selection of traits such as PartialEq.

The biggest user surface change is that VariantArrayBuilder code of the form:

builder.append_variant(variant_array.value(index));

becomes

variant_array
    .value(index)
    .consume(|value| builder.append_variant(value));

Are these changes tested?

Extensive coverage from existing unit tests (many of which were updated to use the new API)

Are there any user-facing changes?

Yes. The new VariantArrayValue class is public, and VariantArray::value type signature changed to return it.

@github-actions github-actions bot added the parquet-variant parquet-variant* crates label Sep 23, 2025
Copy link
Contributor Author

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-review

Comment on lines +135 to +136
VariantArrayValue::Borrowed(v) => v.metadata(),
VariantArrayValue::Owned { metadata, .. } => metadata,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method can be implemented directly, without resorting to the cost of calling self.as_variant_cow()

/// Extracts the underlying [`VariantList`], if this is a variant array.
///
/// See also [`Variant::as_list`].
pub fn as_list(&self) -> Option<VariantList<'m, '_>> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two as_xxx methods are not as efficient as their Variant counterparts, because they must return an owned object/list rather than a borrowed reference. But they're also used only in test code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a note in the docs clarifying that this is test-only?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, if we think it's important we could mark them #[cfg(test)] to force them test-only? But I'm not sure it really matters? They produce correct results, just not super-efficiently.

/// Extracts the value of the variant array element at `index`, if this is a variant object.
///
/// See also [`Variant::get_list_element`].
pub fn get_list_element(&self, index: usize) -> Option<Variant<'m, '_>> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Variant these two methods return is limited to the lifetime of self, because it could be owned.
This restriction didn't seem to impact any any code. Only unit tests call these methods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These were some weird uses of 'm lifetime that should never have been there and caused lifetime problems in this PR.

Copy link
Contributor

@liamzwbao liamzwbao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for the change, just few nits

| ShreddingState::PartiallyShredded { typed_value, .. }
if typed_value.is_valid(index) =>
{
return typed_value_to_variant(typed_value, index);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

supernit: match the style with the other arms

Suggested change
return typed_value_to_variant(typed_value, index);
typed_value_to_variant(typed_value, index)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the return is needed due to differing return type...

/// Extracts the underlying [`VariantList`], if this is a variant array.
///
/// See also [`Variant::as_list`].
pub fn as_list(&self) -> Option<VariantList<'m, '_>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a note in the docs clarifying that this is test-only?

@scovich
Copy link
Contributor Author

scovich commented Sep 25, 2025

@alamb -- any thoughts about this approach?
(it will conflict a bit with #8438, but the overall shape won't change)

@scovich
Copy link
Contributor Author

scovich commented Sep 25, 2025

@alamb -- any thoughts about this approach? (it will conflict a bit with #8438, but the overall shape won't change)

Rebased, now that the other merged. The PR is markedly smaller and more focused now.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @scovich and @liamzwbao for this PR.

One thing I have been thinking about is "how much should we try and optimize unshred of a single row"

Something feels off to me here about allocating individual Variant objects.

For example, what if the only way to get an unshredded value, was to unshred all the values of the whole array? This would result in a new array where all values were precomputed and could be references.

Would that allow for unshredded lists?

I am not opposed to this approach, but something just feels off to me

@scovich
Copy link
Contributor Author

scovich commented Sep 25, 2025

Thank you @scovich and @liamzwbao for this PR.

One thing I have been thinking about is "how much should we try and optimize unshred of a single row"

Something feels off to me here about allocating individual Variant objects.

For example, what if the only way to get an unshredded value, was to unshred all the values of the whole array? This would result in a new array where all values were precomputed and could be references.

Would that allow for unshredded lists?

I am not opposed to this approach, but something just feels off to me

It's definitely a good question.

Right now, VariantArray::value is the entry point everyone else uses when they need variant bytes -- including the columnar builders (including shred_variant and variant_get). Maybe the correct answer here is to move the logic out, as you say, and flip it around? If you want binary variant values, convert the array to variant and extract the resulting bytes (= super cheap)?

@scovich
Copy link
Contributor Author

scovich commented Sep 25, 2025

Right now, VariantArray::value is the entry point everyone else uses when they need variant bytes -- including the columnar builders (including shred_variant and variant_get).

Looking at shred_variant -- it only ever works with binary variant input (it checks and rejects any input that already has a typed_value column). That's easy enough to handle directly.

Looking at variant_get, it's used in one of three modes (currently conflated):

  1. If the requested path ends inside a binary variant object field, VariantArray::value just fetches the variant bytes and the builder continues pathing into the result from there.
    • Trivial to implement directly, just like for shred_variant, because there's no typed_value column.
  2. If the requested path ends on a shredded variant object field, and the output type is binary variant, thenVariantArray::value used to unshred the input
    • The efficient columnar approach to unshredding you propose makes a lot of sense (right now we're just calling value(i) in a tight loop)
  3. If the requested path ends on a shredded variant object field, and the output type is a shredded variant with a different schema, then VariantArray::value is used to first unshred the output, and then reshred it.
    • It will be a lot simpler (if not always optimal) to express reshredding as an unshred followed by a shred, rather than trying to be clever with an actual reshred kernel. if so, the columnar unshredding again makes a lot of sense.

If we want to handle the first one, we would need to drill further into the shredded variant data, attempting to follow the paths requested by the user's input shredding schema.

@scovich
Copy link
Contributor Author

scovich commented Sep 25, 2025

Once the Variant extension type merges properly, cast_to_variant and its helper make_arrow_to_variant_row_builder could potentially accept shredded variant as input, if we define an appropriate set of unshredding builders? Or (more likely) we could create a new unshred_variant method that uses the same make_arrow_to_variant_row_builder helper under the hood.

@scovich
Copy link
Contributor Author

scovich commented Sep 26, 2025

One thing I have been thinking about is "how much should we try and optimize unshred of a single row"

Something feels off to me here about allocating individual Variant objects.

For example, what if the only way to get an unshredded value, was to unshred all the values of the whole array? This would result in a new array where all values were precomputed and could be references.

Would that allow for unshredded lists?

I am not opposed to this approach, but something just feels off to me

@alamb -- I think your intuition was spot-on, PTAL?

@alamb
Copy link
Contributor

alamb commented Sep 30, 2025

Marking as draft as I don't think this one is waiting on review anymore and we are currently investigating other options

@alamb alamb marked this pull request as draft September 30, 2025 19:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet-variant parquet-variant* crates
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants