Variant: Rust API to Read Variant Values #7423

Closed
@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The first part of supporting the Variant type in Parquet and Arrow is
programmatic access to values encoded with the binary format described in
[VariantEncoding.md]. This ticket covers the API to read such values. It does
not cover creating such values or representing them in Arrow or Parquet, which
are covered in other tickets.

Describe the solution you'd like
I would like a Rust API, similar to Json::Value and related APIs, to dynamically access variant values.

Here is some example binary data for testing:

Describe alternatives you've considered

I think a Rust enum approach with references would be a good model.

I suggest creating a new crate, arrow-variant, and marking it as
experimental, noting that it will contain breaking changes for the next several
releases (maybe we can even version it 0.1, etc.)

For example:

Sketch of structures

/// Variant value. May contain references to metadata and value
/// 'a is lifetime for metadata
/// 'b is lifetime for value
pub enum Variant<'a, 'b> {
  Null,
  Int8(i8),
  // ... (other primitive types)
  // strings are stored in the value and thus have references to that value
  String(&'b str),
  ShortString(&'b str),
  // Objects and Arrays need the metadata and values, so store both.
  Object(VariantObject<'a, 'b>),
  Array(VariantArray<'a, 'b>),
}

/// Wrapper over Variant Metadata
pub struct VariantMetadata<'a> {
  metadata: &'a[u8],
  // perhaps access to header fields like dict length and is_sorted
}

/// Represents a Variant Object with references to the underlying metadata
/// and value fields
pub struct VariantObject<'a, 'b> {
  // pointer to metadata
  metadata: VariantMetadata<'a>,
  // pointer to value (note: the value lifetime 'b, not the metadata lifetime 'a)
  value: &'b [u8],
}
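
As an illustration of the "access to header fields" comment above, here is a minimal sketch of a zero-copy metadata wrapper. The byte layout it assumes (version in the low 4 bits of the header, a sorted flag in bit 4, offset_size_minus_one in the top 2 bits, then a little-endian dictionary size, the offsets, and the string bytes) is one reading of VariantEncoding.md and should be checked against the spec. The method names `try_new`, `dictionary_size`, `is_sorted`, and `get` are hypothetical, not an agreed API.

```rust
/// Hypothetical zero-copy view over a Variant metadata buffer.
/// The layout assumed here is a reading of VariantEncoding.md, not a verified one.
#[derive(Debug, Clone, Copy)]
pub struct VariantMetadata<'a> {
    metadata: &'a [u8],
    offset_size: usize,
    dictionary_size: usize,
    is_sorted: bool,
}

/// Decode an unsigned little-endian integer of up to 4 bytes.
fn read_le(bytes: &[u8]) -> usize {
    bytes.iter().rev().fold(0usize, |acc, b| (acc << 8) | *b as usize)
}

impl<'a> VariantMetadata<'a> {
    /// Parses only the header and dictionary size; no allocation, no copy.
    pub fn try_new(metadata: &'a [u8]) -> Result<Self, String> {
        let header = *metadata.first().ok_or("empty metadata buffer")?;
        let version = header & 0x0F;
        if version != 1 {
            return Err(format!("unsupported metadata version {version}"));
        }
        let is_sorted = (header >> 4) & 1 == 1;
        let offset_size = ((header >> 6) & 0x03) as usize + 1;
        let size_bytes = metadata
            .get(1..1 + offset_size)
            .ok_or("metadata too short for dictionary size")?;
        let dictionary_size = read_le(size_bytes);
        Ok(Self { metadata, offset_size, dictionary_size, is_sorted })
    }

    pub fn dictionary_size(&self) -> usize {
        self.dictionary_size
    }

    pub fn is_sorted(&self) -> bool {
        self.is_sorted
    }

    /// Returns the i-th dictionary string, validating bounds and UTF-8 on access.
    pub fn get(&self, i: usize) -> Result<&'a str, String> {
        if i >= self.dictionary_size {
            return Err(format!("dictionary index {i} out of range"));
        }
        let offsets_start = 1 + self.offset_size;
        let bytes_start = offsets_start + self.offset_size * (self.dictionary_size + 1);
        let offset_at = |k: usize| -> Result<usize, String> {
            let pos = offsets_start + k * self.offset_size;
            self.metadata
                .get(pos..pos + self.offset_size)
                .map(read_le)
                .ok_or_else(|| "truncated offset list".to_string())
        };
        let (start, end) = (offset_at(i)?, offset_at(i + 1)?);
        let raw = self
            .metadata
            .get(bytes_start + start..bytes_start + end)
            .ok_or("dictionary string out of bounds")?;
        std::str::from_utf8(raw).map_err(|e| e.to_string())
    }
}
```

Only the header is decoded eagerly in this sketch; offsets and string bytes are checked when a dictionary entry is requested.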

Creating Variants from buffers

// Each variant has a metadata and value buffer:
let metadata: &[u8] = ...;
let value: &[u8] = ...;
// The Rust API should NOT allocate or copy the metadata/values
let variant = Variant::try_new(metadata, value)?;
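
To make the no-allocation requirement concrete, here is a minimal sketch of what `try_new` could look like for a handful of types. It assumes the value header stores the basic type in the low 2 bits (0 = primitive, 1 = short string) and, in the upper 6 bits, either the primitive type id (0 = null, 3 = int8, 5 = int32) or the short string length. Those numbers are a reading of VariantEncoding.md and should be verified against the spec. The enum is trimmed to one lifetime because this subset never touches the metadata.

```rust
/// Trimmed-down, hypothetical Variant covering only a few types.
#[derive(Debug, PartialEq)]
pub enum Variant<'b> {
    Null,
    Int8(i8),
    Int32(i32),
    /// Borrows directly from the value buffer: no copy is made.
    ShortString(&'b str),
}

impl<'b> Variant<'b> {
    /// Decodes the value header and returns a view into `value`.
    /// `_metadata` is unused here; objects and arrays would need it.
    pub fn try_new(_metadata: &[u8], value: &'b [u8]) -> Result<Self, String> {
        let (&header, rest) = value.split_first().ok_or("empty value buffer")?;
        let basic_type = header & 0x03; // assumed: low 2 bits
        let type_info = header >> 2; // assumed: upper 6 bits
        match (basic_type, type_info) {
            (0, 0) => Ok(Variant::Null),
            (0, 3) => {
                let b = *rest.first().ok_or("truncated int8")?;
                Ok(Variant::Int8(b as i8))
            }
            (0, 5) => {
                let s = rest.get(..4).ok_or("truncated int32")?;
                Ok(Variant::Int32(i32::from_le_bytes([s[0], s[1], s[2], s[3]])))
            }
            (1, len) => {
                let raw = rest.get(..len as usize).ok_or("truncated short string")?;
                let s = std::str::from_utf8(raw).map_err(|e| e.to_string())?;
                Ok(Variant::ShortString(s))
            }
            (b, t) => Err(format!(
                "not handled in this sketch: basic_type={b}, type_info={t}"
            )),
        }
    }
}
```

Under these assumptions, the bytes `[0x0C, 0x2A]` would decode to `Variant::Int8(42)`, and the returned `ShortString` is a slice of the caller's buffer rather than an owned `String`.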

Working with Primitive Variants

// Act based on the type of variant
match variant {
  Variant::Int8(val) => println!("The value was int8: {val}"),
  ...
  Variant::ShortString(val) => println!("The value was a short string: {val}"),
  ...
  Variant::Object(object) => {
    println!("The variant was an object. The fields are:");
    for (field_name, field_value) in object.fields()? {
      // The inner field value is also a variant
      match field_value {
        Variant::...
      }
    }
  }
  // similarly for Variant::Array
}
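
The recursive shape of this match can be tried out with a stand-in type. The sketch below is only an illustration: it uses a hypothetical owned enum whose `Object` and `Array` variants hold `Vec`s, which is not the zero-copy design proposed here, and a hypothetical `describe` helper. It shows how each nested field value is itself matched as a variant.

```rust
/// Stand-in enum for illustration only. The proposed API would borrow from
/// the metadata and value buffers instead of owning Vecs.
pub enum Variant<'b> {
    Null,
    Int8(i8),
    ShortString(&'b str),
    Object(Vec<(&'b str, Variant<'b>)>),
    Array(Vec<Variant<'b>>),
}

/// Renders a variant as text, recursing into objects and arrays.
pub fn describe(v: &Variant<'_>) -> String {
    match v {
        Variant::Null => "null".to_string(),
        Variant::Int8(val) => format!("int8 {val}"),
        Variant::ShortString(val) => format!("short string {val:?}"),
        Variant::Object(fields) => {
            // Each field value is also a variant, so recurse.
            let inner: Vec<String> = fields
                .iter()
                .map(|(name, value)| format!("{name}: {}", describe(value)))
                .collect();
            format!("object {{{}}}", inner.join(", "))
        }
        Variant::Array(items) => {
            let inner: Vec<String> = items.iter().map(describe).collect();
            format!("array [{}]", inner.join(", "))
        }
    }
}
```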

I personally suggest doing this over a few PRs:

  1. Scaffolding: Variant struct/enum, support a few basic variant primitive types
  2. Basic nested type support: basic support for objects
  3. Array support: support for arrays
  4. Complete APIs, etc

Additional context

Open Questions: When should validation be done?

I do think there should be an API like:

/// ensure that metadata and value are valid according to the Variant spec, returns error if not.
Variant::validate(&metadata, &value)?;

However, the API sketched above proposes validating on access (when the
values are read). An alternate approach would be to validate everything on
creation and then use unchecked APIs during access.

I think validating once upfront is better if most fields are accessed or certain
fields are read multiple times. For the use case where only some fields are read
I think verifying on access would be faster.

The spec also allows metadata to contain dictionary values that do not appear as
field names in the variant value itself, so eager validation would potentially
verify string data unnecessarily.

I suggest starting with an API that is fallible (i.e. creating a Variant or
accessing a field returns Result<Variant>). We can always add unsafe versions
of the APIs for use cases where validation overhead is significant (e.g. UTF-8
validation of field names when writing JSON), justified with benchmarks.
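
The trade-off between the two strategies can be sketched with field names alone. The two types below are hypothetical and not part of any proposed API. `LazyNames` validates UTF-8 each time a name is read, while `ValidatedNames` checks every name once in `try_new` and then uses the unchecked conversion on access.

```rust
/// Validate on access: construction is free, every read pays for a check.
pub struct LazyNames<'a> {
    raw: &'a [&'a [u8]],
}

impl<'a> LazyNames<'a> {
    pub fn new(raw: &'a [&'a [u8]]) -> Self {
        Self { raw }
    }

    /// Fallible accessor: bounds and UTF-8 are checked on each call.
    pub fn get(&self, i: usize) -> Result<&'a str, String> {
        let bytes: &'a [u8] = *self.raw.get(i).ok_or("index out of range")?;
        std::str::from_utf8(bytes).map_err(|e| e.to_string())
    }
}

/// Validate upfront: construction checks every name, reads skip the check.
pub struct ValidatedNames<'a> {
    // Private so the invariant "all entries are valid UTF-8" cannot be broken.
    raw: &'a [&'a [u8]],
}

impl<'a> ValidatedNames<'a> {
    /// Checks all names, including ones the caller may never read.
    pub fn try_new(raw: &'a [&'a [u8]]) -> Result<Self, String> {
        for name in raw {
            std::str::from_utf8(name).map_err(|e| e.to_string())?;
        }
        Ok(Self { raw })
    }

    pub fn get(&self, i: usize) -> Option<&'a str> {
        let bytes: &'a [u8] = *self.raw.get(i)?;
        // SAFETY: try_new verified that every entry is valid UTF-8, and the
        // buffer is immutably borrowed for 'a, so it cannot have changed.
        Some(unsafe { std::str::from_utf8_unchecked(bytes) })
    }
}
```

With the lazy version, an invalid name that is never read costs nothing and causes no error. With the eager version, a single invalid name fails construction even if it is unused, which is the cost described above for unused dictionary entries.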

Labels

enhancement: Any new improvement worthy of an entry in the changelog
