
[Variant]: Rust API to Create Variant Values #7424

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Part of supporting the Variant type in Parquet and Arrow is programmatically
creating values in the binary format described in VariantEncoding.md. This
is important in the short term for writing tests, as well as for converting from
other types (specifically JSON).

Note this ticket covers the API to create such values, but not reading them
(see #7423) or reading/writing variant values to JSON.

Describe the solution you'd like

What I would like is a Rust API that can efficiently create such values. I
think it is also important to design an API that supports reusing the metadata.

Describe alternatives you've considered

What I suggest is a Builder-style API, modeled on the Arrow array builder APIs
such as StringBuilder that can efficiently create Variant values.

For example:

// Location to write metadata
// Should be anything that implements std::io::Write or a trait
let mut metadata_buffer = vec![];
// Create a builder for constructing variant values
let mut builder = VariantBuilder::new(&mut metadata_buffer);

Example of creating an object Variant value:

// Create the equivalent of {"foo": 1, "bar": 100}
let mut value_buffer = vec![];
let mut object_builder = builder.new_object(&mut value_buffer); // object_builder has reference to builder
object_builder.append_value("foo", 1);
object_builder.append_value("bar", 100);
object_builder.finish();
// value_buffer now contains a valid variant 🎉
// builder contains a metadata header with fields "foo" and "bar"
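
To make the shape of this proposal more concrete, here is a minimal, self-contained sketch of how the two builders might fit together. Everything in it is hypothetical: the `field_id` and `field_names` helpers, the restriction to `i64` values, and above all the byte layout, which is a toy encoding and not the format in VariantEncoding.md. For brevity `VariantBuilder::new` takes no metadata buffer here.

```rust
use std::collections::HashMap;

/// Hypothetical builder that owns the field-name dictionary (the "metadata").
pub struct VariantBuilder {
    /// Field names in insertion order; the index is the field id.
    names: Vec<String>,
    /// Reverse lookup from field name to field id.
    ids: HashMap<String, u32>,
}

impl VariantBuilder {
    pub fn new() -> Self {
        Self { names: Vec::new(), ids: HashMap::new() }
    }

    /// Return the id for `name`, adding it to the dictionary if needed.
    fn field_id(&mut self, name: &str) -> u32 {
        if let Some(&id) = self.ids.get(name) {
            return id;
        }
        let id = self.names.len() as u32;
        self.names.push(name.to_string());
        self.ids.insert(name.to_string(), id);
        id
    }

    /// Start an object whose bytes are written to `out`.
    pub fn new_object<'a>(&'a mut self, out: &'a mut Vec<u8>) -> ObjectBuilder<'a> {
        ObjectBuilder { parent: self, out, fields: Vec::new() }
    }

    /// Field names seen so far, in field-id order.
    pub fn field_names(&self) -> &[String] {
        &self.names
    }
}

/// Hypothetical object builder. It borrows the parent builder so that new
/// field names land in the shared dictionary.
pub struct ObjectBuilder<'a> {
    parent: &'a mut VariantBuilder,
    out: &'a mut Vec<u8>,
    fields: Vec<(u32, i64)>,
}

impl<'a> ObjectBuilder<'a> {
    /// Record a field; only `i64` values are supported in this sketch.
    pub fn append_value(&mut self, name: &str, value: i64) {
        let id = self.parent.field_id(name);
        self.fields.push((id, value));
    }

    /// Write a toy encoding (NOT the Variant format): the field count,
    /// then one (field id byte, little-endian i64) pair per field.
    pub fn finish(self) {
        let ObjectBuilder { out, fields, .. } = self;
        out.push(fields.len() as u8);
        for (id, value) in fields {
            out.push(id as u8);
            out.extend_from_slice(&value.to_le_bytes());
        }
    }
}
```

One consequence of this design: because `ObjectBuilder` holds a mutable borrow of the `VariantBuilder`, only one top-level object can be under construction at a time, and the builder becomes usable again once `finish` consumes the object builder.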

Example of creating a nested Variant value:

Here is how we might create a nested Object:

// Create nested object: the equivalent of {"foo": {"bar": 100}}
// note we haven't finalized the metadata yet so we reuse it here
let mut value_buffer2 = vec![];
let mut object_builder2 = builder.new_object(&mut value_buffer2);
let mut foo_object_builder = object_builder2.append_object("foo"); // builder for "foo"
foo_object_builder.append_value("bar", 100);
foo_object_builder.finish();
object_builder2.finish();
// value_buffer2 contains a valid variant

Finish the builder to finalize the metadata

When the builder is finished, it finalizes / writes metadata as needed.

// complete writing the metadata
builder.finish();
// metadata_buffer contains valid variant metadata bytes
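
As an illustration of what finalizing the metadata could involve, here is a sketch that serializes a field-name dictionary into bytes. The function name `encode_metadata` is hypothetical. The layout (header byte, dictionary size, `size + 1` offsets, then the concatenated names) and the header bit positions follow my reading of VariantEncoding.md and should be checked against the spec before relying on them. Dictionaries whose size or total name length exceed `u32::MAX` are not handled.

```rust
/// Serialize a field-name dictionary into metadata bytes.
/// Assumed layout: header byte, dictionary size, `size + 1` offsets,
/// then the concatenated UTF-8 names.
pub fn encode_metadata(names: &[&str], sorted: bool) -> Vec<u8> {
    let total_bytes: usize = names.iter().map(|n| n.len()).sum();

    // Smallest width (1 to 4 bytes) that can hold both the largest offset
    // and the dictionary size.
    let max_value = total_bytes.max(names.len());
    let offset_size: usize = match max_value {
        0..=0xFF => 1,
        0x100..=0xFFFF => 2,
        0x1_0000..=0xFF_FFFF => 3,
        _ => 4,
    };

    let mut out = Vec::new();

    // Header (assumed bit positions): version 1 in bits 0-3, the sorted
    // flag in bit 4, and (offset_size - 1) in bits 6-7.
    let header = 0x01 | ((sorted as u8) << 4) | (((offset_size - 1) as u8) << 6);
    out.push(header);

    // Write `value` as `offset_size` little-endian bytes.
    let write_uint = |out: &mut Vec<u8>, value: usize| {
        out.extend_from_slice(&(value as u32).to_le_bytes()[..offset_size]);
    };

    // Dictionary size, then the start offset of every name plus the end
    // offset of the last one.
    write_uint(&mut out, names.len());
    let mut offset = 0usize;
    write_uint(&mut out, offset);
    for name in names {
        offset += name.len();
        write_uint(&mut out, offset);
    }

    // Finally the name bytes themselves, back to back.
    for name in names {
        out.extend_from_slice(name.as_bytes());
    }
    out
}
```

Under these assumptions, the dictionary `["foo", "bar"]` with one-byte offsets becomes the header byte, the size `2`, the offsets `0, 3, 6`, and the six bytes of `foobar`.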

Considerations:

Reusing metadata

The metadata mostly contains a dictionary of field names, and so I believe an
important optimization will be reusing the same metadata to create multiple
values. For example the three following JSON values can use the same metadata
(with field names "foo" and "bar"):

{
  "foo": 1,
  "bar": 100
}
{
  "foo": 2,
  "bar": 200
}
{
  "foo": 3
}
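
The reuse described above can be sketched with a small shared dictionary. The `FieldNames` type and the `field_ids` function are hypothetical names for illustration only; the point is that all three records resolve against a single dictionary with two entries.

```rust
use std::collections::HashMap;

/// Hypothetical field-name dictionary shared by several values.
#[derive(Default)]
pub struct FieldNames {
    names: Vec<String>,
    ids: HashMap<String, u32>,
}

impl FieldNames {
    /// Return the existing id for `name`, or assign the next free id.
    pub fn intern(&mut self, name: &str) -> u32 {
        if let Some(&id) = self.ids.get(name) {
            return id;
        }
        let id = self.names.len() as u32;
        self.names.push(name.to_string());
        self.ids.insert(name.to_string(), id);
        id
    }

    /// Number of distinct field names seen so far.
    pub fn len(&self) -> usize {
        self.names.len()
    }
}

/// Map each record's field names to ids in the shared dictionary.
pub fn field_ids(dict: &mut FieldNames, records: &[Vec<&str>]) -> Vec<Vec<u32>> {
    let mut result = Vec::with_capacity(records.len());
    for record in records {
        let mut ids = Vec::with_capacity(record.len());
        for name in record {
            ids.push(dict.intern(name));
        }
        result.push(ids);
    }
    result
}
```

Calling `field_ids` with the field names of the three JSON values above yields ids `[0, 1]`, `[0, 1]`, and `[0]`, while the dictionary stays at two entries.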

Sorted dictionaries:

The metadata encoding spec permits writing sorted dictionaries in the metadata
header. However, when writing sorted dictionaries, once an object has been
created, it is in general not possible to add new metadata dictionary values.
The variant object value itself refers to dictionary entries by position, so
inserting a new value into the sorted dictionary shifts the entries after it
and invalidates any object that was already written.
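
A small sketch of the problem, using a hypothetical `insert_sorted` helper: inserting a name into a sorted dictionary moves every later entry up by one position, so a position recorded by an already-written object no longer points at the same name.

```rust
/// Insert `name` into a sorted dictionary and return its index.
/// Entries at or after that index shift up by one position.
pub fn insert_sorted(dict: &mut Vec<String>, name: &str) -> usize {
    match dict.binary_search_by(|probe| probe.as_str().cmp(name)) {
        // Already present: nothing moves.
        Ok(index) => index,
        // Missing: insert at the position that keeps the order sorted.
        Err(index) => {
            dict.insert(index, name.to_string());
            index
        }
    }
}
```

For example, with the sorted dictionary `["bar", "foo"]`, an object that stored position `1` for `"foo"` becomes stale after `insert_sorted(&mut dict, "baz")`, because position `1` now holds `"baz"`.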

One API that might work would be to supply pre-existing metadata to the builder,
reuse it when possible, and create new metadata when it isn't.
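
One way such an API might decide between the two paths is sketched below. The names `MetadataChoice` and `resolve_fields` are hypothetical, and a linear scan stands in for whatever lookup structure a real implementation would use.

```rust
/// Outcome of trying to encode a value against pre-existing metadata.
#[derive(Debug, PartialEq)]
pub enum MetadataChoice {
    /// Every field name was found; ids index into the existing dictionary.
    Reused(Vec<u32>),
    /// At least one name was missing; a fresh dictionary was built instead.
    New { names: Vec<String>, ids: Vec<u32> },
}

/// Position of `name` in `names`, if present.
fn position(names: &[String], name: &str) -> Option<u32> {
    names.iter().position(|n| n.as_str() == name).map(|i| i as u32)
}

/// Reuse `existing` when it already contains every field name; otherwise
/// build a new dictionary that holds just this value's names.
pub fn resolve_fields(existing: &[String], fields: &[&str]) -> MetadataChoice {
    let reused: Option<Vec<u32>> = fields
        .iter()
        .map(|field| position(existing, field))
        .collect();
    if let Some(ids) = reused {
        return MetadataChoice::Reused(ids);
    }

    let mut names: Vec<String> = Vec::new();
    let mut ids = Vec::with_capacity(fields.len());
    for field in fields {
        let id = match position(&names, field) {
            Some(id) => id,
            None => {
                names.push(field.to_string());
                (names.len() - 1) as u32
            }
        };
        ids.push(id);
    }
    MetadataChoice::New { names, ids }
}
```

Because the existing dictionary is never modified, this approach works equally well when that dictionary is sorted; the cost is that a value with even one unknown field name gets its own metadata.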

Additional context

Labels: arrow, enhancement, parquet
