
[Variant] Test and implement efficient building for "large" Arrays #7699


Description

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The Variant spec uses different numbers of bytes for encoding / writing small and large arrays. For example, for an array, the encoding looks like this (note that num_elements is either 1 or 4 bytes): https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#value-data-for-array-basic_type3

The size in bytes of num_elements is indicated by is_large in the value_header.

Likewise, the number of bytes used for each field_offset (field_offset_size) depends on the total size of the array's value data.

                   7                     0
                  +-----------------------+
array value_data  |                       |
                  :     num_elements      :  <-- unsigned little-endian, 1 or 4 bytes
                  |                       |
                  +-----------------------+
                  |                       |
                  :     field_offset      :  <-- unsigned little-endian, `field_offset_size` bytes
                  |                       |
                  +-----------------------+
                              :
                  +-----------------------+
                  |                       |
                  :     field_offset      :  <-- unsigned little-endian, `field_offset_size` bytes
                  |                       |      (`num_elements + 1` field_offsets)
                  +-----------------------+
                  |                       |
                  :         value         :
                  |                       |
                  +-----------------------+
                              :
                  +-----------------------+
                  |                       |
                  :         value         :  <-- (`num_elements` values)
                  |                       |
                  +-----------------------+
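To make the layout concrete, here is a minimal sketch (my own reading of the spec, not code from the parquet crate) of writing the array header byte and num_elements for the small and large cases; the bit positions and the write_array_header name are illustrative:

```rust
/// Sketch only: encode the array header byte and num_elements per the layout above.
/// Assumes basic_type = 3 (array) in the low 2 bits of the header byte, with
/// `is_large` and `field_offset_size_minus_one` packed into the value_header bits.
fn write_array_header(out: &mut Vec<u8>, num_elements: usize, field_offset_size: u8) {
    let is_large = num_elements > u8::MAX as usize;
    let value_header = ((is_large as u8) << 2) | (field_offset_size - 1);
    out.push((value_header << 2) | 0b11); // low 2 bits: basic_type = 3 (array)

    if is_large {
        // 4-byte, unsigned little-endian num_elements
        out.extend_from_slice(&(num_elements as u32).to_le_bytes());
    } else {
        // 1-byte num_elements
        out.push(num_elements as u8);
    }
}
```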

As described by @scovich on @PinkCrow007's PR: #7653 (comment)

The value offset and field id arrays require either knowing the number of elements/fields to be created in advance (and then worrying about what happens if the caller builds too many/few entries afterward), or building the arrays in separate storage and then moving an arbitrarily large number of buffered bytes to make room for them after the fact.

A similar issue exists for Objects. Hopefully, designing a pattern for Arrays will also give us a way to implement it for Objects.
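For reference, the second strategy that comment describes (element values written straight into the output while the offsets are buffered separately, then shifted to make room for the header and offsets) looks roughly like the sketch below. It reuses write_array_header from the sketch above, and all of the names are mine, not the crate's:

```rust
/// Sketch only: finish an array whose element values were appended directly to
/// `buf` starting at `values_start`, while per-element byte offsets were tracked
/// in `offsets` (num_elements + 1 entries). The header, num_elements, and offsets
/// must go *before* the values, so the value bytes have to be moved out of the
/// way, which is the arbitrarily large copy the comment refers to.
fn finish_array(buf: &mut Vec<u8>, values_start: usize, offsets: &[u32]) {
    let num_elements = offsets.len().saturating_sub(1);
    let max_offset = *offsets.last().unwrap_or(&0);
    let field_offset_size: u8 = match max_offset {
        0..=0xFF => 1,
        0x100..=0xFFFF => 2,
        0x1_0000..=0xFF_FFFF => 3,
        _ => 4,
    };

    // Move the already-written value bytes aside to make room for the prefix.
    let values = buf.split_off(values_start);
    write_array_header(buf, num_elements, field_offset_size);
    for &off in offsets {
        buf.extend_from_slice(&off.to_le_bytes()[..field_offset_size as usize]);
    }
    buf.extend_from_slice(&values); // copy the value bytes back after the offsets
}
```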

Describe the solution you'd like

I would like:

  1. Examples of creating Arrays with more than 256 values (the number of offsets that can be encoded in a u8); a rough sketch follows after this list
  2. APIs that allow efficient construction of such Array values
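
As a starting point for item 1, this is the shape of the example/test I have in mind; VariantBuilder, new_list, append_value, and finish are placeholder names for whatever builder API we end up with, not necessarily existing methods:

```rust
// Illustrative only: build a Variant array with more than 256 elements, which
// forces the 4-byte num_elements encoding (is_large = 1) and wider offsets.
let mut builder = VariantBuilder::new();
{
    let mut list = builder.new_list();
    for i in 0..1_000i64 {
        list.append_value(i);
    }
    list.finish();
}
let (metadata, value) = builder.finish();
// A test would then read the value back and assert it contains 1_000 elements.
```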

Describe alternatives you've considered

Maybe the builder can leave room for the list length, append the values, and then go back and update the length when the list is finished. This gets tricky for building "large" lists, as the number of bytes needed for the length field may not be known upfront.
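A minimal sketch of that alternative, under the assumption that the builder optimistically reserves a single byte for the length and patches it when the list is finished; the tricky part shows up when the final count no longer fits in that byte and the buffered bytes have to be shifted anyway (all names here are illustrative):

```rust
/// Sketch only: patch a previously reserved 1-byte num_elements slot at
/// `num_elements_pos` once the list is finished. If the count overflows a byte,
/// the slot must grow to 4 bytes, shifting everything written after it; the
/// header's is_large bit (and possibly the offset widths) would also need to be
/// rewritten, which is what makes this approach awkward.
fn patch_num_elements(buf: &mut Vec<u8>, num_elements_pos: usize, num_elements: usize) {
    if num_elements <= u8::MAX as usize {
        // Small case: the reserved byte is enough.
        buf[num_elements_pos] = num_elements as u8;
    } else {
        // Large case: widen the slot to 4 bytes, moving the tail to make room.
        let tail = buf.split_off(num_elements_pos + 1);
        buf.truncate(num_elements_pos);
        buf.extend_from_slice(&(num_elements as u32).to_le_bytes());
        buf.extend_from_slice(&tail);
    }
}
```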

Specialized Functions

We could also introduce a function like new_large_object() so callers can hint up front that their object has many fields; if they use new_object but push too many values, fall back to copying.

I think many clients would know the number of fields up front and could then choose the appropriate API.
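
A hypothetical sketch of how such a hint could look from the caller's side (new_large_object is the suggested, not-yet-existing function; the other builder names are placeholders):

```rust
// Hypothetical API: the caller knows the field count up front, so the builder
// can pick the large encoding immediately instead of falling back to copying.
let mut builder = VariantBuilder::new();
{
    let mut obj = builder.new_large_object(); // or e.g. new_object_with_capacity(1_000)
    for i in 0..1_000 {
        obj.insert(&format!("field_{i}"), i as i64);
    }
    obj.finish();
}
let (metadata, value) = builder.finish();
```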

Additional context

Metadata

Labels

enhancement (Any new improvement worthy of an entry in the changelog), parquet (Changes to the parquet crate)
