Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The Variant spec uses a different number of bytes when encoding small vs. large arrays. For example, for an array the encoding looks like this (note that `num_elements` is either 1 or 4 bytes): https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#value-data-for-array-basic_type3

The size in bytes of `num_elements` is indicated by `is_large` in the `value_header`. Likewise, the number of bytes used for each `field_offset` depends on the total number of elements in the array.
```
                  7                     0
                 +-----------------------+
array value_data |                       |
                 :     num_elements      : <-- unsigned little-endian, 1 or 4 bytes
                 |                       |
                 +-----------------------+
                 |                       |
                 :     field_offset      : <-- unsigned little-endian, `field_offset_size` bytes
                 |                       |
                 +-----------------------+
                             :
                 +-----------------------+
                 |                       |
                 :     field_offset      : <-- unsigned little-endian, `field_offset_size` bytes
                 |                       |     (`num_elements + 1` field_offsets)
                 +-----------------------+
                 |                       |
                 :         value         :
                 |                       |
                 +-----------------------+
                             :
                 +-----------------------+
                 |                       |
                 :         value         : <-- (`num_elements` values)
                 |                       |
                 +-----------------------+
```
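For concreteness, the spec's sizing rule can be sketched as a small helper that picks the minimal little-endian width for an offset. `offset_size_bytes` is a hypothetical function for illustration, not part of any published crate:

```rust
/// Minimal number of bytes (1 to 4) needed to store `max_value` as an
/// unsigned little-endian integer, mirroring how the Variant spec
/// sizes `field_offset_size`. Hypothetical helper for illustration.
fn offset_size_bytes(max_value: usize) -> usize {
    match max_value {
        0..=0xFF => 1,
        0x100..=0xFFFF => 2,
        0x1_0000..=0xFF_FFFF => 3,
        _ => 4,
    }
}

fn main() {
    // An array whose value data spans 300 bytes needs 2-byte offsets.
    assert_eq!(offset_size_bytes(300), 2);
}
```

The catch this issue describes is that the argument to this function (the largest offset) is only known after all child values have been appended.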
As described by @scovich on @PinkCrow007's PR: #7653 (comment)
> The value offset and field id arrays require either knowing the number of elements/fields to be created in advance (and then worrying about what happens if the caller builds too many/few entries afterward), or building the arrays in separate storage and then moving an arbitrarily large number of buffered bytes to make room for them after the fact.
A similar issue exists for Objects. Hopefully, by designing a pattern for Arrays, we will also have a way to implement it for Objects.
Describe the solution you'd like
I would like:
- Examples of creating Arrays with more than 256 values (the number of distinct offsets that can be encoded in a `u8`)
- APIs that allow efficient construction of such Array values
Describe alternatives you've considered
Maybe the builder can leave room for the list length, append the values, and then go back and fill in the length when the list is finished. This gets tricky for "large" lists, since the width of the length field (1 vs. 4 bytes) may not be known upfront.
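Another way to sidestep the backfill is the "separate storage" approach from the quote above: buffer each encoded child value, and only emit the header once `num_elements` and the final offsets are known, at the cost of one extra copy of the value bytes. A minimal sketch (all names hypothetical; the header bit layout is simplified relative to the real spec):

```rust
/// Hypothetical sketch: buffer encoded child values separately, then
/// emit the header once `num_elements` and the offsets are known.
/// The real Variant `value_header` has additional bits not shown here.
fn encode_array(values: &[Vec<u8>]) -> Vec<u8> {
    let num_elements = values.len();
    let is_large = num_elements > u8::MAX as usize;

    // `num_elements + 1` offsets; the last one is the total value size.
    let mut offsets = Vec::with_capacity(num_elements + 1);
    let mut total = 0usize;
    for v in values {
        offsets.push(total);
        total += v.len();
    }
    offsets.push(total);

    // Smallest width (in bytes) that can represent every offset.
    let offset_size = [1usize, 2, 3, 4]
        .into_iter()
        .find(|&n| total < 1usize << (8 * n))
        .unwrap_or(4);

    let mut out = Vec::new();
    if is_large {
        out.extend_from_slice(&(num_elements as u32).to_le_bytes());
    } else {
        out.push(num_elements as u8);
    }
    for off in &offsets {
        out.extend_from_slice(&off.to_le_bytes()[..offset_size]);
    }
    for v in values {
        out.extend_from_slice(v);
    }
    out
}

fn main() {
    let out = encode_array(&[vec![1], vec![2, 3]]);
    // 1-byte num_elements (2), offsets 0, 1, 3, then the value bytes.
    assert_eq!(out, vec![2, 0, 1, 3, 1, 2, 3]);
}
```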
Specialized Functions
We could also introduce a function like `new_large_object()` for callers to hint up front that their object has many fields; if they use `new_object()` but push too many values, fall back to copying.
I think many clients would know the number of fields in advance and could then choose the appropriate API.
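The hint-plus-fallback idea could look something like the following sketch, shown here for arrays; `ArrayBuilder` and `with_capacity` are hypothetical names, and the promotion copy is only stubbed out:

```rust
/// Hypothetical builder sketch: a caller that knows the element count
/// up front can reserve the wide (4-byte) `num_elements` layout;
/// otherwise the builder starts small and is promoted once it grows
/// past 255 elements (a real implementation would shift buffered
/// bytes here to widen the header fields).
struct ArrayBuilder {
    large: bool,          // was the 4-byte num_elements layout chosen?
    values: Vec<Vec<u8>>, // buffered encoded child values
}

impl ArrayBuilder {
    /// Up-front size hint, in the spirit of a `new_large_object()` API.
    fn with_capacity(n: usize) -> Self {
        Self {
            large: n > u8::MAX as usize,
            values: Vec::with_capacity(n),
        }
    }

    fn append(&mut self, value: Vec<u8>) {
        self.values.push(value);
        // Fallback path: promote to the large layout after the fact.
        if self.values.len() > u8::MAX as usize {
            self.large = true;
        }
    }
}

fn main() {
    let mut b = ArrayBuilder::with_capacity(10);
    b.append(vec![0]);
    assert!(!b.large);

    // An accurate hint avoids the promotion copy entirely.
    assert!(ArrayBuilder::with_capacity(1_000).large);
}
```

The design trade-off is that an accurate hint makes construction a single pass, while a wrong hint degrades to the copying fallback rather than producing an error.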
Additional context