Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently ColumnValueDecoderImpl, and by extension ColumnReader, accept slices of [T::T] where T: DataType.
This was preserved by #1041, which extracted generics so that the arrow read path could use owned buffers instead, whilst preserving the existing API for non-arrow readers.
However, preserving this API has a couple of fairly substantial drawbacks:
- A lot of the test coverage in the parquet crate uses the arrow APIs, which use different implementations of ColumnValueDecoder
- The finite capacity of the output buffers introduces challenges related to record truncation (GenericColumnReader::read_records Yields Truncated Records #5150)
- The generics are pretty arcane and require some gymnastics to allow for slices that don't have a size separate from their capacity
- Buffers must be pre-allocated and zeroed ahead of time, which is not only an unnecessary overhead, but for lists will likely necessitate re-allocation once the correct number of values is ascertained
Describe the solution you'd like
I would like to update ColumnValueDecoderImpl to accept Vec<T> instead of [T::T]. This would not only simplify RecordReader and improve its performance for nested data, but would also eliminate issues like #5150
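
To make the difference concrete, here is a minimal sketch of the two buffer styles. It is not the parquet crate's API: `decode_into_slice`, `decode_into_vec` and `demo` are hypothetical names, and plain `i32` stands in for `T::T`. It only illustrates why a fixed-capacity slice can truncate output while an appended-to `Vec` cannot.

```rust
// Hypothetical stand-ins for illustration; none of these are parquet crate APIs.

/// Slice-based decoding: the caller pre-allocates and zeroes `out`,
/// and the decoder can write at most `out.len()` values.
fn decode_into_slice(src: &[i32], out: &mut [i32]) -> usize {
    let n = src.len().min(out.len());
    out[..n].copy_from_slice(&src[..n]);
    n // anything beyond the capacity of `out` is left unread
}

/// Vec-based decoding: the decoder appends, so the buffer grows as needed
/// and its length always equals the number of values written.
fn decode_into_vec(src: &[i32], out: &mut Vec<i32>) -> usize {
    out.extend_from_slice(src);
    src.len()
}

/// Runs both call patterns over the same five input values.
fn demo() -> (usize, Vec<i32>, usize, Vec<i32>) {
    let values = [1, 2, 3, 4, 5];

    // Slice path: zeroed up front, with a fixed capacity of 3.
    let mut fixed: Vec<i32> = vec![0; 3];
    let read_fixed = decode_into_slice(&values, &mut fixed);

    // Vec path: starts empty, no zeroing required.
    let mut growable: Vec<i32> = Vec::new();
    let read_growable = decode_into_vec(&values, &mut growable);

    (read_fixed, fixed, read_growable, growable)
}
```

In this sketch the slice path stops once its three slots are full, leaving the caller to detect and handle the remainder, whereas the `Vec` path receives all five values without the caller guessing a capacity in advance.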
Describe alternatives you've considered
Additional context