Use Vec instead of Slice in ColumnReader #5177

@tustvold

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Currently, ColumnValueDecoderImpl, and by extension ColumnReader, accept slices of [T::T] where T: DataType.

This was preserved by #1041, which extracted generics so that the arrow read path could use owned buffer constructions instead, whilst preserving the existing API for non-arrow readers.

However, preserving this API has a couple of fairly substantial drawbacks:

  • A lot of the test coverage in the parquet crate uses the arrow APIs, which use different implementations of ColumnValueDecoder, so the slice-based implementation is exercised less
  • The finite capacity of the output buffers introduces challenges related to record truncation; see #5150 (GenericColumnReader::read_records Yields Truncated Records)
  • The generics are pretty arcane and require some gymnastics to accommodate slices, which don't have a size separate from their capacity
  • Buffers must be pre-allocated and zeroed ahead of time, which is not only an unnecessary overhead, but for lists will likely necessitate re-allocation once the correct number of values is ascertained
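
To make the capacity and pre-allocation points concrete, here is a minimal sketch of the failure mode. It does not use the parquet crate: `read_into_slice` and `slice_demo` are hypothetical stand-ins, and the five-value record is made up for illustration.

```rust
/// Hypothetical stand-in for a slice-based reader (not the parquet API):
/// copies as many decoded values as fit into `out` and returns the count written.
fn read_into_slice(decoded: &[i32], out: &mut [i32]) -> usize {
    // A slice has no length separate from its capacity, so the reader
    // can only fill what the caller allocated and must stop there.
    let n = decoded.len().min(out.len());
    out[..n].copy_from_slice(&decoded[..n]);
    n
}

/// Reads one five-value list record into a buffer sized for three values.
fn slice_demo() -> (usize, Vec<i32>) {
    // A single list record containing five values.
    let record = [1, 2, 3, 4, 5];
    // The caller has to guess a capacity and zero-fill it before reading.
    let mut out = vec![0i32; 3];
    let written = read_into_slice(&record, &mut out);
    // Only three of the five values arrived: the record is cut mid-way.
    (written, out)
}
```

With a buffer of three, the record comes back incomplete, and the caller would have to allocate a larger buffer and read again to recover the rest.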

Describe the solution you'd like

I would like to update ColumnValueDecoderImpl to accept Vec<T> instead of [T::T]. This would not only simplify RecordReader and improve its performance for nested data, but would also eliminate issues like #5150.
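
As a rough sketch of what a Vec-based decoder could look like: the `ValueDecoder` trait and `PlainI32Decoder` type below are hypothetical simplifications for illustration, not the actual signatures in the parquet crate. The decoder appends to a growable Vec, so the caller neither zeroes a buffer up front nor guesses a capacity.

```rust
/// Hypothetical, simplified decoder trait; not the parquet crate's
/// actual ColumnValueDecoder.
trait ValueDecoder {
    type Value;

    /// Appends up to `num_values` decoded values to `out` and returns
    /// the number of values appended.
    fn read(&mut self, out: &mut Vec<Self::Value>, num_values: usize) -> usize;
}

/// Hypothetical decoder over little-endian, 4-byte integers.
struct PlainI32Decoder {
    data: Vec<u8>,
    offset: usize,
}

impl ValueDecoder for PlainI32Decoder {
    type Value = i32;

    fn read(&mut self, out: &mut Vec<i32>, num_values: usize) -> usize {
        let remaining = (self.data.len() - self.offset) / 4;
        let n = remaining.min(num_values);
        // Grow the output as needed; no zero-filling is required.
        out.reserve(n);
        let bytes = &self.data[self.offset..self.offset + n * 4];
        for chunk in bytes.chunks_exact(4) {
            out.push(i32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]));
        }
        self.offset += n * 4;
        n
    }
}

/// Decodes five values in two calls, appending to the same Vec.
fn vec_demo() -> Vec<i32> {
    let data: Vec<u8> = [1i32, 2, 3, 4, 5]
        .iter()
        .flat_map(|v| v.to_le_bytes())
        .collect();
    let mut decoder = PlainI32Decoder { data, offset: 0 };
    let mut out = Vec::new();
    decoder.read(&mut out, 3); // appends 1, 2, 3
    decoder.read(&mut out, 10); // appends the remaining 4, 5
    out
}
```

Because the output's length is tracked by the Vec itself, a record that spans more values than expected simply extends the buffer rather than being truncated.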

Describe alternatives you've considered

Additional context

Labels: enhancement, parquet