
Optimized RecordBatch for constant columns #1248

@alamb

Description


Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Certain record batches have columns that contain entirely the same value -- for example, the partitioning column in #1141.

In that case, the ListingProvider adds a column whose constant value is derived from the path in the file system.

So a file in a path like /foo/date=2021-11-5/bar.parquet would end up producing one or more RecordBatches with a date column that had the value 2021-11-5.

At the moment, to represent such a column using an Arrow array we have to repeat the value 2021-11-5 over and over again, which is inefficient both in storage space and in processing time (a sketch of this repetition follows the table below).

date        other_col
2021-11-5   1
2021-11-5   2
2021-11-5   3
2021-11-5   4
...         ...
2021-11-5   1000000000
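
For illustration, here is a minimal sketch of how such a batch gets materialized today (using the arrow crate; the batch_with_constant_date helper is made up for this example): the partition value is physically repeated once per row.

```rust
use std::sync::Arc;

use arrow::array::{Int64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

/// Hypothetical helper: build a batch where the "date" column stores the
/// same string once per row -- the repetition this issue wants to avoid.
fn batch_with_constant_date(num_rows: usize) -> arrow::error::Result<RecordBatch> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("date", DataType::Utf8, false),
        Field::new("other_col", DataType::Int64, false),
    ]));
    // "2021-11-5" is materialized num_rows times (string bytes plus
    // offsets for every single row).
    let date = StringArray::from(vec!["2021-11-5"; num_rows]);
    let other = Int64Array::from((1..=num_rows as i64).collect::<Vec<_>>());
    RecordBatch::try_new(schema, vec![Arc::new(date), Arc::new(other)])
}
```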

Describe the solution you'd like

It seems that having a kind of RecordBatch that can contain constant columns represented by a scalar value (like the C++ ExecBatch) is a nice abstraction that would enable many optimizations. I didn't find an issue summarizing this discussion and its conclusion; should we create one?

According to @westonpace, in C++ there is ExecBatch in addition to RecordBatch for this purpose. ExecBatch is used within the exec plan and is essentially identical to RecordBatch, except:

  • No schema
  • No field names
  • Column values can be scalars

Note that DataFusion already has an Array/Scalar notion in ColumnarValue (https://github.com/apache/arrow-datafusion/blob/a1c794cec233f7fe34f34c7d64f529625a507669/datafusion/src/physical_plan/mod.rs#L388-L395), but its use is currently confined to expression evaluation.
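
For reference, the definition at that link is roughly the following (an Array/Scalar sum type over a whole column's worth of values):

```rust
/// Represents the result from an expression
#[derive(Clone)]
pub enum ColumnarValue {
    /// Array of values
    Array(ArrayRef),
    /// A single value
    Scalar(ScalarValue),
}
```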

Perhaps we could build something like a DFRecordBatch, akin to DFSchema (i.e. wrap the arrow RecordBatch with methods that allow the columns to be ColumnarValues).
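
A rough sketch of what such a wrapper could look like; the DFRecordBatch name and fields are assumptions for illustration, not a concrete proposal, and it leans on ScalarValue::to_array_of_size to expand scalars only when a plain batch is actually needed:

```rust
use std::sync::Arc;

use arrow::datatypes::SchemaRef;
use arrow::record_batch::RecordBatch;
use datafusion::physical_plan::ColumnarValue;

/// Hypothetical wrapper: like RecordBatch, but columns may be scalars.
struct DFRecordBatch {
    schema: SchemaRef,
    columns: Vec<ColumnarValue>,
    /// Stored explicitly because a Scalar column carries no length.
    num_rows: usize,
}

impl DFRecordBatch {
    /// Expand scalar columns into full arrays, paying the repetition
    /// cost only at the boundary where a plain RecordBatch is required.
    fn materialize(&self) -> arrow::error::Result<RecordBatch> {
        let columns = self
            .columns
            .iter()
            .map(|c| match c {
                ColumnarValue::Array(array) => array.clone(),
                ColumnarValue::Scalar(scalar) => scalar.to_array_of_size(self.num_rows),
            })
            .collect();
        RecordBatch::try_new(self.schema.clone(), columns)
    }
}
```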

Describe alternatives you've considered
We could use a Dictionary(Int8, Utf8) array to do better: each repeated string value then requires only a single i8 key (1 byte), but that still costs ~1 byte times the total number of rows in the RecordBatch.
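
As a sketch of this alternative (using the arrow crate's DictionaryArray; the constant_dict helper name is made up):

```rust
use arrow::array::DictionaryArray;
use arrow::datatypes::Int8Type;

/// Hypothetical helper: the same constant column as Dictionary(Int8, Utf8).
/// The values buffer stores "2021-11-5" once, but the keys buffer still
/// holds one i8 (1 byte) for every row, so cost grows linearly with rows.
fn constant_dict(num_rows: usize) -> DictionaryArray<Int8Type> {
    (0..num_rows).map(|_| "2021-11-5").collect()
}
```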

Another classic approach I have seen for this kind of problem is run-length encoding (RLE), so that these “almost entirely constant columns” become a very small number of runs.

I think RLE has been proposed for Arrow several times before, but it doesn’t have the “access array[i] in constant time” property required for Arrow arrays.
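
To make that point concrete, here is a toy run-length-encoded layout (my own illustration, not any proposed Arrow format): looking up array[i] needs a binary search over the run boundaries, i.e. O(log #runs) rather than O(1).

```rust
/// Toy RLE column: `run_ends[k]` is the exclusive end index of run k,
/// and `values[k]` is the value repeated throughout that run.
struct RleColumn<T> {
    run_ends: Vec<usize>,
    values: Vec<T>,
}

impl<T: Copy> RleColumn<T> {
    /// array[i] requires locating the run containing i: a binary search,
    /// O(log #runs), unlike the O(1) access of a flat Arrow array.
    fn value(&self, i: usize) -> T {
        let run = self.run_ends.partition_point(|&end| end <= i);
        self.values[run]
    }
}
```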

Additional context
(this is my transcription of some information in an ASF slack thread: https://the-asf.slack.com/archives/C01QUFS30TD/p1634810772007000)

cc @houqp @rdettai @Dandandan
