Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Certain record batches have columns that contain entirely the same value -- for example, the partitioning column in #1141.
In that case the ListingProvider adds a column with a constant value based on the path in the file system
So a file in a path like /foo/date=2021-11-5/bar.parquet would end up producing one or more RecordBatches with a date column that had the value 2021-11-5.
At the moment, to represent such a column using an Arrow array we have to repeat the value 2021-11-5 over and over again, which is inefficient both in storage space and in processing time.
| date | other_col |
|---|---|
| 2021-11-5 | 1 |
| 2021-11-5 | 2 |
| 2021-11-5 | 3 |
| 2021-11-5 | 4 |
| ... | ... |
| 2021-11-5 | 1000000000 |
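To make the cost concrete, here is a stdlib-only sketch (no arrow dependency; the byte accounting is a simplified stand-in for real Arrow buffers) comparing materializing the constant value per row against storing it once as a scalar:

```rust
// Simplified stand-in for the memory cost comparison -- illustration only.
// Returns (bytes to repeat `value` once per row, bytes to store it once).
fn column_bytes(n_rows: usize, value: &str) -> (usize, usize) {
    // Each materialized row pays for the String header plus the payload.
    let per_row = std::mem::size_of::<String>() + value.len();
    (n_rows * per_row, per_row)
}

fn main() {
    let (repeated, scalar) = column_bytes(1_000_000_000, "2021-11-5");
    // Repeating the value costs n_rows times more than a single scalar.
    assert_eq!(repeated, scalar * 1_000_000_000);
    println!("repeated: {repeated} bytes, scalar: {scalar} bytes");
}
```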
Describe the solution you'd like
It seems that having a kind of RecordBatch that can contain constant columns represented by a scalar value (like the C++ ExecBatch) would be a nice abstraction that enables many optimizations. I didn't find an issue summarizing this discussion and its conclusion; should we create one?
According to @westonpace, in C++ there is ExecBatch in addition to RecordBatch for this purpose. ExecBatch is used within the exec plan and is pretty much identical to RecordBatch except:
- No schema
- No field names
- Column values can be scalars
Note that DataFusion already has an Array/Scalar notion in ColumnarValue https://github.com/apache/arrow-datafusion/blob/a1c794cec233f7fe34f34c7d64f529625a507669/datafusion/src/physical_plan/mod.rs#L388-L395 but its use is confined to expression evaluation at the moment.
Perhaps we could build something like DFRecordBatch akin to DFSchema (i.e. wrap the arrow RecordBatch with methods that allow the columns to be ColumnarValue).
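A rough stdlib-only sketch of what such a wrapper might look like. `DfRecordBatch` is hypothetical, and the `Array`/`Scalar` payload types below are simplified stand-ins for arrow's `ArrayRef` and DataFusion's `ScalarValue`, not the real API:

```rust
// Hypothetical sketch of the proposed batch type -- not DataFusion's API.
#[derive(Clone, Debug, PartialEq)]
enum ColumnarValue {
    Array(Vec<String>), // stand-in for arrow's ArrayRef
    Scalar(String),     // stand-in for ScalarValue: one value for all rows
}

struct DfRecordBatch {
    num_rows: usize,
    columns: Vec<ColumnarValue>,
}

impl DfRecordBatch {
    /// Fetch the value at `row` in `col`, transparently handling both
    /// array-backed and constant (scalar) columns.
    fn value(&self, col: usize, row: usize) -> &str {
        match &self.columns[col] {
            ColumnarValue::Array(values) => &values[row],
            ColumnarValue::Scalar(v) => v, // same value regardless of row
        }
    }
}

fn main() {
    let batch = DfRecordBatch {
        num_rows: 3,
        columns: vec![
            ColumnarValue::Scalar("2021-11-5".to_string()),
            ColumnarValue::Array(vec!["1".into(), "2".into(), "3".into()]),
        ],
    };
    assert_eq!(batch.num_rows, 3);
    assert_eq!(batch.value(0, 2), "2021-11-5"); // constant column, stored once
    assert_eq!(batch.value(1, 2), "3");
}
```

Operators that don't care about the distinction could call `value` (or a materializing accessor), while operators that do (e.g. filters or joins on the partition column) could special-case `Scalar` and skip per-row work entirely.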
Describe alternatives you've considered
We could do better with a Dictionary(Int8, Utf8) array (where each repeated string value requires only a single i8 key, i.e. 1 byte), but that still requires ~1 byte * the total number of rows in the RecordBatch.
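A stdlib-only sketch of the dictionary idea (simplified stand-in for arrow's `DictionaryArray`): each row stores only a 1-byte key into a small table of distinct values, so the per-row cost shrinks but never disappears:

```rust
// Simplified model of Dictionary(Int8, Utf8) encoding -- illustration only.
struct DictArray {
    keys: Vec<i8>,       // one i8 (1 byte) per row
    values: Vec<String>, // each distinct value stored exactly once
}

impl DictArray {
    fn get(&self, row: usize) -> &str {
        &self.values[self.keys[row] as usize]
    }
}

fn main() {
    let n_rows = 1_000;
    let dict = DictArray {
        keys: vec![0; n_rows], // every row points at dictionary entry 0
        values: vec!["2021-11-5".to_string()],
    };
    assert_eq!(dict.get(999), "2021-11-5");
    // Key storage still scales linearly with the row count: ~1 byte per row.
    assert_eq!(dict.keys.len() * std::mem::size_of::<i8>(), n_rows);
}
```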
Another classic approach I have seen for this kind of problem is to use RLE encoding so that these “almost entirely constant columns” become a very small number of RLE runs.
I think RLE has been proposed for Arrow several times before, but it doesn't have the "access array[i] in constant time" property required for Arrow arrays.
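To illustrate why, here is a minimal RLE sketch (stdlib only, not a proposed Arrow layout): an almost-constant column collapses to a handful of runs, but random access requires a binary search over run boundaries, i.e. O(log runs) rather than O(1):

```rust
// Minimal run-length encoding sketch -- illustration only.
struct RleArray {
    // (value, exclusive end row index); end indices strictly increase.
    runs: Vec<(String, usize)>,
}

impl RleArray {
    fn get(&self, row: usize) -> &str {
        // Binary search for the first run whose end index exceeds `row`:
        // this is O(log runs), not the O(1) Arrow arrays require.
        let idx = self.runs.partition_point(|(_, end)| *end <= row);
        &self.runs[idx].0
    }
}

fn main() {
    // A billion-row, almost entirely constant column becomes two runs.
    let arr = RleArray {
        runs: vec![
            ("2021-11-5".to_string(), 999_999_999),
            ("2021-11-6".to_string(), 1_000_000_000),
        ],
    };
    assert_eq!(arr.get(0), "2021-11-5");
    assert_eq!(arr.get(999_999_999), "2021-11-6");
}
```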
Additional context
(this is my transcription of some information in an ASF slack thread: https://the-asf.slack.com/archives/C01QUFS30TD/p1634810772007000)