Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Certain record batches have columns that contain entirely the same value -- for example, the partitioning column in #1141.
In that case the ListingProvider adds a column with a constant value based on the path in the file system
So a file in a path like /foo/date=2021-11-5/bar.parquet would end up producing one or more RecordBatches with a date column that had the value 2021-11-5.
At the moment, to represent such a column using an Arrow array we have to repeat the value 2021-11-5 over and over again, which is inefficient both in storage space and in processing time.
| date | other_col |
|---|---|
| 2021-11-5 | 1 |
| 2021-11-5 | 2 |
| 2021-11-5 | 3 |
| 2021-11-5 | 4 |
| ... | ... |
| 2021-11-5 | 1000000000 |
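To make the cost concrete, here is a stdlib-only sketch (no arrow dependency; the byte accounting is a simplified stand-in for real Arrow buffers) comparing materializing the constant value per row against storing it once as a scalar:

```rust
// Simplified stand-in for the memory cost comparison -- illustration only.
// Returns (bytes to repeat `value` once per row, bytes to store it once).
fn column_bytes(n_rows: usize, value: &str) -> (usize, usize) {
    // Each materialized row pays for the String header plus the payload.
    let per_row = std::mem::size_of::<String>() + value.len();
    (n_rows * per_row, per_row)
}

fn main() {
    let (repeated, scalar) = column_bytes(1_000_000_000, "2021-11-5");
    // Repeating the value costs n_rows times more than a single scalar.
    assert_eq!(repeated, scalar * 1_000_000_000);
    println!("repeated: {repeated} bytes, scalar: {scalar} bytes");
}
```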
Describe the solution you'd like
It seems that having a kind of RecordBatch that can contain constant columns represented by a scalar value (like the C++ ExecBatch) would be a nice abstraction that enables many optimizations. I didn't find an issue summarizing this discussion and its conclusion; should we create one?
According to @westonpace, in C++ there is ExecBatch in addition to RecordBatch for this purpose. ExecBatch is used within the exec plan and is pretty much identical to RecordBatch except:
- No schema
- No field names
- Column values can be scalars
Note that DataFusion already has an Array/Scalar notion in ColumnarValue https://github.com/apache/arrow-datafusion/blob/a1c794cec233f7fe34f34c7d64f529625a507669/datafusion/src/physical_plan/mod.rs#L388-L395 but its use is confined to expression evaluation at the moment.
Perhaps we could build something like DFRecordBatch akin to DFSchema (i.e. wrap the arrow RecordBatch with methods that allow the columns to be ColumnarValue).
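A rough stdlib-only sketch of what such a wrapper might look like. `DfRecordBatch` is hypothetical, and the `Array`/`Scalar` payload types below are simplified stand-ins for arrow's `ArrayRef` and DataFusion's `ScalarValue`, not the real API:

```rust
// Hypothetical sketch of the proposed batch type -- not DataFusion's API.
#[derive(Clone, Debug, PartialEq)]
enum ColumnarValue {
    Array(Vec<String>), // stand-in for arrow's ArrayRef
    Scalar(String),     // stand-in for ScalarValue: one value for all rows
}

struct DfRecordBatch {
    num_rows: usize,
    columns: Vec<ColumnarValue>,
}

impl DfRecordBatch {
    /// Fetch the value at `row` in `col`, transparently handling both
    /// array-backed and constant (scalar) columns.
    fn value(&self, col: usize, row: usize) -> &str {
        match &self.columns[col] {
            ColumnarValue::Array(values) => &values[row],
            ColumnarValue::Scalar(v) => v, // same value regardless of row
        }
    }
}

fn main() {
    let batch = DfRecordBatch {
        num_rows: 3,
        columns: vec![
            ColumnarValue::Scalar("2021-11-5".to_string()),
            ColumnarValue::Array(vec!["1".into(), "2".into(), "3".into()]),
        ],
    };
    assert_eq!(batch.num_rows, 3);
    assert_eq!(batch.value(0, 2), "2021-11-5"); // constant column, stored once
    assert_eq!(batch.value(1, 2), "3");
}
```

Operators that don't care about the distinction could call `value` (or a materializing accessor), while operators that do (e.g. filters or joins on the partition column) could special-case `Scalar` and skip per-row work entirely.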
Describe alternatives you've considered
We could do better with a Dictionary(Int8, Utf8) array (where each repeated string value requires only a single i8 key, i.e. 1 byte), but that still requires ~1 byte * the total number of rows in the RecordBatch.
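A stdlib-only sketch of the dictionary idea (simplified stand-in for arrow's `DictionaryArray`): each row stores only a 1-byte key into a small table of distinct values, so the per-row cost shrinks but never disappears:

```rust
// Simplified model of Dictionary(Int8, Utf8) encoding -- illustration only.
struct DictArray {
    keys: Vec<i8>,       // one i8 (1 byte) per row
    values: Vec<String>, // each distinct value stored exactly once
}

impl DictArray {
    fn get(&self, row: usize) -> &str {
        &self.values[self.keys[row] as usize]
    }
}

fn main() {
    let n_rows = 1_000;
    let dict = DictArray {
        keys: vec![0; n_rows], // every row points at dictionary entry 0
        values: vec!["2021-11-5".to_string()],
    };
    assert_eq!(dict.get(999), "2021-11-5");
    // Key storage still scales linearly with the row count: ~1 byte per row.
    assert_eq!(dict.keys.len() * std::mem::size_of::<i8>(), n_rows);
}
```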
Another classic approach I have seen for this kind of problem is to use RLE encoding so that these “almost entirely constant columns” become a very small number of RLE runs.
I think RLE has been proposed for Arrow several times before, but it doesn't have the "access array[i] in constant time" property required for Arrow arrays.
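To illustrate why, here is a minimal RLE sketch (stdlib only, not a proposed Arrow layout): an almost-constant column collapses to a handful of runs, but random access requires a binary search over run boundaries, i.e. O(log runs) rather than O(1):

```rust
// Minimal run-length encoding sketch -- illustration only.
struct RleArray {
    // (value, exclusive end row index); end indices strictly increase.
    runs: Vec<(String, usize)>,
}

impl RleArray {
    fn get(&self, row: usize) -> &str {
        // Binary search for the first run whose end index exceeds `row`:
        // this is O(log runs), not the O(1) Arrow arrays require.
        let idx = self.runs.partition_point(|(_, end)| *end <= row);
        &self.runs[idx].0
    }
}

fn main() {
    // A billion-row, almost entirely constant column becomes two runs.
    let arr = RleArray {
        runs: vec![
            ("2021-11-5".to_string(), 999_999_999),
            ("2021-11-6".to_string(), 1_000_000_000),
        ],
    };
    assert_eq!(arr.get(0), "2021-11-5");
    assert_eq!(arr.get(999_999_999), "2021-11-6");
}
```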
Additional context
(this is my transcription of some information in an ASF slack thread: https://the-asf.slack.com/archives/C01QUFS30TD/p1634810772007000)