Description
Currently combine(gd, :x => maximum) returns a PooledArray if the input is a PooledArray, but combine(gd, :x => (x -> maximum(x))) returns an Array. We should make this consistent. (combine(gd, :x => sum) is also very slow, but it's not a very common use case and it could be fixed internally; see #2564 (comment).)
For operations like first, last, maximum and minimum, returning a PooledArray is essential for performance. In general it makes sense to preserve PooledArray, since operations on them are likely to give a small number of unique values (just like map(f, ::PooledArray)). As another data point, reduce preserves the type too. This is easy to achieve by calling similar on the input column.
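The idea of calling similar on the input column can be sketched as follows. This is only an illustration, not how DataFrames.jl implements combine: reduce_groups and the hand-written groups vector of row indices are hypothetical, and the sketch assumes the reduction returns values of the same element type as the input.

```julia
using PooledArrays

# Hypothetical helper (not part of DataFrames.jl): apply `f` to each group of
# rows in `col` and store the results in a container allocated with `similar`,
# so that the output has the same container type as the input column.
function reduce_groups(f, col::AbstractVector, groups)
    out = similar(col, length(groups))   # same container type and eltype as `col`
    for (i, idx) in enumerate(groups)
        out[i] = f(view(col, idx))       # assumes `f` returns an element of eltype(col)
    end
    return out
end

x = PooledArray(["a", "b", "c", "a", "b"])
groups = [[1, 2], [3, 4, 5]]             # row indices of two groups (made up for the example)
res = reduce_groups(maximum, x, groups)
```

With a plain Vector as input, the same helper would return a Vector, since similar dispatches on the input type.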
Things are trickier for operations on multiple columns. In the case of [:x, :y] => coalesce, returning a PooledArray when both inputs are PooledArrays sounds essential. In general, a good rule could be that if the two inputs are PooledArrays then the output should also be a PooledArray. This seems to call for a more general array promotion mechanism (JuliaLang/julia#18472), but we could start with a simple rule: if the two inputs have the same container type, call similar on the first one; otherwise call Tables.allocatecolumn.
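A minimal sketch of that simple rule, under stated assumptions: same_container and allocate_output are hypothetical names, and comparing nameof(typeof(...)) is just one crude way to test for "the same container type" while ignoring type parameters. The output element type is passed in by hand here; a real implementation would have to infer it.

```julia
using PooledArrays, Tables

# Hypothetical: crude "same container type" test that ignores type parameters.
same_container(a, b) = nameof(typeof(a)) === nameof(typeof(b))

# Hypothetical allocation rule: reuse the container type of the first input
# when both inputs agree, otherwise fall back to the Tables.jl default.
function allocate_output(::Type{T}, n::Integer, a::AbstractVector, b::AbstractVector) where {T}
    return same_container(a, b) ? similar(a, T, n) : Tables.allocatecolumn(T, n)
end

x = PooledArray(["a", missing, "c"])
y = PooledArray(["x", "y", "z"])

# Both inputs are PooledArrays, so the output is allocated as a PooledArray.
out = allocate_output(String, length(x), x, y)
for i in eachindex(x, y)
    out[i] = coalesce(x[i], y[i])
end

# Mixed container types fall back to Tables.allocatecolumn.
mixed = allocate_output(String, length(x), x, ["x", "y", "z"])
```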
It's not clear whether other array types could benefit from such a system (note that CategoricalArray doesn't have the same problem, since the CategoricalValue type is enough to choose the container type). For BitArray inputs, it would also make sense to return a BitArray (if Bools are returned), but that doesn't sound as essential as for PooledArray. Maybe for arrays that are stored on disk, calling similar would allow the new array to be backed by disk too? That would make it easy to work with larger-than-memory DataFrames.
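For the BitArray case, Base's similar already behaves in the way described above, which a short check can illustrate (the variable names are only for the example):

```julia
mask = trues(4)                          # a BitVector

bool_out = similar(mask, Bool, 2)        # Bool results stay in a BitVector
int_out  = similar(mask, Int, 2)         # other element types fall back to Vector

println(typeof(bool_out))
println(typeof(int_out))
```

Whether a disk-backed array type would do the equivalent depends on how that type defines similar, so that part remains an open question.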