Return array type for combine/transform/select #2569

Open
@nalimilan

Currently, combine(gd, :x => maximum) returns a PooledArray if the input is a PooledArray, but combine(gd, :x => (x -> maximum(x))) returns an Array. We should make this consistent. (combine(gd, :x => sum) is also very slow, but that is not a very common use case and it could be fixed internally; see #2564 (comment).)

For operations like first, last, maximum and minimum, returning a PooledArray is essential for performance. In general it makes sense to preserve PooledArray since operations on them are likely to give a small number of unique values (just like map(f, ::PooledArray)). As another data point, reduce preserves the type too. This is easy to achieve by calling similar on the input column.
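
To illustrate the idea of calling similar on the input column, here is a minimal sketch that uses only Base. It is not the DataFrames.jl implementation: reduce_groups and the groups layout (one vector of row indices per group) are hypothetical, and BitVector stands in for PooledArray only because it ships with Base and also specializes similar. The sketch assumes every group is non-empty and that f returns the same type for every group.

```julia
# Hypothetical helper, not part of DataFrames.jl.
# `groups` holds one vector of row indices per group (an assumed layout).
function reduce_groups(f, col::AbstractVector, groups::Vector{Vector{Int}})
    # Apply f to the first group to learn the output element type.
    first_val = f(view(col, groups[1]))
    # Let the input column decide the container of the result.
    out = similar(col, typeof(first_val), length(groups))
    out[1] = first_val
    for i in 2:length(groups)
        out[i] = f(view(col, groups[i]))
    end
    return out
end

col = BitVector([true, false, false, false, true, true])
groups = [[1, 2], [3, 4], [5, 6]]

res_max = reduce_groups(maximum, col, groups)  # element type stays Bool
res_sum = reduce_groups(sum, col, groups)      # element type becomes Int
println(typeof(res_max))
println(typeof(res_sum))
```

With a Bool result the container is preserved (a BitVector comes back), while sum changes the element type and similar falls back to a plain Vector{Int}. The container type itself decides when preservation makes sense, which is the appeal of going through similar.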

Things are trickier for operations on multiple columns. In the case of [:x, :y] => coalesce, returning a PooledArray when both inputs are PooledArrays sounds essential. In general, a good rule could be that if the two inputs are PooledArrays then the output should also be a PooledArray. This seems to call for a more general array promotion mechanism (JuliaLang/julia#18472), but we could start with a simple rule: if the two inputs have the same container type, call similar on the first one; otherwise call Tables.allocatecolumn.
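
The simple rule could look roughly like the sketch below. Everything here is an assumption made for illustration: container and allocate_output are hypothetical names, "same container type" is read as "same unparameterized type" (obtained through Base.typename, which is not a documented public API), and a plain Vector{T} stands in for Tables.allocatecolumn so that the example needs only Base. BitVector again plays the role of PooledArray.

```julia
# Hypothetical helpers, not part of DataFrames.jl or Tables.jl.

# Unparameterized container type, e.g. BitArray for BitVector, Array for Vector{Int}.
container(x::AbstractArray) = Base.typename(typeof(x)).wrapper

function allocate_output(::Type{T}, n::Integer,
                         a::AbstractVector, b::AbstractVector) where {T}
    if container(a) === container(b)
        # Same container: let the first input choose the output type.
        return similar(a, T, n)
    else
        # Stand-in for Tables.allocatecolumn(T, n) to keep the sketch Base-only.
        return Vector{T}(undef, n)
    end
end

x = BitVector([true, false, true])
y = BitVector([false, false, true])
z = [true, true, false]            # a plain Vector{Bool}

out_same  = allocate_output(Bool, 3, x, y)
out_mixed = allocate_output(Bool, 3, x, z)
println(typeof(out_same))
println(typeof(out_mixed))
```

Two BitVector inputs yield a BitVector, and mixed inputs fall back to the generic allocation. A real promotion mechanism would also need to settle argument order and more than two inputs, which this sketch ignores.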

It's not clear whether other array types could benefit from such a system (note that CategoricalArray doesn't have the same problem, since the CategoricalValue type is enough to choose the container type). For BitArray inputs, it would also make sense to return a BitArray (if Bools are returned), but that doesn't sound as essential as for PooledArray. Maybe for arrays that are stored on disk, calling similar would allow the new array to be backed by disk too? That would make it easy to work with out-of-memory DataFrames.
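
For the BitArray case, Base's similar already implements the "only if Bools are returned" condition, as this small check suggests (variable names are illustrative):

```julia
bits = BitVector([true, false, true, true])

# Asking for Bool elements keeps the compact bit-packed container...
same_kind = similar(bits, Bool, 2)

# ...while any other element type falls back to an ordinary Array.
other_kind = similar(bits, Int, 2)

println(typeof(same_kind))
println(typeof(other_kind))
```

Whether a disk-backed array type would behave analogously depends entirely on how that type defines similar, so the out-of-memory scenario remains an open question rather than something this check demonstrates.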
