Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return array type for combine/transform/select #2569

Open
nalimilan opened this issue Dec 1, 2020 · 1 comment
Open

Return array type for combine/transform/select #2569

nalimilan opened this issue Dec 1, 2020 · 1 comment
Labels
Milestone

Comments

@nalimilan
Copy link
Member

Currently combine(gd, :x => maximum) returns a PooledArray if the input is a PooledArray, but combine(gd, :x => (x -> maximum(x))) returns an Array. We should make this consistent. (combine(gd, :x => sum) is also very slow but it's not a very common use case and it could be fixed internally; see #2564 (comment)).

For operations like first, last, maximum and minimum, returning a PooledArray is essential for performance. In general it makes sense to preserve PooledArray since operations on them are likely to give a small number of unique values (just like map(f, ::PooledArray)). As another data point, reduce preserves the type too. This is easy to achieve by calling similar on the input column.

Things are more tricky for operations on multiple columns. In the case of [:x, :y] => coalesce, returning a PooledArray when both inputs are PooledArrays sounds essential. In general a good rule could be that if the two inputs are PooledArrays then the output should also be a PooledArray. This seems to call for a more general array promotion mechanism (JuliaLang/julia#18472), but we could start with a simple rule: if the two inputs have the same container type, call similar on the first one, otherwise call Tables.allocatecolumn.

It's not clear whether other array types could benefit from such a system (note that CategoricalArray doesn't have the same problem since the CategoricalValue type is enough to choose the container type). For BitArray inputs, it would also make sense to return a BitArray (if Bools are returned), but that doesn't sounds as essential as for PooledArray. Maybe for arrays that would be stored on disk, calling similar would allow them to back the new array on disk too? That would allow to easily work with out-of-memory DataFrames.

@bkamins bkamins added this to the 1.0 milestone Dec 1, 2020
@bkamins bkamins mentioned this issue Mar 4, 2021
19 tasks
@bkamins
Copy link
Member

bkamins commented Mar 27, 2021

I would treat return type of the reduction as an implementation detail. I am marking it 1.x release, as it seems it is non-essential to have it for 1.0 release.

@nalimilan - if you feel otherwise please comment and we can go back to it.

@bkamins bkamins modified the milestones: 1.0, 1.x Mar 27, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants