Currently combine(gd, :x => maximum) returns a PooledArray if the input is a PooledArray, but combine(gd, :x => (x -> maximum(x))) returns an Array. We should make this consistent. (combine(gd, :x => sum) is also very slow but it's not a very common use case and it could be fixed internally; see #2564 (comment)).
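A minimal sketch for checking the inconsistency described above. The data frame, the grouping column `g`, and the output name `out` are made up for illustration, and the container types you see may depend on the DataFrames.jl version installed:

```julia
using DataFrames, PooledArrays

# Toy data: a pooled column `x` split into two groups by `g`.
df = DataFrame(g = [1, 1, 2, 2], x = PooledArray(["a", "b", "c", "d"]))
gd = groupby(df, :g)

# The same reduction written two ways.
direct = combine(gd, :x => maximum => :out)
wrapped = combine(gd, :x => (x -> maximum(x)) => :out)

# The values agree; only the container type of the output column may differ.
@show typeof(direct.out) typeof(wrapped.out)
```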
For operations like first, last, maximum and minimum, returning a PooledArray is essential for performance. In general it makes sense to preserve PooledArray since operations on them are likely to give a small number of unique values (just like map(f, ::PooledArray)). As another data point, reduce preserves the type too. This is easy to achieve by calling similar on the input column.
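As a rough illustration of the `similar` idea for a single column (this is not how `combine` is implemented; `group_rows` is a hypothetical stand-in for the row indices of each group):

```julia
using PooledArrays

x = PooledArray(["a", "b", "a", "c"])
group_rows = [1:2, 3:4]   # hypothetical row indices of two groups

# Allocate the output from the input column so the container type is kept.
out = similar(x, length(group_rows))
for (i, rows) in enumerate(group_rows)
    out[i] = maximum(view(x, rows))
end
```

Because `out` comes from `similar(x, ...)`, it is a `PooledArray` whenever `x` is one, without the reduction code having to know about pooling.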
Things are trickier for operations on multiple columns. In the case of [:x, :y] => coalesce, returning a PooledArray when both inputs are PooledArrays seems essential. In general, a good rule could be that if both inputs are PooledArrays then the output should also be a PooledArray. This seems to call for a more general array promotion mechanism (JuliaLang/julia#18472), but we could start with a simple rule: if the two inputs have the same container type, call similar on the first one; otherwise call Tables.allocatecolumn.
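The simple rule could be sketched roughly as follows. `allocate_output` is a hypothetical helper, not part of DataFrames.jl, and comparing `Base.typename` (an unexported Base function) is only one possible reading of "same container type":

```julia
using PooledArrays, Tables

# Hypothetical helper: choose the output container for a two-column operation.
function allocate_output(::Type{T}, n::Integer,
                         x::AbstractVector, y::AbstractVector) where {T}
    # Assumption: "same container type" means the same type name,
    # ignoring type parameters such as the element type.
    if Base.typename(typeof(x)) === Base.typename(typeof(y))
        return similar(x, T, n)
    else
        return Tables.allocatecolumn(T, n)
    end
end

px = PooledArray(["a", "b"])
py = PooledArray(["c", "d"])

pooled = allocate_output(String, 2, px, py)           # via similar(px, ...)
plain = allocate_output(String, 2, px, ["c", "d"])    # via Tables.allocatecolumn
```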
It's not clear whether other array types could benefit from such a system (note that CategoricalArray doesn't have the same problem since the CategoricalValue type is enough to choose the container type). For BitArray inputs, it would also make sense to return a BitArray (if Bools are returned), but that doesn't sound as essential as for PooledArray. Maybe for arrays stored on disk, calling similar would allow them to back the new array on disk too? That would make it easy to work with out-of-memory DataFrames.
I would treat the return type of the reduction as an implementation detail. I am marking this for the 1.x release, as it does not seem essential for the 1.0 release.
@nalimilan - if you feel otherwise please comment and we can go back to it.