Unify eachcol and columns functions #1590
Conversation
I had to add […]

This is still WIP as I have to think if we want to make […]

This should be good to have a look at now. I decided to leave the types only iterators (not […]). Two notes: […]

I have fixed Julia 0.7 issues with broadcasting in tests.
src/abstractdataframe/iteration.jl (Outdated)

  """
- struct DFColumnIterator{T <: AbstractDataFrame}
+ struct DFColumnIterator{T<:AbstractDataFrame, C}
Instead of C being a Boolean, it could be either AbstractVector or Tuple{Symbol,AbstractVector}. That way you could do <: AbstractArray{C} here, and you wouldn't need to define eltype and such below.

BTW, it's a good occasion to move from a tuple to a pair. That shouldn't affect most uses but it's more natural here.
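For illustration, a rough sketch of that idea (the struct name is made up for this example, not the actual implementation, and column access is written in the itr.df[j] style used in the diff above):

using DataFrames

# Hypothetical sketch: carry the element type C in the signature so that
# eltype, length and friends come from the AbstractVector{C} supertype.
struct ColumnIteratorSketch{T<:AbstractDataFrame, C} <: AbstractVector{C}
    df::T
end

Base.size(itr::ColumnIteratorSketch) = (size(itr.df, 2),)

# with C a Pair type, indexing could yield name => column pairs:
Base.getindex(itr::ColumnIteratorSketch{<:AbstractDataFrame, <:Pair}, j::Int) =
    names(itr.df)[j] => itr.df[j]

# with C == AbstractVector, it could yield the bare columns:
Base.getindex(itr::ColumnIteratorSketch{<:AbstractDataFrame, <:AbstractVector}, j::Int) =
    itr.df[j]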
That said, if we keep the special behavior of map which returns a DataFrame, maybe we'd better not make this iterator an AbstractVector. Same for the row iterator if we want to do something similar in the future.
@nalimilan Thank you for the comments. They touch on things I was not sure about myself.

Actually the crucial design issue is whether or not to make them AbstractVector subtypes. This is tempting, and then they should carry the eltype in the signature (actually - interestingly - this is the only way it is possible to implement it AFAIK) and the design would be simplified. If we went for this I was thinking about renaming the types to DFRowVector and DFColumnVector to better convey what we produce.

But we have this special map behavior that produces a DataFrame from eachcol. If we went for AbstractVector subtyping I would remove this functionality. If we feel that we like this functionality I would prefer to create a new specialized method for it, e.g. with the signature colmap(::Union{Type, Function}, ::AbstractDataFrame), and not overload map with a method that breaks its intuitive contract.

Also, I do not see much sense in doing the same for eachrow (we would not know how to transform it to a DataFrame later).

In summary, having thought of it I would:
- make both types AbstractVector subtypes
- deprecate map for eachcol
- add a colmap function that does what map for eachcol now does (a rough sketch follows below)

When we agree if this is the way to go (or decide something else) I will fix this PR.
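A minimal sketch of the proposed colmap (the name and signature come from the proposal above; this is not an existing function, and column access is written in the df[name] style used in this era of the API):

using DataFrames

# Hypothetical colmap: apply f to every column and rebuild a DataFrame,
# keeping the original column names - i.e. what map over eachcol does today.
function colmap(f::Union{Type, Function}, df::AbstractDataFrame)
    res = DataFrame()
    for name in names(df)
        res[name] = f(df[name])
    end
    return res
end

# e.g. colmap(col -> col .^ 2, DataFrame(a = 1:3, b = 4:6))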
OK, so we need to choose between these two solutions:
- Define map(f, columns(df)) to return a data frame. We could add map(f, rows(df)) later if we want and that would be consistent. People have to use comprehensions to get a vector instead.
- Keep the standard behavior of map(f, columns(df)), i.e. have it return a vector. This is basically colwise, but with a more standard name (and we could deprecate it; the broadcasting form f.(columns(df)) is even more compact). We need another way to apply a function to each column of a data frame and get a data frame back: maybe mapcols(f, df), and mapcols!(f, df) for the in-place variant (the naming would be consistent with other column-wise functions, cf. #1514, Review row vs. column orientation of API). Then map(f, df) would apply a function to each row (this is what JuliaDB supports). See the short example just below this list.
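For illustration, the vector-returning forms mentioned in option 2 would look like this (a sketch, assuming columns(df) iterates the bare columns and map keeps its standard contract):

using DataFrames

df = DataFrame(a = 1:3, b = 4:6)

# all of these return a plain Vector, one element per column:
map(sum, columns(df))               # option 2 behavior of map
sum.(columns(df))                   # the broadcasting form mentioned above
[sum(col) for col in columns(df)]   # the comprehension form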
> Also, I do not see much sense in doing the same for eachrow (we would not know how to transform it to a DataFrame later).

That would be quite simple AFAICT: require the user-provided function to return a named tuple, and fill a data frame with that like we now do in map(::Function, ::GroupedDataFrame).
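For example, something along these lines (a sketch of the idea only, reusing the generator-to-DataFrame construction mentioned later in this thread, not the actual GroupedDataFrame code):

using DataFrames

df = DataFrame(x = 1:3, y = 4:6)

# user-provided function returning a named tuple per row
f(row) = (s = row[:x] + row[:y], d = row[:x] - row[:y])

# fill a data frame from the named tuples, one output row per input row
DataFrame(f(row) for row in eachrow(df))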
Cc: @piever to discuss the consistency with JuliaDB.
For me to understand: there are ColumnIterator and RowIterator types, and the decision is whether one should overload map on them to return a DataFrame rather than a vector of columns or a vector of rows?

In JuliaDB map acts by default on the rows and returns a table. We also have a map_rows which basically would correspond to map_rows(f, iter...) = DataFrame(f(i) for i in iter...). It is useful when f returns NamedTuples but iter... are not a DataFrame (they could be vectors or iterators), for example map_rows((x, y) -> (s = x+y, d = x-y), rand(100), rand(100)). Could this be added to DataFrames as well? In which case I think mapcols / map_rows is a strange pair. What could be a better pair of names?

For consistency with JuliaDB, option 2 seems much better (also, to keep map and filter consistent). I thought there was a colwise or similar in JuliaDB but apparently I'm misremembering. It looks like currently one has to do:
julia> t = table((a = rand(100), b = randn(100)));
julia> f(v) = partialsort(v, 1:10)
f (generic function with 1 method)
julia> table(map(f, columns(t)))
Table with 10 rows, 2 columns:
a           b
────────────────────
0.00392291  -2.52979
0.0221383   -2.01489
0.029867    -1.92054
0.0462468   -1.89799
0.0720458   -1.5951
0.0824684   -1.31414
0.105048    -1.25503
0.108803    -1.19262
0.1204      -1.13395
0.136053    -1.13125
This I think is not so bad, but maybe it'd be more awkward in DataFrames because DataFrame(map(f, columns(t))) would lose the column names?
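For illustration, one way to keep the names on the DataFrames side would be to rebuild the frame from the transformed columns plus names(df) (a sketch, assuming the two-argument DataFrame(columns, names) constructor and the df[name] column access of this era):

using DataFrames

df = DataFrame(a = rand(100), b = randn(100))
f(v) = partialsort(v, 1:10)

# transform each column, then reattach the original names
cols = [f(df[name]) for name in names(df)]
DataFrame(cols, names(df))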
> In JuliaDB map acts by default on the rows and returns a table. We also have a map_rows which basically would correspond to map_rows(f, iter...) = DataFrame(f(i) for i in iter...). It is useful when f returns NamedTuples but iter... are not a DataFrame (they could be vectors or iterators), for example map_rows((x, y) -> (s = x+y, d = x-y), rand(100), rand(100)). Could this be added to DataFrames as well? In which case I think mapcols / map_rows is a strange pair. What could be a better pair of names?

I don't really see the advantage of providing a map_rows function if DataFrame(f(i) for i in iter...) does the same thing TBH. And using a generator like that sounds more Julian (it's like creating dicts).
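A small usage example of that generator idiom (illustrative values only):

using DataFrames

x, y = rand(100), rand(100)

# the generator plays the role a map_rows function would have played
DataFrame((s = a + b, d = a - b) for (a, b) in zip(x, y))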
> For consistency with JuliaDB, option 2 seems much better (also, to keep map and filter consistent).

OK. That's good since that's the approach @bkamins has implemented. ;-)

But the example you show goes a bit against this statement AFAICT: map(f, columns(t)) returns a NamedTuple in JuliaDB, not a vector (which is why names are lost). That's logical since columns(t) gives a NamedTuple, but that would be different from data frames. That said, it's not a big deal. Maybe JuliaDB could support mapcols just for consistency and convenience?
Yep, maybe JuliaDB could deprecate map_rows as "unnecessary". The easiest is probably to add a mapcols(f, t; select = All()) method to JuliaDB. As I mentioned above, I'm afraid this creates a bit of cacophony on the JuliaDB side (mapcols and map_rows have different underscore styles and are not related in the obvious way). The cleanest is probably to deprecate (or at least not export) map_rows and rename setcol, popcol, pushcol ... to setcols, popcols, pushcols (as they accept multiple columns at once) and add mapcols to the collections. But this PR shouldn't worry about that, as all the changes to make things consistent should come from JuliaDB.
Makes sense. It's good to anticipate what API we both want to converge on in the long term.
Having thought about it, I will go on with what I have suggested and then put it under review.

OK - in 52e9010 you have what had to be changed to better support this intermediate deprecation.
Branch force-pushed from b33456e to e42b8d3.
@nalimilan Having tried what you propose, I now have a basic example where […]. So - in my opinion - we have to be breaking here, unless you have some better recommendation.

Hmm... I guess we can also override […]
src/abstractdataframe/iteration.jl (Outdated)

Base.size(itr::DataFrameColumns) = (size(itr.df, 2),)
Base.IndexStyle(::Type{<:DataFrameColumns}) = Base.IndexLinear()

function Base.getindex(itr::DataFrameColumns{<:AbstractDataFrame,
@propagate_inbounds wouldn't hurt here and below.
AFAIK it will not change anything significantly, as we do bounds checking anyway when we try to access column number j from df (note that df is an AbstractDataFrame, so we do not know at this point how it stores data).
Yes, but propagating this is better anyway, and the data frame implementation could use @boundscheck at some point.
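For reference, a toy illustration of what the annotation does (not DataFrames code; ColWrapper is a made-up type just to show the propagation):

using Base: @propagate_inbounds

# a minimal wrapper around a vector of columns
struct ColWrapper
    cols::Vector{Vector{Float64}}
end

# with @propagate_inbounds, an @inbounds at the call site is propagated into
# the body, so the inner w.cols[j] access can skip its own bounds check
@propagate_inbounds Base.getindex(w::ColWrapper, j::Int) = w.cols[j]

w = ColWrapper([rand(3), rand(3)])
@inbounds w[2]   # caller promises j is in bounds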
I have opened a separate issue #1594 for this, as this PR is complex enough and I am not sure if it will help anything (performance-wise, disabling bounds checking saves us nanoseconds, but then we have a dynamic dispatch on the result, which is orders of magnitude more expensive).
I'm not saying it matters, but it's cheap to add, so...
OK - I will add it, but I will leave the issue open for further analysis.
I will override […]. In summary: after we merge this, what other PRs would you want to finish before tagging a release? After you tag a release I will immediately submit PRs (probably it is better to have several for different types of deprecations) cleaning up all this mess.

Note that […]
@@ -138,11 +138,17 @@ function Base.iterate(itr::DataFrameColumns{<:AbstractDataFrame,
    return (_names(itr.df)[j] => itr.df[j], j + 1)
end

# TODO: remove this after deprecation period of getindex of DataFrameColumns
Why not put this in deprecated.jl?
It will be simpler for me to clean it up later (and I assume I will do it just after the coming release). deprecated.jl has 62 kB and almost 2000 LOC now, and it is actually easy to forget to fix something there.
I'd like to merge #1591 and finish a PR to speed up grouping on categorical arrays. Then yes, it would be nice to clean up all this mess.
test/grouping.jl (Outdated)

# f4(df) = [maximum(df[:c]), minimum(df[:c])]
# f5(df) = reshape([maximum(df[:c]), minimum(df[:c])], 2, 1)
# f6(df) = [maximum(df[:c]) minimum(df[:c])]
f1(df) = DataFrame(cmax = maximum(df[:c]))
Isn't this backwards?
Yes - but I thought this is what you wanted me to do when you said that this will be handled in a separate PR. I will revert it. Please hold off on reviewing for several minutes - I am pushing changes following your earlier comments and will let you know when I am finished.
@nalimilan - thank you for your patience with this PR. I believe that having first-class views of columns and rows of a […]. I have additionally opened #1595 to discuss the future of […]
You're welcome!
If there are no additional comments on this PR I will merge it tomorrow.
Implementing what was discussed in #1586.