Unify eachcol and columns functions #1590
Instead of `C` being a Boolean, it could be either `AbstractVector` or `Tuple{Symbol,AbstractVector}`. That way you could do `<: AbstractArray{C}` here, and you wouldn't need to define `eltype` and such below.

BTW, it's a good occasion to move from a tuple to a pair. That shouldn't affect most uses but it's more natural here.
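To illustrate the tuple-to-pair suggestion, here is a small sketch using plain Base Julia (the element type is purely illustrative, not the actual iterator implementation):

```julia
# Sketch: iterating `Symbol => AbstractVector` pairs instead of tuples.
# A Pair still destructures like a 2-tuple, but also supports the
# idiomatic `first`/`last` accessors and reads naturally as name => column.
cols = [:a => [1, 2, 3], :b => [4.0, 5.0, 6.0]]

for (name, col) in cols
    println(name, " => ", length(col), " elements")
end

first(cols[1])  # :a        -- the column name
last(cols[1])   # [1, 2, 3] -- the column data
```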
That said, if we keep the special behavior of `map` which returns a `DataFrame`, maybe we'd better not make this iterator an `AbstractVector`. Same for the row iterator if we want to do something similar in the future.
@nalimilan Thank you for the comments. They touch on things I was not sure about myself.

Actually the crucial design issue is whether to make them `AbstractVector` subtypes or not. This is tempting, and then they should carry the `eltype` in the signature (actually, interestingly, this is the only way it is possible to implement it AFAIK) and the design would be simplified. If we went for this I was thinking about renaming the types to `DFRowVector` and `DFColumnVector` to better sell the idea of what we produce.

But we have this special `map` behavior that produces a `DataFrame` from `eachcol`. If we went for `AbstractVector` subtyping I would remove this functionality. If we feel that we like its functionality I would prefer to create a new specialized method for it, e.g. having the following signature `colmap(::Union{Type, Function}, ::AbstractDataFrame)`, and not overload `map` with a method that breaks its intuitive contract.

Also, I do not see much sense in doing the same functionality on `eachrow` (we would not know how to transform it to a `DataFrame` later).

In summary, having thought of it I would:

- go for the `AbstractVector` subtype
- remove the special `map` for `eachcol`
- add a `colmap` function that does what `map` for `eachcol` now does

When we agree if this is the way to go (or decide something else) I will fix this PR.
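A minimal sketch of what the proposed `colmap` could do, using a `NamedTuple` of vectors as a stand-in for an `AbstractDataFrame` (both the `colmap` name and its behavior here are just the proposal above, not an existing API):

```julia
# Hypothetical `colmap`: apply `f` to every column and rebuild a
# table-like object, preserving the column names.
colmap(f, table::NamedTuple) = map(f, table)  # map over a NamedTuple keeps names

tbl = (a = [1, 2, 3], b = [10.0, 20.0, 30.0])
colmap(sum, tbl)          # (a = 6, b = 60.0)
colmap(c -> 2 .* c, tbl)  # (a = [2, 4, 6], b = [20.0, 40.0, 60.0])
```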
OK, so we need to choose between these two solutions:

1. Keep the special `map(f, columns(df))` to return a data frame. We could add `map(f, rows(df))` later if we want and that would be consistent. People have to use comprehensions to get a vector instead.
2. Remove the special behavior of `map(f, columns(df))`, i.e. have it return a vector. This is basically `colwise`, but with a more standard name (and we could deprecate it, the broadcasting form `f.(columns(df))` is even more compact). We need another way to apply a function to each column of a data frame and get a data frame back: maybe `mapcols(f, df)`, and `mapcols!(f, df)` for the in-place variant (the naming would be consistent with other column-wise functions, cf. Review row vs. column orientation of API #1514). Then `map(f, df)` would apply a function to each row (this is what JuliaDB supports).

That would be quite simple AFAICT: require the user-provided function to return a named tuple, and fill a data frame with that like we now do in `map(::Function, ::GroupedDataFrame)`.

Cc: @piever to discuss the consistency with JuliaDB.
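The "require the function to return a named tuple, then fill the columns" approach can be sketched without any DataFrames machinery; `rowmap` below is a hypothetical helper, with plain vectors standing in for the resulting columns:

```julia
# Hypothetical sketch: apply `f` to each row (a NamedTuple), then
# assemble the returned NamedTuples back into columns.
function rowmap(f, rows)
    out = [f(r) for r in rows]          # one NamedTuple per row
    names = keys(first(out))            # column names from the first result
    return NamedTuple{names}(map(n -> [nt[n] for nt in out], names))
end

rows = [(x = 1, y = 10), (x = 2, y = 20), (x = 3, y = 30)]
rowmap(r -> (s = r.x + r.y, d = r.y - r.x), rows)
# (s = [11, 22, 33], d = [9, 18, 27])
```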
For me to understand: there are `ColumnIterator` and `RowIterator` types, and the decision is whether one should overload `map` on them to return a `DataFrame` rather than a vector of columns or a vector of rows?

In JuliaDB `map` acts by default on the rows and returns a `table`. We also have a `map_rows` which basically would correspond to `map_rows(f, iter...) = DataFrame(f(i) for i in iter...)`. It is useful when `f` returns `NamedTuple`s but `iter...` are not a `DataFrame` (they could be vectors or iterators), for example `map_rows((x, y) -> (s = x+y, d = x-y), rand(100), rand(100))`. Could this be added to `DataFrames` as well? In which case I think `mapcols` / `map_rows` is a strange pair. What could be a better pair of names?

For consistency with JuliaDB, option 2 seems much better (also, to keep `map` and `filter` consistent). I thought there was a `colwise` or similar in JuliaDB but apparently I'm misremembering. It looks like currently one has to do:

This I think is not so bad but maybe it'd be more awkward in DataFrames because `DataFrame(map(f, columns(t)))` would lose the column names?
I don't really see the advantage of providing a `map_rows` function if `DataFrame(f(i) for i in iter...)` does the same thing TBH. And using a generator like that sounds more Julian (it's like creating dicts).

OK. That's good since that's the approach @bkamins has implemented. ;-)

But the example you show below goes a bit against this statement AFAICT: `map(f, columns(t))` returns a `NamedTuple` in JuliaDB, not a vector (which is why names are lost). That's logical since `columns(t)` gives a `NamedTuple`, but that would be different from data frames. That said, it's not a big deal. Maybe JuliaDB could support `mapcols` just for consistency and convenience?
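For reference, Base Julia's `map` over a `NamedTuple` returns another `NamedTuple` with the same names, which is the behaviour JuliaDB's `columns` builds on:

```julia
# `map` over a NamedTuple returns a NamedTuple with the same field names,
# so a column-wise transformation keeps the column names attached.
cols = (a = [1, 2, 3], b = [4, 5, 6])
map(sum, cols)        # (a = 6, b = 15)
map(reverse, cols)    # (a = [3, 2, 1], b = [6, 5, 4])
```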
Yep, maybe JuliaDB could deprecate `map_rows` as "unnecessary". The easiest is probably to add a `mapcols(f, t; select = All())` method to JuliaDB. As I mentioned above, I'm afraid this creates a bit of cacophony on the JuliaDB side (`mapcols` and `map_rows` have different underscore styles and are not related in the obvious way). The cleanest is probably to deprecate (or at least not export) `map_rows` and rename `setcol`, `popcol`, `pushcol`... to `setcols`, `popcols`, `pushcols` (as they accept multiple columns at once) and add `mapcols` to the collections. But this PR shouldn't worry about that, as all the changes to make things consistent should come from JuliaDB.
Makes sense. It's good to anticipate what API we both want to converge on in the long term.