Review row vs. column orientation of API #1514
Comments
I like all these changes. I think of in stata how I also like getting rid of |
In
And I've been pretty happy with it: getting a row is Is it really necessary to keep the one argument I don't have a clear view of the whole situation but I wanted to share the following:
A key difference however is that the inplace version doesn't make sense for IndexedTables. I wonder if simply a version of
|
Thanks for the comments.
One issue is that You're also right that AFAICT the absence of parameterization on the column types makes it difficult to iterate over rows of a data frame efficiently.
Interesting. That's indeed an issue I've thought about several times in DataFramesMeta. AFAIK people generally want row-wise operations, except for two particular cases: 1) reductions (like
Don't you find it a bit problematic that
Yeah, why not. But that's really not a big issue, as it can be added later.
Indeed, that's annoying. I'm not sure whether the plural or the singular is best, but at least the naming should be consistent everywhere. |
IIUC, r = Tables.rows(df)
r[i] to have fast row iteration.
There are also some other cases (say sorting or things like that) but esp. in JuliaDB they are tricky anyway because they don't generalize to the distributed case as easily. Vice versa, when one uses the vectorised version, the same thing could sometimes be done with, for example, OnlineStats in a way that parallelizes naturally and works in distributed scenarios. I'd definitely be happy to discuss this further though, I'll open an issue in JuliaDBMeta.
I think it should be
|
If we change this then |
|
Once the Tables PR is merged, I imagine I think the following is quite nice and will work with all table types:

    for v in columns(df)
        ...
    end
    for (key, val) in pairs(columns(df))
        ...
    end
    for row in rows(df)
        ...
    end

|
Well, yeah, constructing a type-stable iterator allows fast iteration over rows. But that only helps if there's a function barrier to allow for specialization. Also it's not clear whether it's reasonable to specialize on the type of all columns when there might be hundreds of them. See #1335. That's why macros like DataFramesMeta/JuliaDBMeta seem promising to me: they allow specializing only on the relevant columns, which make it easier for the compiler. |
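To make the function-barrier point concrete, here is a minimal sketch in plain Julia. It uses a NamedTuple of vectors as a stand-in for a table, not a `DataFrame`, and `rowiter` and `sum_products` are hypothetical names made up for this example. Only the two columns the computation needs are passed through the barrier, which is roughly the effect the macro-based approach aims for.

```julia
# Toy column table: a NamedTuple of equally long vectors (not a DataFrame).
cols = (a = [1, 2, 3], b = [0.5, 1.5, 2.5], c = ["x", "y", "z"])

# Hypothetical helper: lazily yield each row as a NamedTuple.
# Since the input is a NamedTuple, the row type is visible to the compiler.
rowiter(table) = (map(col -> col[i], table) for i in eachindex(first(table)))

# Function barrier: this method is compiled for the concrete type of `rows`,
# so field access inside the loop is type-stable.
function sum_products(rows)
    total = 0.0
    for row in rows
        total += row.a * row.b
    end
    return total
end

# Specialize only on the columns that are actually used.
needed = (a = cols.a, b = cols.b)
result = sum_products(rowiter(needed))
```

With hundreds of columns of mixed types, passing the full table through the barrier would ask the compiler to specialize on every column type, which is the concern raised above.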
I know they are deprecated, but now we could assign them meaning consistent with the interpretation that is chosen. Using Also Finally there are Of course we do not have to do everything in one shot. The major decision is that we treat |
I've never worked with "very large" data, so I'm not familiar with this type of issue, but I wanted to mention that Tables is quite smart in its implementation in that the row is just a lazy wrapper of the table, so you don't actually materialize it and |
The issue isn't whether data is materialized or not, but just that AFAIK the compiler bails out beyond a certain number of columns (unless they are all of the same type, i.e. an |
+1
Which ones? There was a very old |
I'm generally in favor of treating tables as collections of rows at the API level. |
I did not mean
and I think a review of them for consistency with those changes would be good. In general my thinking is to simplify DataFrames.jl API and minimize it as much as possible, as some things that were needed when this package was created probably could be removed now given the whole data wrangling ecosystem we have. |
I like it — the only thing is that maybe vars/variables should be used instead of the col/columns suffix. |
It seems that going for a collection of rows is the dominant choice. So the only leftover we would have is Finally there is a question if we want |
Yes but we could have
Good question. There's no hurry to support it, but that would indeed be consistent. That's actually one of the rare cases where operating over columns by default would probably be more useful, but we can provide other ways to do that like |
Actually if we go for row-based approach |
#1560 gets rid of the most problematic methods. We can discuss what to do about others in separate PRs. |
I just wanted to add my support for the way this is evolving! ❤️ I love the "relation is a collection of rows" style of interface. If the
I suppose if we support |
Actually with #1590 we're taking the stance that |
To my understanding we introduced |
Ok, yeah that makes sense. |
Everything seems to be complete here. The general rule we follow is that generic collection functions that operate on elements (
|
Actually JuliaDB has recently deprecated |
Should we then define CC @quinnj |
I haven't been following the discussion for quite some time - but what is the future of It seems that We have I guess the collection of implemented functions seems partially complete to me... @nalimilan would you welcome contributions here? |
We haven't really decided whether iteration (and |
Oh I see, broadcasting works over each cell. Thanks |
I'd be up for supporting something like |
The way I think about it is that But do you mean then to define the bodies of |
Yeah, I meant we could just have |
Following JuliaData/DataFrames.jl#1514 (comment). The open question is whether we want both `rename` and `rename!` in the common API (`rename` is probably more universally needed, `rename!` is applicable in DataFrames.jl but not in contexts where the table does not allow changing column names in place). I propose to have both, but please comment (otherwise `rename!` can live in DataFrames.jl only). CC @piever for syncing with JuliaDB.jl.
OK - so I am closing this issue and opened JuliaData/Tables.jl#119. |
It's been noted several times that we don't have a consistent view on whether a data frame is a collection of rows or of columns. We should decide which one it is so that all exported functions and collection functions from Base we implement operate either on rows or on columns. Then functions which operate on the other dimension should mention it explicitly in their names, e.g. with the suffix `rows` or `cols` (as in IndexedTables).

See also: #406, #1200, #1459, #1513, #1377.
Below is the list of all functions which are either row- or column-oriented. Those marked with * could be interpreted/useful both as row-oriented or column-oriented, and are therefore the most problematic (i.e. the ones to make consistent).
Row-oriented functions:

- `append!`
- *`filter`/`filter!`
- *`head` & `tail`
- *`sort`/`sort!` & `sortperm` & `issorted`
- `unique` & `nonunique`
- `completecases`
- `dropmissing`/`dropmissing!`

Column-oriented functions:

- `delete!`
- *`insert!` (inconsistent with `append!` and `push!`)
- *`merge!`
- *`haskey` & `get` (old `Associative` interface)
- *`length`
- *`getindex` (with single argument)
- *`allowmissing`/`allowmissing!`
- `disallowmissing`/`disallowmissing!`
- `categorical!`
- `names` & `rename`
- `eltypes`
- `describe`

Both row- and column-oriented functions:

- `empty!`

Functions with explicit name:

- `deleterows!` (compare with `delete!`)
- `permutecols!`
- `nrow`/`ncol`
- `hcat`/`vcat`
- `colwise`
- `eachrow`/`eachcol`
Complex:

- `by`/`groupby` -> mostly row-oriented
- `aggregate` -> mostly column-oriented (akin to `colwise`)

Overall, it seems it would be easier and more natural to get rid of column-oriented functions marked with *, and consider that a data frame is a collection of rows. Many of the problematic functions are inherited from the time when `DataFrame` implemented the `Associative` interface, which isn't very useful in practice. Others (like `insert!`) are in conflict with similar functions which operate on rows. I suggest we rename the ones which are deemed useful to add the `cols` suffix. Functions with `rows` in their names could drop that suffix. This means:

- `delete!` would become `deletecols!` (and replaced with `deleterows!` after deprecation period), `insert!` would become `insertcols!` (PR Deprecate delete!, insert! and merge! #1560)
- `length` would be deprecated in favor of `ncol`/`size(df, 2)` (length(::DataFrame) returns number of columns #1200, see also Remove nrow/ncol #406; PR Deprecate length, nrow, and ncol on DataFrames in favor of size. Fixe… #1224)
- `merge!` would be removed, as it's not that useful and it's often confused with `join` (due to the name used in R and other apps) (PR Deprecate delete!, insert! and merge! #1560)
- `haskey` and `get` would be removed (Deprecate get and haskey for AbstractDataFrame #1836)
- `empty!` could be changed to call `similar(df, 0)` (Deprecate empty! #1843 for the deprecation phase)
- `df[col]` would still be allowed since it's more convenient than `df[:, col]` (length(::DataFrame) returns number of columns #1200)

Some column-oriented functions like `names` and `rename!` could get the `cols` suffix, but I'm not sure it's worth it. More changes could be considered starting from the premise that collection functions and iteration operate on rows, but we could also continue being explicit about dimensions in other places (like `map`).

The main/only issue with viewing data frames as collections of rows is that despite being natural, it goes against the underlying representation as a vector of columns. But that's not necessarily a problem in practice as long as we provide convenient ways of applying operations to columns.
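As a rough illustration of that last point, the sketch below keeps column-oriented storage while exposing a row-oriented collection interface. `ToyTable`, `nrows`, `ncols` and `eachcolumn` are hypothetical names invented for this example; this is not how DataFrames.jl is implemented, only a small model of the convention proposed here.

```julia
# Toy table type for illustration only; not the DataFrames.jl implementation.
struct ToyTable{C<:NamedTuple}
    columns::C   # storage stays column-oriented: one vector per column
end

nrows(t::ToyTable) = length(first(t.columns))
ncols(t::ToyTable) = length(t.columns)

# Generic collection functions treat the table as a collection of rows.
Base.length(t::ToyTable) = nrows(t)
function Base.iterate(t::ToyTable, i = 1)
    i > nrows(t) && return nothing
    # Build row `i` as a NamedTuple with one field per column.
    return (map(col -> col[i], t.columns), i + 1)
end

# Column-wise access says so explicitly in its name.
eachcolumn(t::ToyTable) = values(t.columns)

t = ToyTable((x = [1, 2, 3], y = ["a", "b", "c"]))
rows = collect(t)                   # one NamedTuple per row
nbig = count(r -> r.x > 1, t)       # a Base function operating on rows
nmissing = sum(col -> count(ismissing, col), eachcolumn(t))  # explicit column pass
```

Under this convention `length(t)` is the number of rows, and anything that works column by column goes through an explicitly named accessor.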