
Review row vs. column orientation of API #1514

Closed
6 tasks done
nalimilan opened this issue Sep 18, 2018 · 35 comments

@nalimilan (Member) commented Sep 18, 2018

It's been noted several times that we don't have a consistent view on whether a data frame is a collection of rows or of columns. We should decide which one it is so that all exported functions and collection functions from Base we implement operate either on rows or on columns. Then functions which operate on the other dimension should mention it explicitly in their names, e.g. with the suffix rows or cols (as in IndexedTables).

See also: #406, #1200, #1459, #1513, #1377.

Below is the list of all functions which are either row- or column-oriented. Those marked with * could be interpreted/useful both as row-oriented or column-oriented, and are therefore the most problematic (i.e. the ones to make consistent).

Row-oriented functions:
append!*
filter/filter!*
head & tail*

sort/sort! & sortperm & issorted
unique & nonunique
completecases
dropmissing/dropmissing!

Column-oriented functions:
delete!*
insert! (inconsistent with append! and push!)*
merge!*
haskey & get (old Associative interface)*
length*
getindex (with single argument)*

allowmissing/allowmissing!
disallowmissing/disallowmissing!
categorical!
names & rename
eltypes
describe

Both row- and column-oriented functions:
empty!

Functions with explicit name:
deleterows! (compare with delete!)
permutecols!
nrow/ncol
hcat/vcat
colwise
eachrow/eachcol

Complex:
by/groupby -> mostly row-oriented
aggregate -> mostly column-oriented (akin to colwise)


Overall, it seems it would be easier and more natural to get rid of the column-oriented functions marked with *, and to consider that a data frame is a collection of rows. Many of the problematic functions are inherited from the time when DataFrame implemented the Associative interface, which isn't very useful in practice. Others (like insert!) conflict with similar functions which operate on rows. I suggest we rename the ones which are deemed useful by adding the cols suffix. Functions with rows in their names could drop that suffix. This means:

Some column-oriented functions like names and rename! could gain the cols suffix, but I'm not sure it's worth it. More changes could be considered starting from the premise that collection functions and iteration operate on rows, but we could also continue being explicit about dimensions in other places (like map).

The main/only issue with viewing data frames as collections of rows is that despite being natural, it goes against the underlying representation as a vector of columns. But that's not necessarily a problem in practice as long as we provide convenient ways of applying operations to columns.
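To make the ambiguity concrete, here is a sketch of how the same data frame answers both interpretations in the 2018-era API discussed in this issue (indexing and length behavior shifted in later DataFrames.jl releases, so treat this as illustrative rather than a fixed API reference):

```julia
using DataFrames  # old-style API as discussed in this thread

df = DataFrame(a = [1, 2, 3], b = ["x", "y", "z"])

# Row-oriented verbs: elements are rows.
filter(r -> r.a > 1, df)   # keeps the rows where a > 1
push!(df, (4, "w"))        # appends a row

# Column-oriented verbs: elements are columns.
df[:a]      # single-argument getindex returns column `a` (old API)
length(df)  # number of columns, not rows (old API)
```

The proposal above resolves this by declaring rows the default and reserving an explicit cols suffix (or an explicit wrapper like columns(df)) for column operations.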

@pdeffebach (Contributor)

I like all these changes. I think of how, in Stata, if statements are both succinct and refer to row-wise subsetting.

I also like getting rid of merge!, as it is really just an implementation of a join under certain parameters. And I have gotten confused by delete! etc. before.

@nalimilan (Member, Author) commented Sep 20, 2018

Cc: @piever @shashi with the goal of using consistent APIs in JuliaDB and DataFrames where that makes sense.

@piever commented Sep 20, 2018

In StructArrays I decided on the following (pretty much in line with IndexedTables):

  • A table is a collection of rows, meaning StructArray <: AbstractArray{<:NamedTuple}
  • Indexing selects rows
  • getproperty selects columns

And I've been pretty happy with it: getting a row is s[i], a column is s.b, and an element is either s.b[i] or s[i].b.
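The StructArrays convention described here can be sketched as follows (the keyword constructor is assumed from StructArrays.jl; column names are illustrative):

```julia
using StructArrays

s = StructArray(a = [1, 2, 3], b = ["x", "y", "z"])

s[2]     # indexing selects a row: the named tuple (a = 2, b = "y")
s.b      # getproperty selects a column: ["x", "y", "z"]
s[2].b   # an element, equivalently s.b[2]
```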

Is it really necessary to keep single-argument getindex selecting columns, given that there is getproperty? I'd definitely be very happy if a DataFrame could be thought of as an AbstractArray{<:NamedTuple} (does the weak typing play against this, though? is there a way to make row-based getindex efficient?).

I don't have a clear view of the whole situation but I wanted to share the following:

  • We need a unified way to decide whether something is for rows or cols (in JuliaDBMeta I use _vec, i.e. @transform and @where become @transform_vec and @where_vec: maybe I can change to @transformcols and @wherecols). At some point we thought of having hmerge and vmerge in JuliaDB (in analogy with hcat, vcat). merge and mergecols definitely makes sense (merge is for rows at the moment).
  • Concerning filter and getindex for columns, they all correspond to select in JuliaDB. There are a bunch of helpers for these operations. For example, select(iris, 3), select(iris, 1:4), select(iris, Not(:SepalLength)), select(iris, t -> contains(string(t), "Sepal")). Filtering is automatic for Regexes, for example select(iris, r"^Sepal"). See docs here.

A key difference however is that the inplace version doesn't make sense for IndexedTables. I wonder if simply a version of select! that replaces the columns of the DataFrame with the outcome of select would make sense.

  • The functions that explicitly manipulate columns have col in the name in JuliaDB. This is unfortunate because they can manipulate many columns at once. See renamecol(iris, :SepalLength => :SL, :SepalWidth => :SW), see here. Could we unify on the cols version?

@nalimilan (Member, Author)

Thanks for the comments.

Is it really necessary to keep single-argument getindex selecting columns, given that there is getproperty? I'd definitely be very happy if a DataFrame could be thought of as an AbstractArray{<:NamedTuple} (does the weak typing play against this, though? is there a way to make row-based getindex efficient?).

One issue is that getproperty only works to extract a single column from a literal name. Things like df[cols] and df[[:a, :b, :c]] are quite commonly needed, and people coming from other languages will expect them to work. I guess it wouldn't be the end of the world if we required df[:, cols] for that, but that wouldn't be great (also it's supposed to make a copy for consistency with other df[i, j] operations).
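The column-selection patterns in question can be sketched like this (old-style single-argument indexing; behavior varies across DataFrames.jl versions, so this is illustrative only):

```julia
using DataFrames  # indexing rules changed in later releases

df = DataFrame(a = 1:3, b = 4:6, c = 7:9)
cols = [:a, :b]

df.a          # getproperty: one column, literal name only
df[cols]      # single-argument getindex: several columns at once (old API)
df[:, cols]   # two-argument form; supposed to copy, like other df[i, j] ops
```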

You're also right that AFAICT the absence of parameterization on the column types makes it difficult to iterate over rows of a data frame efficiently.

We need a unified way to decide whether something is for rows or cols (in JuliaDBMeta I use _vec, i.e. @transform and @where become @transform_vec and @where_vec: maybe I can change to @transformcols and @wherecols). At some point we thought of having hmerge and vmerge in JuliaDB (in analogy with hcat, vcat). merge and mergecols definitely makes sense (merge is for rows at the moment).

Interesting. That's indeed an issue I've thought about several times in DataFramesMeta. AFAIK people generally want row-wise operations, except for two particular cases: 1) reductions (like :x .- mean(:x)) and 2) lagged variables. And even in these cases, it would be convenient to think in row-wise terms so that e.g. one doesn't need to use dotted operators. SQL and dplyr don't provide column-wise functions. So I wonder whether it wouldn't be possible to have only row-wise macros, with a special function or syntax to indicate that one refers to the whole column rather than to a single element (for example :x - mean($x) or :x - mean(col(:x))). Probably better to discuss this elsewhere though.

Concerning filter and getindex for columns, they all correspond to select in JuliaDB. There are a bunch of helpers for these operations. For example, select(iris, 3), select(iris, 1:4), select(iris, Not(:SepalLength)), select(iris, t -> contains(string(t), "Sepal")). Filtering is automatic for Regexes, for example select(iris, r"^Sepal"). See docs here.

filter only operates on rows in DataFrames. We don't have convenience functions to select columns, but they would be useful.

Don't you find it a bit problematic that select picks columns, but that the name doesn't indicate that?

A key difference however is that the inplace version doesn't make sense for IndexedTables. I wonder if simply a version of select! that replaces the columns of the DataFrame with the outcome of select would make sense.

Yeah, why not. But that's really not a big issue, as it can be added later.

The functions that explicitly manipulate columns have col in the name in JuliaDB. This is unfortunate because they can manipulate many columns at once. See renamecol(iris, :SepalLength => :SL, :SepalWidth => :SW), see here. Could we unify on the cols version?

Indeed, that's annoying. I'm not sure whether the plural or the singular is best, but at least the naming should be consistent everywhere.

@piever commented Sep 20, 2018

You're also right that AFAICT the absence of parameterization on the column types makes it difficult to iterate over rows of a data frame efficiently.

IIUC, row-based map and filter can be fast because they can use the Tables interface for efficient row iteration, whereas direct row-based getindex is always slow? In this case this is probably a sufficient technical difference to justify a different default from IndexedTables, and one might as well tell the user to do

r = Tables.rows(df)
r[i]

to have fast row iteration.

We need a unified way to decide whether something is for rows or cols (in JuliaDBMeta I use _vec, i.e. @transform, @where become @transform_vec, @where_vec: maybe I can change to @transformcols and @wherecols). At some point we thought of having hmerge and vmerge in JuliaDB (in analogy with hcat, vcat). merge and mergecols definitely makes sense (merge is for rows at the moment).

That's indeed an issue I've thought about several times in DataFramesMeta. AFAIK people generally want row-wise operations, except for two particular cases: 1) reductions (like :x .- mean(:x)) and 2) lagged variables.

There are also some other cases (say sorting or things like that), but especially in JuliaDB they are tricky anyway because they don't generalize to the distributed case as easily. Vice versa, sometimes when one uses the vectorised version, it could probably be done with, for example, OnlineStats in a way that parallelizes naturally and works in distributed scenarios. I'd definitely be happy to discuss this further though, I'll open an issue in JuliaDBMeta.

Don't you find it a bit problematic that select picks columns, but that the name doesn't indicate that?

select is a bit complex in JuliaDB, but I agree it's column-oriented and is one of the few column-oriented operations that doesn't make it explicit in the name. Not sure what would be a good name that emphasizes this.

Indeed, that's annoying. I'm not sure whether the plural or the singular is best, but at least the naming should be consistent everywhere.

I think it should be cols because:

  • Julia Base also seems to prefer plurals, like sum(v, dims = ...)
  • cols works pretty much everywhere whereas col is strange at times, say mergecol

@bkamins (Member) commented Sep 20, 2018

If we change this then keys and values should be cleaned up or removed, and maybe rethinking what eachrow and eachcol do/return is in order.

@nalimilan (Member, Author)

If we change this then keys and values should be cleaned up or removed, and maybe rethinking what eachrow and eachcol do/return is in order.

keys and values are deprecated already, so that's OK. OTOH I'd like to fix the asymmetry between eachrow (which returns DataFrameRow objects) and eachcol (which returns (name, vector) tuples). We could perhaps deprecate them in favor of rows and columns (or keep eachcol if that behavior is really useful).

@piever commented Sep 20, 2018

Once the Tables PR is merged, I imagine Tables.rows and Tables.columns could be exported and would provide a lot of this functionality.

I think the following is quite nice and will work with all table types:

for v in columns(df)
    ...
end
for (key, val) in pairs(columns(df))
    ...
end
for row in rows(df)
    ...
end

@nalimilan (Member, Author)

IIUC, row-based map and filter can be fast because they can use the Tables interface for efficient row iteration, whereas direct row-based getindex is always slow? In this case this is probably a sufficient technical difference to justify a different default from IndexedTables, and one might as well tell the user to do

r = Tables.rows(df)
r[i]

to have fast row iteration.

Well, yeah, constructing a type-stable iterator allows fast iteration over rows. But that only helps if there's a function barrier to allow for specialization.

Also it's not clear whether it's reasonable to specialize on the types of all columns when there might be hundreds of them. See #1335. That's why macros like DataFramesMeta/JuliaDBMeta seem promising to me: they allow specializing only on the relevant columns, which makes things easier for the compiler.
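The function-barrier pattern alluded to here can be sketched as follows (the column name x and the helper names are hypothetical, used only for illustration):

```julia
using Tables

# Outer function: `Tables.rows(df)` returns an iterator whose concrete type
# encodes the column types, even when `df` itself does not.
sum_x(df) = _sum_x(Tables.rows(df))  # this call is the function barrier

# Inner function: the compiler specializes `_sum_x` on the concrete row
# iterator type, so the loop body can be type-stable.
function _sum_x(rows)
    s = 0.0
    for r in rows
        s += r.x  # assumes a numeric column named `x`
    end
    return s
end
```

The barrier only helps if the compiler is willing to specialize, which, as noted, may break down when there are hundreds of differently-typed columns.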

@bkamins (Member) commented Sep 20, 2018

I know they are deprecated, but now we could assign them meaning consistent with the interpretation that is chosen. Using rows and columns is fine with me.

Also eachcol is not really that useful, especially when we have columns (but maybe there is some legacy code that uses it so I guess it should stay).

Finally there are map methods in DataFrames.jl which confuse users. This also could be cleaned up.

Of course we do not have to do everything in one shot. The major decision is that we treat DataFrame as a row-based object. The rest would be a consequence that can be fixed later.

@piever commented Sep 20, 2018

Well, yeah, constructing a type-stable iterator allows fast iteration over rows. But that only helps if there's a function barrier to allow for specialization.

I've never worked with "very large" data, so I'm not familiar with this type of issue, but I wanted to mention that Tables is quite smart in its implementation in that the row is just a lazy wrapper of the table, so you don't actually materialize it, and rows(df)[i].t is just another way of saying columns(df).t[i]. The other fields are untouched (JuliaDB instead would materialize the whole row unless specified otherwise, which is very bad in large tables). In this respect I think it works just as well as the macros in JuliaDBMeta that detect from the symbols which columns are used and only select those. But I agree that we need some care; compile times do get worrying in JuliaDB at times.

@nalimilan (Member, Author)

I've never worked with "very large" data, so I'm not familiar with this type of issue, but I wanted to mention that Tables is quite smart in its implementation in that the row is just a lazy wrapper of the table, so you don't actually materialize it, and rows(df)[i].t is just another way of saying columns(df).t[i]. The other fields are untouched (JuliaDB instead would materialize the whole row unless specified otherwise, which is very bad in large tables). In this respect I think it works just as well as the macros in JuliaDBMeta that detect from the symbols which columns are used and only select those. But I agree that we need some care; compile times do get worrying in JuliaDB at times.

The issue isn't whether data is materialized or not, but just that AFAIK the compiler bails out beyond a certain number of columns (unless they are all of the same type, i.e. an NTuple). But I haven't investigated this deeply. Let's continue this discussion at #1335.

@nalimilan (Member, Author)

I know they are deprecated, but now we could assign them meaning consistent with the interpretation that is chosen. Using rows and columns is fine with me.

Also eachcol is not really that useful, especially when we have columns (but maybe there is some legacy code that uses it so I guess it should stay).

+1

Finally there are map methods in DataFrames.jl which confuse users. This also could be cleaned up.

Which ones? There was a very old map method for SubDataFrame which I deprecate in my grouping PR, but I don't see methods for DataFrame.

@kleinschmidt (Contributor)

I'm generally in favor of treating tables as collections of rows at the API level.

@bkamins (Member) commented Sep 20, 2018

I did not mean DataFrame but other types in DataFrames.jl. These are the map methods in DataFrames.jl as of today:

map(f::Function, sdf::SubDataFrame)
map(f::Function, gd::GroupedDataFrame)
map(f::Function, ga::GroupApplied)
map(f::Function, dfri::DataFrames.DFRowIterator)
map(f::Function, dfci::DataFrames.DFColumnIterator)

and I think a review of them for consistency with those changes would be good.

In general my thinking is to simplify DataFrames.jl API and minimize it as much as possible, as some things that were needed when this package was created probably could be removed now given the whole data wrangling ecosystem we have.

@nalimilan (Member, Author)

Apart from the SubDataFrame one and the GroupApplied one, which I propose removing in #1520, the others look OK to me since they are completely explicit. map(f, columns(df)) and map(f, rows(df)) look really clean.

Let's discuss the GroupedDataFrame method at #1520.

@matthieugomez (Contributor) commented Sep 24, 2018

I like it — the only thing is that maybe vars/variables should be used instead of the col/columns suffix.

@bkamins (Member) commented Sep 27, 2018

It seems going for a collection of rows is the choice that dominates.

So the only leftover we would have is the df[col] syntax, which is inconsistent with this approach. A slight drawback is that we would not be able to define eachindex and enumerate.

Finally there is a question if we want DataFrame to be broadcastable over rows.

@nalimilan (Member, Author)

A slight drawback is that we would not be able to define eachindex and enumerate.

Yes but we could have eachindex(columns(df)) and enumerate(columns(df)).

Finally there is a question if we want DataFrame to be broadcastable over rows.

Good question. There's no hurry to support it, but that would indeed be consistent. That's actually one of the rare cases where operating over columns by default would probably be more useful, but we can provide other ways to do that like colwise or f.(columns(df)).

@bkamins (Member) commented Oct 3, 2018

Actually, if we go for the row-based approach, haskey could be renamed to hascolumn.

@nalimilan (Member, Author)

#1560 gets rid of the most problematic methods. We can discuss what to do about others in separate PRs.

@andyferris (Member)

I just wanted to add my support for the way this is evolving! ❤️

I love the "relation is a collection of rows" style of interface. If the xxxcols(df) suffix is tiresome then xxx(columns(df)) is pretty clear - columns(df) is a kinda interesting and useful container in its own right.

Actually, if we go for the row-based approach, haskey could be renamed to hascolumn.

I suppose if we support df.column syntax then hasproperty is natural (you don't necessarily need to ditch column-based getindex just because you support column-based properties). Pity that didn't make it to Base for Julia 1.0 though.

@nalimilan (Member, Author)

I love the "relation is a collection of rows" style of interface. If the xxxcols(df) suffix is tiresome then xxx(columns(df)) is pretty clear - columns(df) is a kinda interesting and useful container in its own right.

Actually with #1590 we're taking the stance that mapcols(f, df) is different from map(f, columns(df)), since the former returns a data frame and the latter a vector. So the xxxcols functions couldn't be replaced with xxx(columns(df)).

@bkamins (Member) commented Nov 13, 2018

To my understanding we introduced mapcols solely to ensure that we give this different result.

@andyferris (Member)

Ok, yeah that makes sense.

@nalimilan (Member, Author)

Everything seems to be complete here. The general rule we follow is that generic collection functions that operate on elements (sort, push!...) operate over rows (when they are defined). Functions that operate over columns either end with the "cols" suffix, are table-specific operations (e.g. select), or operate at the vector level rather than at the element level (describe, allowmissing...).

names/rename! are partial exceptions to this rule. FWIW, JuliaDB uses renamecol, but dplyr and Pandas use rename.
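The rule stated above can be summarized with a few examples (names come from this thread; exact signatures vary across DataFrames.jl versions, so this is a sketch rather than a reference):

```julia
using DataFrames  # illustrative; not tied to one release

df = DataFrame(a = [3, 1, 2], b = ["x", "y", "z"])

# Generic collection verbs operate on rows:
sort(df, :a)             # reorders rows
push!(df, (4, "w"))      # appends a row

# Column operations are explicit, one way or another:
mapcols(col -> col, df)  # "cols" suffix in the name
select(df, :a)           # table-specific verb
describe(df)             # vector-level operation, per column
```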

@piever commented Sep 3, 2019

FWIW, JuliaDB uses renamecol, but dplyr and Pandas use rename.

Actually JuliaDB has recently deprecated renamecol in favor of rename, so the APIs are consistent here.

@bkamins (Member) commented Sep 3, 2019

Should we then define rename in DataAPI.jl?

CC @quinnj

@andyferris (Member) commented Sep 4, 2019

I haven't been following the discussion for quite some time - but what is the future of iterate(::DataFrame)?

It seems that first(::DataFrame) returns the first row... and first is normally defined in terms of the iteration API.

We have filter(f, ::DataFrame) working as if it iterates rows. But there is no map or reduce that does the same.

I guess the collection of implemented functions seems partially complete to me... @nalimilan would you welcome contributions here?

@nalimilan (Member, Author)

We haven't really decided whether iteration (and map) should be supported. For now we wanted to stabilize the current API for 1.0. It would be consistent to define these over rows, but that may also be confusing (notably because of the difference with broadcasting), so this needs to be discussed in depth after 1.0.

@andyferris (Member)

Oh I see, broadcasting works over each cell. Thanks

@quinnj (Member) commented Sep 4, 2019

I'd be up for supporting something like Tables.rename!(tbl, :oldcol, :newcol) in Tables.jl, and any Tables.jl source that supported column renaming could use it.

@bkamins (Member) commented Sep 4, 2019

The way I think about it is that rename! is in place, so probably not all sources will support it. rename creates a new table, so this should be OK for any table.

But do you mean then to define the bodies of rename (or rename!), or just to define a function without implementing the behavior? (e.g. DataFrame will most likely have a custom rename! and rename for performance reasons.)

@quinnj (Member) commented Sep 4, 2019

Yeah, I meant we could just have function rename! end in Tables.jl and individual packages could provide the implementation.

bkamins added a commit to JuliaData/Tables.jl that referenced this issue Sep 4, 2019
Following JuliaData/DataFrames.jl#1514 (comment).

The open question is if we want both `rename` and `rename!` in the common API (`rename` is probably more universally needed, `rename!` is applicable in DataFrames.jl but not in contexts where table does not allow changing column names in-place). I propose to have both but please comment (`rename!` can also live in DataFrames.jl only otherwise)

CC @piever - for syncing with JuliaDB.jl.
@bkamins (Member) commented Sep 4, 2019

OK - so I am closing this issue and have opened JuliaData/Tables.jl#119.
