Do we need columns and eachcol #1586

bkamins · 2018-11-06T10:49:10Z

We currently have columns (not exported) function and eachcol (exported) that essentially serve the same purpose. Do we need to keep them both and what should be the direction in design here?

CC @nalimilan

The text was updated successfully, but these errors were encountered:

nalimilan · 2018-11-06T20:51:27Z

eachcol is different from columns as it yields (name, vector) tuples. But indeed I've wondered for some time whether we shouldn't deprecate eachcol and export columns. We could also keep both, it doesn't really hurt and it might be a bit faster than doing df[name] repeatedly (only significant when there are only a few rows, though).

bkamins · 2018-11-06T20:56:45Z

I am aware that they are a bit different, but I meant that eachcol covers all use cases of columns and should not be that much slower so actually columns is not needed if we keep eachcol.

For me it only hurts that it makes the design of DataFrames.jl package more complex to have two types serve almost the same purpose: Cols and DFColumnIterator (it is complex enough already which I have learned while working on getindex 😄).

nalimilan · 2018-11-06T21:19:35Z

OK. But eachcol is also inconvenient if you just want to apply a function to all columns in a data frame.

bkamins · 2018-11-06T21:25:29Z

Yes, but columns is not exported so it does not really matter (unless we decide to export it - and that is why I am asking because I am confused what is best).

bkamins · 2018-11-07T07:02:50Z

Additional points to gather the facts (the first of them was discussed earlier):

columns conflicts with Tables.columns (I know that there it is not exported, but nevertheless it might be confusing)
we have an implementation which effectively creates Cols object for SubDataFrame and accesses field columns for DataFrame - so it is not a unified interface;
I have checked the usage of columns in the package (which is enough as this function is internal) and actually there are not many places where it is used.

pdeffebach · 2018-11-07T17:38:40Z

+1 to the SubDataFrame concern. Since eachcol is an iterator, it can be used with a subdataframe easily. But columns returns a vector of vectors, making (i think) it infeasible for a subdataframe.

bkamins · 2018-11-07T19:17:46Z

@pdeffebach Note that is that currently columns(sdf) where sdf is a SubDataFrame is defined and returns object of type Cols.

pdeffebach · 2018-11-07T19:42:23Z

Ah that makes sense. Sorry for the confusion. Yeah it makes sense that Cols is a bit too one-off to be user-facing.

bkamins · 2018-11-08T18:25:15Z

OK. I know what I would do (and fortunately it is functionally non-breaking):

remove Cols type;
add an additional true/false parameter to DFColumnIterator type;
the parameter is interpreted if we return only column values or column names and column values and works as follows:
- true makes the iterator return tuples (colname, coldata) (current behavior of DFColumnIterator);
- false makes the iterator return vectors with coldata (current behavior of Cols);
eachcol(df) returns DFColumnIterator{T, true};
columns(df) returns DFColumnIterator{T, false};
we export columns function.

@nalimilan and @pdeffebach - if I have a green light on it and #1585 is merged (as it touches this code also so I want to avoid rebasing) I will implement this as a PR.

nalimilan · 2018-11-08T18:32:03Z

Makes sense.

bkamins · 2018-11-08T19:15:24Z

OK - I will implement it along with taking #1587 into account to make it consistent (for this I will wait with this PR till #1587 is merged so that I can then update the documentation in one shot).

pdeffebach · 2018-11-08T19:35:23Z

welp, i guess this opens the door to rows(df) with a similar logic. I think we can make an argument that it's not needed because we don't really have a row index, and the row number isnt that important in the scheme of things.

nalimilan · 2018-11-08T19:44:40Z

A type-stable iterator over rows of a data frame (passed as named tuples) could be useful if it allowed specializing the user-provided function. But that can be discussed separately.

bkamins · 2018-12-14T09:30:03Z

Closing as this is implemented and merged in consistency with Julia 1.1.

bkamins mentioned this issue Nov 8, 2018

Unify eachcol and columns functions #1590

Merged

nalimilan mentioned this issue Nov 30, 2018

Tables.jl uses eachcolumn and a new PR to base does too #1611

Closed

bkamins closed this as completed Dec 14, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Do we need columns and eachcol #1586

Do we need columns and eachcol #1586

bkamins commented Nov 6, 2018

nalimilan commented Nov 6, 2018

bkamins commented Nov 6, 2018

nalimilan commented Nov 6, 2018

bkamins commented Nov 6, 2018

bkamins commented Nov 7, 2018

pdeffebach commented Nov 7, 2018

bkamins commented Nov 7, 2018

pdeffebach commented Nov 7, 2018

bkamins commented Nov 8, 2018

nalimilan commented Nov 8, 2018

bkamins commented Nov 8, 2018

pdeffebach commented Nov 8, 2018

nalimilan commented Nov 8, 2018

bkamins commented Dec 14, 2018

Do we need columns and eachcol #1586

Do we need columns and eachcol #1586

Comments

bkamins commented Nov 6, 2018

nalimilan commented Nov 6, 2018

bkamins commented Nov 6, 2018

nalimilan commented Nov 6, 2018

bkamins commented Nov 6, 2018

bkamins commented Nov 7, 2018

pdeffebach commented Nov 7, 2018

bkamins commented Nov 7, 2018

pdeffebach commented Nov 7, 2018

bkamins commented Nov 8, 2018

nalimilan commented Nov 8, 2018

bkamins commented Nov 8, 2018

pdeffebach commented Nov 8, 2018

nalimilan commented Nov 8, 2018

bkamins commented Dec 14, 2018