Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do we need columns and eachcol #1586

Closed
bkamins opened this issue Nov 6, 2018 · 14 comments
Closed

Do we need columns and eachcol #1586

bkamins opened this issue Nov 6, 2018 · 14 comments

Comments

@bkamins
Copy link
Member

bkamins commented Nov 6, 2018

We currently have columns (not exported) function and eachcol (exported) that essentially serve the same purpose. Do we need to keep them both and what should be the direction in design here?

CC @nalimilan

@nalimilan
Copy link
Member

eachcol is different from columns as it yields (name, vector) tuples. But indeed I've wondered for some time whether we shouldn't deprecate eachcol and export columns. We could also keep both, it doesn't really hurt and it might be a bit faster than doing df[name] repeatedly (only significant when there are only a few rows, though).

@bkamins
Copy link
Member Author

bkamins commented Nov 6, 2018

I am aware that they are a bit different, but I meant that eachcol covers all use cases of columns and should not be that much slower so actually columns is not needed if we keep eachcol.

For me it only hurts that it makes the design of DataFrames.jl package more complex to have two types serve almost the same purpose: Cols and DFColumnIterator (it is complex enough already which I have learned while working on getindex 😄).

@nalimilan
Copy link
Member

OK. But eachcol is also inconvenient if you just want to apply a function to all columns in a data frame.

@bkamins
Copy link
Member Author

bkamins commented Nov 6, 2018

Yes, but columns is not exported so it does not really matter (unless we decide to export it - and that is why I am asking because I am confused what is best).

@bkamins
Copy link
Member Author

bkamins commented Nov 7, 2018

Additional points to gather the facts (the first of them was discussed earlier):

  • columns conflicts with Tables.columns (I know that there it is not exported, but nevertheless it might be confusing)
  • we have an implementation which effectively creates Cols object for SubDataFrame and accesses field columns for DataFrame - so it is not a unified interface;
  • I have checked the usage of columns in the package (which is enough as this function is internal) and actually there are not many places where it is used.

@pdeffebach
Copy link
Contributor

+1 to the SubDataFrame concern. Since eachcol is an iterator, it can be used with a subdataframe easily. But columns returns a vector of vectors, making (i think) it infeasible for a subdataframe.

@bkamins
Copy link
Member Author

bkamins commented Nov 7, 2018

@pdeffebach Note that is that currently columns(sdf) where sdf is a SubDataFrame is defined and returns object of type Cols.

@pdeffebach
Copy link
Contributor

Ah that makes sense. Sorry for the confusion. Yeah it makes sense that Cols is a bit too one-off to be user-facing.

@bkamins
Copy link
Member Author

bkamins commented Nov 8, 2018

OK. I know what I would do (and fortunately it is functionally non-breaking):

  • remove Cols type;
  • add an additional true/false parameter to DFColumnIterator type;
  • the parameter is interpreted if we return only column values or column names and column values and works as follows:
    • true makes the iterator return tuples (colname, coldata) (current behavior of DFColumnIterator);
    • false makes the iterator return vectors with coldata (current behavior of Cols);
  • eachcol(df) returns DFColumnIterator{T, true};
  • columns(df) returns DFColumnIterator{T, false};
  • we export columns function.

@nalimilan and @pdeffebach - if I have a green light on it and #1585 is merged (as it touches this code also so I want to avoid rebasing) I will implement this as a PR.

@nalimilan
Copy link
Member

Makes sense.

@bkamins
Copy link
Member Author

bkamins commented Nov 8, 2018

OK - I will implement it along with taking #1587 into account to make it consistent (for this I will wait with this PR till #1587 is merged so that I can then update the documentation in one shot).

@pdeffebach
Copy link
Contributor

welp, i guess this opens the door to rows(df) with a similar logic. I think we can make an argument that it's not needed because we don't really have a row index, and the row number isnt that important in the scheme of things.

@nalimilan
Copy link
Member

A type-stable iterator over rows of a data frame (passed as named tuples) could be useful if it allowed specializing the user-provided function. But that can be discussed separately.

@bkamins
Copy link
Member Author

bkamins commented Dec 14, 2018

Closing as this is implemented and merged in consistency with Julia 1.1.

@bkamins bkamins closed this as completed Dec 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants