Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add transformation and renaming to select and select! #2080

Merged
merged 51 commits into from
Mar 19, 2020

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Jan 6, 2020

This is a first step in implementation of the discussion in #1975.

The functionality is limited, but I first wanted to make sure that on this level we are in agreement. If this is so I will add tests to this PR. After this is merged we can talk about adding more functionalities in separate PRs (even this new functionality is complex enough that writing proper tests for it will be a challenge, so I want to move forward slowly).

Also even with what I propose here a lot can be achieved, actually what is missing is only:

  • Col wrapper for whole-column functions (we did not have an agreement about the name)
  • automatic generation of column names if they are missing (here we first have to set a common rules for all functions in DataFrames.jl) (actually I have done it for the case of one column, but just forgotten that it is there 😄, we still have to decide on column name generation for multiple columns)
  • passing several columns to a function (this would be nice but I think it can wait as it is not a supper common case)
  • nesting transformations in vectors and broadcasting of transformations (it would be nice to have it, but first we have to resolve broadcasting for Not)

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for starting this! I've made a few high-level comments.

If you have the time, it would make sense IMO to also implement support for several input columns in this PR. The structure of the code will probably be affected so better get it right from the start.

src/abstractdataframe/selection.jl Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Show resolved Hide resolved
@bkamins
Copy link
Member Author

bkamins commented Jan 8, 2020

implement support for several input columns in this PR.

I can add it (I have tried to design the code to make such additions easy - normalize_selection has exactly the purpose to make it simple).

Let us just agree on one thing. In the case of multiple columns passed do we want to pass NamedTuples of values in rows to the function? So essentially the operation would be:

fun.(getfield.(Tables.rows(Tables.columntable(df[!, seected_columns])), :columns))

(this should be efficient)

or we want to pass DataFrameRows:

fun.(eachrow(df[!, seected_columns]))

(this avoids recompilation each time we make a selection)

The API except for the type is essentially the same with the difference that we pass a different typee and the latter potentially allows the function that takes DataFrameRow to potentially mutate source data frame.

What do you think?

@bkamins
Copy link
Member Author

bkamins commented Jan 8, 2020

Thank you for the comments. For now we have the following key open issues:

  • if we want to keep transform or leave it out for now
  • if we pass multiple columns for transformations if we should pass NamedTuple or DataFrameRow to the function
  • how to generate column name if we pass multiple columns

I leave Col wrapper to make passing whole-columns for later. However it might guide NamedTuple or DataFrameRow decision. Because in Col wrapper the equivalent decision is if we pass a NamedTuple of abstract vectors vs. passing a SubDataFrame.

(also probably you have an opinion given the design in combine 😄)

@nalimilan
Copy link
Member

I think we should pass a NamedTuple, just like in combine. Otherwise things are going to be quite slow. Though at some point we could support an argument or a syntax to use DataFrameRow.

Regarding the generation of column names, I guess it could make sense to list the names of all input variables up to a threshold (e.g. 3)? One tricky case is when the input columns are selected via : or a regex (or another kind of selection rule): is that a good idea to use the name of the columns, since they are relatively hard to find for the caller?

I don't have a strong opinion about transform, but I tend to prefer keeping PRs as small as possible. :-)

@bkamins bkamins changed the title Add transformation and renaming to select; define transform and transform! Add transformation and renaming to select Jan 9, 2020
@bkamins
Copy link
Member Author

bkamins commented Jan 9, 2020

OK - I have made all the changes:

  • removed transform
  • added auto-generation of column names (please see what I proposed - if we agree to this I would use the same in combine in the future)
  • started using NamedTuple in multi-column selection (but I have a small question there - I have added a separate question there)

@bkamins
Copy link
Member Author

bkamins commented Jan 10, 2020

@nalimilan - I think I have addressed all the design issues in this PR. The only question is if you are OK with multiple-column automatic naming scheme I have proposed in https://github.com/JuliaData/DataFrames.jl/pull/2080/files#diff-ac2eb247bb3d79f652033279061a1ceaR40.

When we confirm we are OK with the design I will add tests of the new functionality (I keep things pending in TODO at the top of the file). Essentially for the two other PRs the things that will be left to do are Col wrapper and transform functions.

src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
bkamins and others added 2 commits January 10, 2020 19:21
Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Copy link
Member Author

bkamins commented Jan 10, 2020

Yeah - this PR is not easy (and this is what I have expected). Please let me know when you will be clear with the design decisions I proposed and I will then start writing tests.

src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
@bkamins
Copy link
Member Author

bkamins commented Jan 13, 2020

Just to sum up the current pending design decisions:

  • automatic column naming scheme (if we agree on the rule I used here currently I would synchronize it with combine to have consistency - this would be breaking)
  • allowing single column selection duplicates in select and in All (i.e. if All(:a, :a) should be an error or be allowed and select a single column :a)

When we settle this I will update the docs and write tests for the whole PR.

@bkamins
Copy link
Member Author

bkamins commented Jan 15, 2020

Change agreed with @piever is pushed now.

@nalimilan - so we are left with confirming that you agree to the automatic column naming rules I proposed in this PR (they then should be ported to combine for consistency).

@nalimilan
Copy link
Member

Likewise, I'd throw an error for now. :-)

@bkamins
Copy link
Member Author

bkamins commented Feb 29, 2020

OK - so I will restrict the list of what is allowed to be returned to match combine. The good thing is that the user always can write [named_tuple_or_similar] to get what one wants.

@bkamins bkamins changed the title WIP: add transformation and renaming to select Add transformation and renaming to select and select! Mar 2, 2020
@bkamins
Copy link
Member Author

bkamins commented Mar 2, 2020

I have finished working on the implementation, documentation and tests of the new functionality. So comments would be appreciated.

Apart from implementing the core functionality I have caught some small things that needed to be polished in other sections of the source codes (chiefly indexing of SubDataFrame with All and Between).

@bkamins
Copy link
Member Author

bkamins commented Mar 7, 2020

One last (hopefully) design decision given the consistency with combine consideration.

If select(df, cols => fun) is written and fun returns AbstractDataFrame, NamedTuple, DataFrameRow, AbstractMatrix maybe we should already allow it and make it create multiple columns - like combine. The question is relevant because allowing this would change the internal design.

@nalimilan
Copy link
Member

For now I think it's OK to throw an error, the internal design can be changed later as long as it doesn't affect user-visible behavior.

@bkamins
Copy link
Member Author

bkamins commented Mar 7, 2020

OK - so in this case this PR is ready for a review 😄.

@bkamins
Copy link
Member Author

bkamins commented Mar 12, 2020

TODO: make src => fun => dst not copy what fun returned even if copycols=true unless we can check that the return value is === to any elements of src or it is a SubVector.

@bkamins
Copy link
Member Author

bkamins commented Mar 12, 2020

I have changed to "optimistic" mode in copycols=true case. Things get tricky here, so a careful look at the last commit would be welcome. Thank you.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay. Looks quite good, just a few more details.

docs/src/man/getting_started.md Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
test/select.jl Outdated Show resolved Hide resolved
test/select.jl Outdated Show resolved Hide resolved
test/select.jl Show resolved Hide resolved
test/select.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
@pdeffebach
Copy link
Contributor

I just checked this out and played around. Doing

select(df, :, [:a, :b] => + => :c)

throws

julia> select(df, :, [:a, :b] => + => :c)
ERROR: syntax: "=>" is not a unary operator
Stacktrace:
 [1] top-level scope at REPL[27]:0

Would there be a way to check if a function is a unary operator and, if so, wrap it in parentheses? Or would that require metaprogramming.

@bkamins
Copy link
Member Author

bkamins commented Mar 18, 2020

Would there be a way to check if a function is a unary operator and, if so, wrap it in parentheses? Or would that require metaprogramming.

We cannot do anything about it, even + => something does not parse correctly. One has to wrap + in (+) as you note.

bkamins and others added 2 commits March 18, 2020 22:40
Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Copy link
Member Author

bkamins commented Mar 18, 2020

Thank you for the review (this seems to be the longest PR in terms of comments and commits we have had recently). I tried to incorporate everything from the review. In particular:

  1. I synced docs of select! and select (and fixed some minor bugs in them)
  2. Rewritten the example in the manual
  3. made no-argument column selection not produce _ in front of function name when generating column name
  4. cleaned up tests a bit and added two new tests for passing the same column name in selector, i.e. [:x1, :x1] => - thing which is allowed because we do not care about column names in auto-splatting (but I did not want to merge tests of DataFrame and SubDataFrame into single loops because although similar they were slightly different and then I have tests grouped by data frame type - which is easier to debug in practice, as if I have a test failing in a loop then it is sometimes not immediate to know what actually failed)

@bkamins bkamins merged commit d98b9be into JuliaData:master Mar 19, 2020
@bkamins bkamins deleted the flexible_select branch March 19, 2020 12:14
@bkamins
Copy link
Member Author

bkamins commented Mar 19, 2020

Thank you for working on it. It was a long discussion. Let us hope people will find the new functionality useful!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
non-breaking The proposed change is not breaking
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants