Add transformation and renaming to select and select! #2080

bkamins · 2020-01-06T22:00:54Z

This is a first step in implementing the discussion in #1975.

The functionality is limited, but I first wanted to make sure that on this level we are in agreement. If this is so I will add tests to this PR. After this is merged we can talk about adding more functionalities in separate PRs (even this new functionality is complex enough that writing proper tests for it will be a challenge, so I want to move forward slowly).

Also, even with what I propose here a lot can be achieved; actually, the only things missing are:

Col wrapper for whole-column functions (we did not have an agreement about the name)
~~automatic generation of column names if they are missing (here we first have to set common rules for all functions in DataFrames.jl)~~ (actually I have done it for the case of one column, but had just forgotten that it is there 😄; we still have to decide on column name generation for multiple columns)
passing several columns to a function (this would be nice but I think it can wait as it is not a super common case)
nesting transformations in vectors and broadcasting of transformations (it would be nice to have it, but first we have to resolve broadcasting for Not)

nalimilan

Thanks for starting this! I've made a few high-level comments.

If you have the time, it would make sense IMO to also implement support for several input columns in this PR. The structure of the code will probably be affected so better get it right from the start.

src/abstractdataframe/selection.jl

bkamins · 2020-01-08T11:16:26Z

> implement support for several input columns in this PR.

I can add it (I have tried to design the code to make such additions easy - normalize_selection has exactly the purpose to make it simple).

Let us just agree on one thing. In the case of multiple columns passed do we want to pass NamedTuples of values in rows to the function? So essentially the operation would be:

fun.(getfield.(Tables.rows(Tables.columntable(df[!, selected_columns])), :columns))

(this should be efficient)

or we want to pass DataFrameRows:

fun.(eachrow(df[!, selected_columns]))

(this avoids recompilation each time we make a selection)

The API is essentially the same in both cases; the difference is the type we pass, and the latter option allows a function that takes a DataFrameRow to potentially mutate the source data frame.

What do you think?
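To make the comparison concrete, here is a minimal sketch of the two options on a toy data frame. It assumes a recent DataFrames.jl release with Tables.jl available in the environment. The names df, selected_columns and fun are made up for illustration, and Tables.rowtable is used as a simple stand-in for producing NamedTuple rows rather than the exact expression quoted above.

```julia
using DataFrames
using Tables

df = DataFrame(a = [1, 2], b = [3, 4])
selected_columns = [:a, :b]   # hypothetical selection

# Hypothetical row-wise function; it works with any row type
# that supports property access.
fun(row) = row.a + row.b

# Option 1: rows as NamedTuples (immutable values, specialized per selection)
nt_rows = Tables.rowtable(df[!, selected_columns])
r1 = fun.(nt_rows)

# Option 2: rows as DataFrameRow views; a DataFrameRow is a view,
# so fun could in principle write back into the underlying columns
r2 = fun.(eachrow(df[!, selected_columns]))
```

Both calls give the same values here; the difference lies in what type fun receives and whether it can mutate the source.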

bkamins · 2020-01-08T18:39:53Z

Thank you for the comments. For now we have the following key open issues:

if we want to keep transform or leave it out for now
if we pass multiple columns to a transformation, whether we should pass a NamedTuple or a DataFrameRow to the function
how to generate column name if we pass multiple columns

I leave the Col wrapper for passing whole columns for later. However, it might guide the NamedTuple or DataFrameRow decision, because with the Col wrapper the equivalent decision is whether we pass a NamedTuple of abstract vectors or a SubDataFrame.

(also probably you have an opinion given the design in combine 😄)

nalimilan · 2020-01-09T09:08:22Z

I think we should pass a NamedTuple, just like in combine. Otherwise things are going to be quite slow. Though at some point we could support an argument or a syntax to use DataFrameRow.

Regarding the generation of column names, I guess it could make sense to list the names of all input variables up to a threshold (e.g. 3)? One tricky case is when the input columns are selected via : or a regex (or another kind of selection rule): is it a good idea to use the names of the columns, since they are relatively hard for the caller to find?

I don't have a strong opinion about transform, but I tend to prefer keeping PRs as small as possible. :-)

bkamins · 2020-01-09T12:41:13Z

OK - I have made all the changes:

removed transform
added auto-generation of column names (please see what I proposed - if we agree to this I would use the same in combine in the future)
started using NamedTuple in multi-column selection (I have a small question about this, which I have posted separately there)

src/abstractdataframe/selection.jl

bkamins · 2020-01-10T17:24:35Z

@nalimilan - I think I have addressed all the design issues in this PR. The only question is whether you are OK with the multiple-column automatic naming scheme I have proposed in https://github.com/JuliaData/DataFrames.jl/pull/2080/files#diff-ac2eb247bb3d79f652033279061a1ceaR40.

When we confirm we are OK with the design I will add tests of the new functionality (I keep things pending in TODO at the top of the file). Essentially for the two other PRs the things that will be left to do are Col wrapper and transform functions.

src/abstractdataframe/selection.jl

bkamins · 2020-01-10T18:58:02Z

Yeah - this PR is not easy (and this is what I expected). Please let me know when you are clear on the design decisions I proposed, and I will then start writing tests.

src/abstractdataframe/selection.jl

bkamins · 2020-01-13T11:34:36Z

Just to sum up the current pending design decisions:

automatic column naming scheme (if we agree on the rule I used here currently I would synchronize it with combine to have consistency - this would be breaking)
allowing single column selection duplicates in select and in All (i.e. if All(:a, :a) should be an error or be allowed and select a single column :a)

When we settle this I will update the docs and write tests for the whole PR.

bkamins · 2020-01-15T12:58:48Z

Change agreed with @piever is pushed now.

@nalimilan - so we are left with confirming that you agree to the automatic column naming rules I proposed in this PR (they then should be ported to combine for consistency).

nalimilan · 2020-02-29T12:12:41Z

Likewise, I'd throw an error for now. :-)

bkamins · 2020-02-29T12:15:37Z

OK - so I will restrict the list of what is allowed to be returned to match combine. The good thing is that the user always can write [named_tuple_or_similar] to get what one wants.

bkamins · 2020-03-02T23:09:51Z

I have finished working on the implementation, documentation and tests of the new functionality. So comments would be appreciated.

Apart from implementing the core functionality I have caught some small things that needed to be polished in other sections of the source code (chiefly indexing of SubDataFrame with All and Between).

bkamins · 2020-03-07T21:20:42Z

One last (hopefully) design decision given the consistency with combine consideration.

If select(df, cols => fun) is written and fun returns an AbstractDataFrame, NamedTuple, DataFrameRow, or AbstractMatrix, maybe we should already allow it and make it create multiple columns, like combine. The question is relevant because allowing this would change the internal design.

nalimilan · 2020-03-07T21:45:34Z

For now I think it's OK to throw an error, the internal design can be changed later as long as it doesn't affect user-visible behavior.

bkamins · 2020-03-07T23:19:21Z

OK - so in this case this PR is ready for a review 😄.

bkamins · 2020-03-12T10:40:11Z

TODO: make src => fun => dst not copy what fun returned even if copycols=true unless we can check that the return value is === to any elements of src or it is a SubVector.

bkamins · 2020-03-12T16:10:47Z

I have changed to "optimistic" mode in the copycols=true case. Things get tricky here, so a careful look at the last commit would be welcome. Thank you.

nalimilan

Sorry for the delay. Looks quite good, just a few more details.

docs/src/man/getting_started.md

src/abstractdataframe/selection.jl

test/select.jl

src/abstractdataframe/selection.jl

pdeffebach · 2020-03-18T15:01:31Z

I just checked this out and played around. Doing

select(df, :, [:a, :b] => + => :c)

throws

julia> select(df, :, [:a, :b] => + => :c)
ERROR: syntax: "=>" is not a unary operator
Stacktrace:
 [1] top-level scope at REPL[27]:0

Would there be a way to check if a function is a unary operator and, if so, wrap it in parentheses? Or would that require metaprogramming?

bkamins · 2020-03-18T17:03:51Z

> Would there be a way to check if a function is a unary operator and, if so, wrap it in parentheses? Or would that require metaprogramming?

We cannot do anything about it; even + => something does not parse correctly. One has to wrap + in parentheses, i.e. (+), as you note.
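For reference, a minimal sketch of the parenthesized form, assuming a recent DataFrames.jl release in which multiple source columns are passed to the function as positional arguments; the data frame below is made up for illustration.

```julia
using DataFrames

df = DataFrame(a = [1, 2], b = [3, 4])

# A bare `+ => :c` is a syntax error, so the operator is written as `(+)`.
res = select(df, :, [:a, :b] => (+) => :c)
```

Here res keeps columns a and b and adds a column c holding their element-wise sum.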

bkamins · 2020-03-18T22:43:52Z

Thank you for the review (this seems to be the longest PR in terms of comments and commits we have had recently). I tried to incorporate everything from the review. In particular:

I synced docs of select! and select (and fixed some minor bugs in them)
rewrote the example in the manual
made no-argument column selection not produce _ in front of function name when generating column name
cleaned up tests a bit and added two new tests for passing the same column name in selector, i.e. [:x1, :x1] => - thing which is allowed because we do not care about column names in auto-splatting (but I did not want to merge tests of DataFrame and SubDataFrame into single loops because although similar they were slightly different and then I have tests grouped by data frame type - which is easier to debug in practice, as if I have a test failing in a loop then it is sometimes not immediate to know what actually failed)
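
As an illustration of the automatic column naming discussed in this thread, here is a small sketch. It assumes the behavior of a recent DataFrames.jl release, which may differ in detail from the state of this PR at the time; the data frame is made up for the example.

```julia
using DataFrames

df = DataFrame(a = [1, 2], b = [3, 4])

# No destination name is given, so one is generated from the
# source column name(s) and the function name.
single = select(df, :a => cumsum)
multi = select(df, [:a, :b] => (+))

single_names = string.(names(single))
multi_names = string.(names(multi))
```

With these assumptions single_names is ["a_cumsum"] and multi_names is ["a_b_+"].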

bkamins · 2020-03-19T12:15:14Z

Thank you for working on it. It was a long discussion. Let us hope people will find the new functionality useful!

bkamins added 2 commits January 6, 2020 22:42

add support for transforms in select and define transform and transform!

77f4623

fix SubDataFrame select signature

147a427

bkamins mentioned this pull request Jan 6, 2020

Allow rename when selecting #1975

Closed

bkamins added 2 commits January 6, 2020 23:08

fix problem in autogeneration of column names

11fd0a2

add documentation of automatic generation of column names

6fa4f84

nalimilan reviewed Jan 7, 2020

improvements after code review

d501fb4

bkamins changed the title ~~Add transformation and renaming to select; define transform and transform!~~ Add transformation and renaming to select Jan 9, 2020

updates after a code review

7053e5b

bkamins commented Jan 9, 2020

bkamins added 4 commits January 10, 2020 09:10

correct variable name

ec834e2

minor fix

6c76aca

fix select for SubDataFrame

e59d129

improved multiple column transformation

dee8ac7

improve select for SubDataFrame

4a8a40b

nalimilan reviewed Jan 10, 2020

bkamins and others added 2 commits January 10, 2020 19:21

Apply suggestions from code review

f04a549

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

fixes after code review

bbc06f4

nalimilan reviewed Jan 12, 2020

fixes from code review

498d9df

disallow duplicates in single column selection

fa5a1f1

fix select for SubDataFrame to avoid duplicate ColumnIndex selelctions

cd8f41b

pdeffebach mentioned this pull request Feb 29, 2020

Standardizing working with multiple columns #2016

Closed

bkamins added 8 commits February 29, 2020 20:23

disallow AbstractDataFrame, NamedTuple, DataFrameRow, and AbstractMat…

df59216

…rix as return values of fun.

fix test

d932b05

clean up transformation implementation

688b077

further sanitizing select rules and more code explanations

c34ee72

fix comments

b818d57

tests of disallowed values

08d4043

finalize tests

49dff0e

fix Julia 1.0 tests

d685576

bkamins changed the title ~~WIP: add transformation and renaming to select~~ Add transformation and renaming to select and select! Mar 2, 2020

stop doing pessimistic copy when copycols=true

35f8996

nalimilan reviewed Mar 18, 2020

bkamins and others added 2 commits March 18, 2020 22:40

Apply suggestions from code review

78b492d

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

fixes after code review

52e690d

improve docstring

20642c5

nalimilan approved these changes Mar 19, 2020

bkamins merged commit d98b9be into JuliaData:master Mar 19, 2020

bkamins deleted the flexible_select branch March 19, 2020 12:14
