-
Notifications
You must be signed in to change notification settings - Fork 367
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow rename when selecting #1975
Conversation
Signed-off-by: lizz <lizz@sensetime.com>
Co-Authored-By: Bogumił Kamiński <bkamins@sgh.waw.pl>
Co-Authored-By: Bogumił Kamiński <bkamins@sgh.waw.pl>
Thank you for the proposal. I have two general comments: First, the PR has also changes from #1974 in it, this probably should be cleaned up at some point (this is a minor issue). Second (and this is a crucial point), I am not 100% sure that this composes well. I.e. maybe it is better to keep
and then the problem kicks in that we interpret it as |
79c7e57
to
1ebb978
Compare
Thanks for the comment. Let me describe the background story and motivation for the change. I started (for the first time) to use For example, when you want to modify two columns: julia> df = DataFrame(name=["vid1","vid2","vid3"], pred1=["walk","talk","run"],p1=[0.3,0.5,0.6],pred2=["walk","run","talk"],p2=[0.7,0.6,0.5],outpred=["","",""],outp=[0.,0.,0.])
3×7 DataFrame
│ Row │ name │ pred1 │ p1 │ pred2 │ p2 │ outpred │ outp │
│ │ String │ String │ Float64 │ String │ Float64 │ String │ Float64 │
├─────┼────────┼────────┼─────────┼────────┼─────────┼─────────┼─────────┤
│ 1 │ vid1 │ walk │ 0.3 │ walk │ 0.7 │ │ 0.0 │
│ 2 │ vid2 │ talk │ 0.5 │ run │ 0.6 │ │ 0.0 │
│ 3 │ vid3 │ run │ 0.6 │ talk │ 0.5 │ │ 0.0 │
julia> mask = df.p2 .> 0.5
3-element BitArray{1}:
1
1
0
julia> df[mask, [:outpred, :outp]] .= df[mask, [:pred2, :p2]] # their names must match
ERROR: ArgumentError: Column names in broadcasted data frames must match. Non matching column names are pred2, p2, outpred and outp
Stacktrace:
...
julia> df[mask, [:outpred, :outp]] .= df[mask, [:pred2=>:outpred, :p2=>:outp]] # <- look here
2×2 SubDataFrame
│ Row │ outpred │ outp │
│ │ String │ Float64 │
├─────┼─────────┼─────────┤
│ 1 │ walk │ 0.7 │
│ 2 │ run │ 0.6 │
julia> df
3×7 DataFrame
│ Row │ name │ pred1 │ p1 │ pred2 │ p2 │ outpred │ outp │
│ │ String │ String │ Float64 │ String │ Float64 │ String │ Float64 │
├─────┼────────┼────────┼─────────┼────────┼─────────┼─────────┼─────────┤
│ 1 │ vid1 │ walk │ 0.3 │ walk │ 0.7 │ walk │ 0.7 │
│ 2 │ vid2 │ talk │ 0.5 │ run │ 0.6 │ run │ 0.6 │
│ 3 │ vid3 │ run │ 0.6 │ talk │ 0.5 │ │ 0.0 │ So, the motivation is for the |
Signed-off-by: lizz <lizz@sensetime.com>
Signed-off-by: lizz <lizz@sensetime.com>
1ebb978
to
93219f4
Compare
Properly rebased. Only the last commit belong to this pr. |
Thank you for all the input. What I would propose to do before we move forward with the review of the implementation is to make sure what is the target API so that JuliaData memebers can comment on it. The reason is, as I have commented earlier that mixing selection (or So would be willing to write down the target API for this change (like what should go into the documentation)? In particular please note that it seems that this change will not combine will |
Just as a note |
Agree with you, but assessing the composability and consistency is beyond my knowledge (as I seldom use Data-related packages), so I'm willing to wait for inputs from anyone else. My intuition comes from SQL where renaming is often used with selection, e.g. SELECT day_of_order AS "Date" # <- a rename
Customer As "Client", # <- another rename
Product,
Quantity,
FROM orders; |
A quick search:
The SELECT Id AS Identifier,
LastName + ', ' + FirstName AS CustomerName, # <- also come with rename
FROM Customer Maybe we can change the syntax from @select(df, :c, x = :b + :c) into @select(df, :c, :b + :c => :x) |
OK - I will think about it and write down what I think is safe to do as I think this functionality is useful. Regarding the packages - the major alternative to DataFrames.jl is JuliaDB.jl related ecosystem. There you have it defined here. We are in the process of synchronizing the API across packages. That is the point of DataAPI.jl (actually I thought we migrated A major point to consider is that Also maybe we should only provide renaming in Finally, why do you think the syntax:
is not OK? |
The current version of select in here wrote select(t::Table, which::Selection)
`Selection` is a type union of many types that can select from a table. It can be:
...
* `Pair{Selection => Function}` – selects and maps a function over the selection, returns the result.
* `AbstractArray` – returns the array itself. This must be the same length as the table. # <- look here The Furthermore, it seems the form For the select(df, [:c, :b + :c => :x]) |
I agree there is a discrepancy, but my intuition is that it would be good to decrease the level of it with time. Regarding your proposal here is what I think:
The syntax could be:
which would take a source column, transform it using the specified transformation (a function accepting one argument and returning a value; here we should make sure to correctly handle with two shorthands:
expanding to:
and
expanding to:
If we went this way the following things should be reviewed in the whole code base against this change:
In particular I already see that:
and
should be reimplemented and all Overall, after thinking of it, I think your proposal would greatly improve the functionality of A simpler approach would be to implement this only for a single CC @nalimilan, @quinnj, @oxinabox, @pdeffebach, @piever (in case you wanted to comment on it as it is a major change 😄) |
Also now as I review #1849 by @pdeffebach then as an additional functionality (this does not have to be done now, but I want to keep things in a single place if we discuss the target design issues) is to allow things like:
or
as arguments to @pdeffebach (the difference here from what you proposed in #1849 is that it is |
I've always been critical of data frames libraries that combine Yes, I apologize for not working on that more. I totally agree that we should make the |
This change scares me. I'm not entirely sure I understand the use case. I've certainly done such operations in combination. But I don't think it would naturally occur to me to do them as one. One pattern I encounter is masking by one column then renaming other column. Trying it on.
Vs
I'm not sure. In the case of your original example.
|
I agree with the comments above that renaming should not be part of indexing. On the other hand, adding renaming to select was proposed in JuliaDB as well (on slack I'm afraid as I can't find the relevant issue). The advantage is that
In JuliaDB the syntax is |
I agree changing the semantic of |
@piever I proposed In my mental model this is a difference between Now if one accepts the:
approach then naturally all other cases follow i.e.
The major point if I understand the situation is that it would be breaking for JuliaDB.jl. Right? EDIT |
I feel like most of the stuff relating to this doesn't apply to DataFrames as much as it does to JuliaDB. So in DataFrames it is always more natural to write a series of
and renames. |
For DataFrames.jl we have a major decision:
And I am stating, that if we went for option 2, the natural syntax for |
@bkamins The second option is exactly what I've been working on a branch of my Selections.jl package, using only the Tables.jl. I think Here is the API I'm finishing right now for the select function:
(There are sensible default constructors for different combinations of types in the pairs supplied to select) where the key clause are "renamings" (functions applied to selected columns -- the "if clause" are the selections which can be The transform clause is the new thing I was working on -- here is a table that summarized what it can do:
Broadcasting plays nicely with your ability to chain multiple selections together (you can e.g. Finally, the thing I'm still working on is the ability to create new columns. For this I'd like to use the key word arguments as so:
This would create new column So the message is: I'm still working on this, if you want to discuss this, I'm |
Just a mention. When allowing column renaming we will have to be careful with duplicate column names produced. This is something we did not have to care about in the past, but now if someone writes |
Another thought. For: for Then we would say that The only drawback is that it is imaginable that in Does anyone have some thoughts over this? |
Let's throw an error for now, and (as always) we can change that later if it turns out to be useful.
In the context of |
The problem with So I now tend to thing that (summarizing the design):
Now as for duplicates of
(the only tricky case that I see now is when Does it sound good? If yes I will implement 2. ( |
I can do the |
Also, given how complicated this discussion has gotten, are we sure we don't want to just have 3 functions, I should note that if |
Yeah - this is kind of what I had in my mind originally in #1975 (comment). We have two options in general: Option 1Make Option 2Leave The only difference is that in transform we could stick to the The reason it is fully valid is that as of Julia 1.3 we can write:
which means that we do not have a problem that column name is not a proper identifier. As noted above - I prefer Option 2 (it is a cleaner design for me and does not require changing the rest of the API), but I see the point of Option 1, as it is SQL-like select. |
After a discussion with @nalimilan we have settled to use "powerful select" option. In particular a simple transform of a column (adding a column to a data frame) will be performed by:
or
We just have to decide what we do when @nalimilan - would you want to add some more comment about your thinking about the issue (i.e. the issue of specialized vs powerful |
Sorry if this has been already discussed to death (and I may have argued differently in the past), but I'm starting to worry that the
Starting from this, I find a bit hard to guess what Or is the argument that, in general, I suspect it may be safer to even disallow If one does not want |
It is better to discuss it now than later. My understanding is:
also you have not asked but there are two other "shorthands" that are planned to be allowed
Also |
Just to expand on two important design considerations:
|
I agree. The name of the resulting column (if provided) should always be the right-most element. So OTOH we could use a special syntax to allow using external data, for example
Currently in |
I drop some comments (this post does not try to propose a fully consistent specification but rather notes some observations - and all of them do not have to be introduced all at once, but maybe they should so that we are sure that we are consistent from the start). In general my notes below base on the fact that whatever we pass to
I agree it should be not allowed not to create confusion, but I just realized that if we rewrite it as So essentially
Actually it should be
So the only decision here is to what
in case of vector
I agree this is my major design painpoint. A first comment is that we should add Now which default we take with an implicit column name is something that is non obvious. Especially as |
+1
Probably the second one is more useful. But let's disallow that for now.
Something we might want to change is that for two-arguments functions we could concatenate the name of the function and the names of the columns, e.g. |
We should take advantage of broadcasting the
|
Sorry, I don't get your point. Since the function takes two arguments, why would you want to broadcast? Also, what's "1404"? |
sorry I think I misunderstood the above conversation. #1404 was just dumb github autocomplete. Nevermind my above comment, though I will advocate for broadcasted |
We should add broadcasting into consideration now, to be able to define a comprehensive solution:
Should be rewritten as:
which should be rewritten as:
and should be equivalent to (this point is new in the discussion as it assumes we would unpack
Now the additional thing is that if the first argument is wrapped in |
This is an initial writeup of the new functionality we discussed:
What is missing / important notes (left it out to have a gradual implementation):
|
In #2080 I have proposed a first step towards a full blown select/transform combo |
Okay, thanks for the effort! I'm still busy these days, so please go ahead |
Based on the previous pr #1974.
Now it supports