ENH: Dedicated method for creating conditional columns #39154
One to add to the list is `Series.mask`.
One option, using `df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})`:

```python
df.assign(c=case_when(df.a < 2, 0, df.b))

# multiple conditions:
df.assign(d=case_when(df.a < 2, 0,  # first condition, result
                      df.b < 2, 1,  # second condition, result
                      df.a))        # default
```

The idea is inspired by `if_else` in dplyr.
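For comparison, something close to the proposed calls can be written today with `np.select`; a sketch (the `case_when` helper above does not exist in pandas, so this uses numpy directly):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# single condition, with a column as the default:
df['c'] = np.select([df['a'] < 2], [0], default=df['b'])

# multiple conditions; the first matching condition wins:
df['d'] = np.select([df['a'] < 2, df['b'] < 2], [0, 1], default=df['a'])
```

This is the idiom the proposal would wrap: readable enough, but it returns a numpy array and loses pandas extension dtypes.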
This looks like a clean method for the API, but that would mean a general function. Instead I think a
```python
>>> import pandas as pd
>>> from datar.all import f, mutate, case_when, if_else
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> df >> mutate(c=if_else(f.a < 2, 0, f.b))
         a        b         c
   <int64>  <int64> <float64>
0        1        4       0.0
1        2        5       5.0
2        3        6       6.0
>>> df >> mutate(d=case_when(f.a < 2, 0, f.b < 2, 1, True, f.a))
         a        b         d
   <int64>  <int64> <float64>
0        1        4       0.0
1        2        5       2.0
2        3        6       3.0
```
would be averse to adding
I agree on the latter. I started with an attempt to implement this and quickly ran into quite some complexity. But given the original post and problem statement, wouldn't you agree that a data-wrangling library needs a dedicated method for creating conditional columns? It is an action you perform quite often when transforming data.
That was a typo; it should read "wouldn't be averse". That said, this is a substantial effort and would have to be designed properly.
@erfannariman what complexity did you encounter?
The first implementation of creating a conditional column was relatively easy, but then came the design choices around numpy dtypes vs pandas extension dtypes, plus passing all the tests and edge cases and the mypy type checks. But to be fair, the attempt was relatively short compared to what is needed to implement this properly.
If you do not mind @erfannariman, kindly share the edge cases, as well as the dtype issues, so I can test on my end.
Any decision made here on what method to proceed with?
If I'm understanding the code correctly, those lambdas will be called row-by-row. In that case, they would be highly inefficient compared to vectorized operations.
Isn't it the same row-by-row assignment when we do:

The columns above get assigned one by one, but within the assignment of each column we have vectorisation. I believe this is similar to what was proposed in issue #46285 / PR #46286. If you agree, I can re-open the PR and we can drive this feature forward together. Thanks.
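To illustrate the distinction being discussed, a minimal sketch on a toy frame: `df.apply(..., axis=1)` invokes the Python function once per row, while a vectorized form computes one boolean mask for the whole column at once.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(5), 'b': range(5, 10)})

# Row-by-row: the lambda is called once per row (slow for large frames).
slow = df.apply(lambda row: 0 if row['a'] < 2 else row['b'], axis=1)

# Vectorized: a single boolean mask is evaluated for the entire column.
fast = pd.Series(np.where(df['a'] < 2, 0, df['b']), index=df.index)
```

Both produce the same result, but the per-row version pays a Python-level function call for every row.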
@erfannariman just checking if you are still working on this. If you are not working on it, I can take over and raise a PR |
@ELHoussineT I see, I missed that you're allowing the values to be Series as well as callable. It seems to me this API would not be able to handle the example in #39154 (comment), is that correct?
@rhshadrach @ELHoussineT, one option would be to create a standalone `case_when` function.
I believe the only missing feature is the "default" value. I can tweak this API to support that and add some tests. If you agree, I will open a PR.
@ELHoussineT - can you show what #39154 (comment) would look like with your suggested signature? My understanding is that the keys of the dictionary are to become the values in the result. I don't see how you can use a Series in a straight-forward manner as the values in the result.
Kindly allow me to reply ~ by tomorrow.
Alright, let's start by explicitly describing the requirements for the functionality we want to build. Please feel free to suggest otherwise.

**Requirements**

The functionality should satisfy the following:
**Options**

Now that we have agreed, on a high level, on the requirements, here are our options IMHO:

**Option 1:** What has been suggested by @samukweku in their comment. Example:
**Option 2:** We follow what Spark did with their `when`/`otherwise` construct. Example:
Please note that if we choose this option, upon implementation, we can rename

**My opinion**

I would give it to option 2, as IMHO it is more intuitive compared to option 1. I would also be happy to hear what you think. Let me know your thoughts so we can choose and implement a solution.
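For concreteness, a rough sketch of the Spark-style chain from option 2, implemented on top of `np.select`; the `When` class is a hypothetical illustration, not an actual pandas API:

```python
import numpy as np
import pandas as pd

class When:
    """Hypothetical sketch of a Spark-style when/otherwise chain."""

    def __init__(self, condition, value):
        self._cases = [(condition, value)]

    def when(self, condition, value):
        # Accumulate further condition/value pairs; first match wins.
        self._cases.append((condition, value))
        return self

    def otherwise(self, default):
        # Resolve the chain in one vectorized pass.
        conditions = [c for c, _ in self._cases]
        values = [v for _, v in self._cases]
        return pd.Series(np.select(conditions, values, default=default))

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df['d'] = When(df['a'] < 2, 0).when(df['b'] < 2, 1).otherwise(df['a'])
```

Option 1 would instead be a flat function call taking alternating condition/value arguments plus a default, as in the earlier `case_when` proposal.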
How do these differ from

Also, I don't see how we can implement

Edit: Ah, option 1 allows the default. Am I to assume that condition 2 will only be utilized where condition 1 was false? If that is the case, is it still evaluated where condition 1 is true? This would have implications if you're doing some op that may raise on rows where condition 1 is true.
I can speak about option 1. It uses the same idea as np.select, where all conditions are evaluated and values are assigned based on the first match, using `Series.mask`. So if condition 1 is true and condition 2 is true, condition 1 takes precedence. If all conditions fail, the default kicks in. Basically a for loop, just like in np.select, but using `Series.mask`. Hopefully I explained it well enough. An implementation is here, and an example from StackOverflow.
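A minimal sketch of that idea, assuming alternating condition/value arguments and a leading default; this is a hypothetical helper for illustration, not the linked implementation:

```python
import pandas as pd

def case_when(default, *args):
    """Hypothetical helper: args are alternating condition, value pairs;
    all conditions are evaluated, and the first match wins."""
    pairs = list(zip(args[::2], args[1::2]))
    result = (default.copy() if isinstance(default, pd.Series)
              else pd.Series(default, index=pairs[0][0].index))
    # Apply masks in reverse so earlier conditions overwrite later ones,
    # giving the same first-match precedence as np.select.
    for condition, value in reversed(pairs):
        result = result.mask(condition, value)
    return result

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df['d'] = case_when(df['a'], df['a'] < 2, 0, df['b'] < 2, 1)
```

Note that, exactly as with np.select, every condition is evaluated eagerly, which is relevant to the short-circuiting question above.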
I can see at least one major difference. Please correct me if I understood incorrectly, but with the method you mentioned, you cannot reference other rows. For example, you cannot do the following:
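A hypothetical example of a condition of that kind, referencing the previous row via `shift()`, which a purely element-wise, single-row mapping cannot express:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# The condition compares each row with the previous one, so it depends
# on the whole column, not just the current row's values.
df['c'] = np.where(df['a'] > df['a'].shift(), df['b'], 0)
```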
Good point, you are right. But as @samukweku noted, this is not handled in
Good point, I missed that.
Given that I don't think option 2 is viable (happy to be wrong here if there is a working implementation), option 1 as linked here looks good to me. However, I think we should drop the

Another question about the API is what to do with

These all seem like viable options to me; it's not clear which is to be preferred. Any thoughts?
Yes, returning a pd.Series makes sense; the user can do whatever they want with it. I'd just set the default to
Nice. @samukweku will you pick this one or shall I give it a shot? Thanks.
@ELHoussineT I'll give it a shot, thanks. @erfannariman was already working on it, and I do not know the steps involved in taking over. @rhshadrach @erfannariman, OK if I make a PR based on our conversations?
@samukweku - most certainly!
take |
@samukweku Thanks! Let me know if you need any help.
@ELHoussineT sure, I'll start on it.
@ELHoussineT please kindly take over with regards to this PR. Thanks.
I've been busy with work, but I would still like to give this one more serious try if it's okay with the others.
@erfannariman Sure thing; if you decide to stop working on it, please ping me to start.
@erfannariman any progress so far? I've got some time for this now.
This is my first contribution to pandas core; I would kindly ask for your guidance and patience to make it happen, as it would make me happy to successfully contribute to pandas core. Can you please confirm your agreement with the initial approach in #50343? Usage example:

Approach advantages:

Open questions:

Next steps:

Credit:
ping @erfannariman or @rhshadrach |
You need to also modify
I think you're saying the default value of the
@rhshadrach Thanks for your input, I've added

Docs and tests are remaining; I will add those and move the PR from draft to review.
I took a shot at docs and tests; can you please take a final look? @rhshadrach
Thanks for picking this up. Do you mind keeping the discussion in your PR, so there is no need to comment here for every update? The core devs will respond when they have time. This way we keep this issue clean. @ELHoussineT
Is it ok if I pick this up and work on it? |
take |
Creating conditional columns is an operation you perform a lot when wrangling data. Still, right now there is no dedicated and explicit way of creating conditional columns in pandas. This can be seen on StackOverflow, which is my top hit when I google "pandas conditional column" (this will obviously vary per person).

On this StackOverflow link we see the result of not being explicit: there are at least 6 different methods mentioned and a lot of anti-patterns, including methods we do not want users to use because of inefficiency (`apply` with if/else, `map` with `lambda x`, for example).

This ticket is made as a starting point for discussion about whether we should have a more explicit way of creating conditional columns in pandas. Also because we need to look ahead when switching to EAs, since the most used methods (and the top accepted answer on SO) are numpy methods and return numpy arrays.
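To make the comparison concrete, here is a toy frame exercising the common idioms available today (a sketch, with arbitrary column names; `np.where`, `.loc` with assignment, `Series.where`/`Series.mask` for a single condition, `np.select` for several):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Single condition:
df['w'] = np.where(df['a'] < 2, 0, df['b'])   # np.where
df['m'] = df['b'].mask(df['a'] < 2, 0)        # Series.mask
df['s'] = df['b'].where(df['a'] >= 2, 0)      # Series.where (inverted condition)
df.loc[df['a'] < 2, 'l'] = 0                  # .loc with assignment...
df['l'] = df['l'].fillna(df['b'])             # ...then fill the remaining rows

# Multiple conditions, first match wins:
df['x'] = np.select([df['a'] < 2, df['b'] < 5], [0, 1], default=df['a'])
```

All five columns built from the single condition hold the same values; the spread of idioms is exactly the problem this issue describes.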
What do we have now?

For single conditional columns:

- `np.where`
- `.loc` with assignment
- `Series.where`
- `Series.mask` (see comment)

For multi conditional columns:

- `np.select`
- `.apply` with if/else (not an option imo)

I don't have a suggestion yet, but I know I really like the clean code you can create with `np.select` and also with SQL CASE WHEN.

Other languages:

- SQL has CASE WHEN:
- R (dplyr) has case_when:

Related tickets / questions: