ENH: Implement DataFrame.select #61527
Conversation
```
Parameters
----------
*args : hashable or tuple of hashable
```
Can we also support a list of hashables?
What would be the meaning of a list? The same as a tuple, for `MultiIndex`?
```
1    Cooper  Alice   22
2    Marley     Bob  35
```

In case the columns are in a list, Python unpacking with star can be used:
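(The docstring example that follows this line was cut from the excerpt; a minimal sketch of the star-unpacking pattern it describes, with hypothetical column names:)

```python
cols = ["last_name", "first_name"]
df.select(*cols)  # equivalent to df.select("last_name", "first_name")
```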
I'm not a fan of this - I'd prefer just passing the list
I'm open to it, and my first idea was to support both `df.select("col1", "col2")` and `df.select(["col1", "col2"])`.

But after checking in more detail, I find the second version not so readable with the double brackets, and for the case when the columns are already in a variable, just a star makes it work.

Besides readability, which to me would be enough reason on its own, allowing a list adds a decent amount of complexity. For example, what would you do here: `df.select(["col1", "col2"], "col3")`? Raise? Return all columns? What about this other case: `df.select(["col1", "col2"], ["col3", "col4"])`? Same as the previous? And what about `df.select("col1", ["col2", "col3"])`? Personally, I think we shouldn't have to answer these questions, or make users guess much. The simplest approach seems to be good enough, if I'm not missing any use case.
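For concreteness, a minimal sketch of how the `*args` form handles both cases, assuming the `DataFrame.select` method this PR adds (the DataFrame contents here are made up):

```python
import pandas as pd

df = pd.DataFrame({"col1": [1], "col2": [2], "col3": [3]})

# individual names as positional arguments
df.select("col1", "col2")

# columns already held in a variable: unpack with a star
cols = ["col1", "col2"]
df.select(*cols)
```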
> I'm open to it, and my first idea was to support both `df.select("col1", "col2")` and `df.select(["col1", "col2"])`.

Why not support ONLY a list?

> But after checking in more detail, I find the second version not so readable with the double brackets, and for the case when the columns are already in a variable, just a star makes it work.

I think this is about consistency in the API. For example, with `DataFrame.groupby()` you can't do `df.groupby("a", "b")`; you have to do `df.groupby(["a", "b"])`.

> Besides readability, which to me would be enough reason on its own, allowing a list adds a decent amount of complexity.

It's complexity in the implementation versus consistency of the API.

> For example, what would you do here: `df.select(["col1", "col2"], "col3")`? Raise? Return all columns?

Raise. Only support lists or callables. And a static type checker would see that as invalid.

> What about this other case: `df.select(["col1", "col2"], ["col3", "col4"])`? Same as the previous?

Raise. And a static type checker would see that as invalid.

> And what about `df.select("col1", ["col2", "col3"])`?

Raise. And a static type checker would see that as invalid.

> Personally, I think we shouldn't have to answer these questions, or make users guess much. The simplest approach seems to be good enough, if I'm not missing any use case.

I don't see why a list isn't simple (and consistent), and it allows better type checking, as well as additions to the API in the future, should we decide to make them.
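To illustrate the type-checking argument, a hypothetical list-only signature (a sketch, not the PR's actual code; the name and annotations here are assumptions):

```python
from collections.abc import Hashable

import pandas as pd

def select(df: pd.DataFrame, columns: list[Hashable]) -> pd.DataFrame:
    """Hypothetical list-only select: exactly one list argument."""
    return df[columns]

df = pd.DataFrame({"col1": [1], "col2": [2], "col3": [3]})

select(df, ["col1", "col2"])            # OK
# select(df, ["col1", "col2"], "col3")  # type checker: too many arguments
# select(df, "col1")                    # type checker: str is not list[Hashable]
```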
Thanks for the detailed feedback; what you say seems reasonable. To me there is a significant advantage in readability and usability in using `df.select("col1", "col2")` over `df.select(["col1", "col2"])`. I see your point on consistency with `groupby`, and while the list is still not my favorite option, it does seem reasonable. I'll let others share their opinion too, as in the end there is a trade-off and it's a question of personal preference.
I'm fine with list-only.

Slight preference for `(arg)` over `(*arg)`, strong preference for supporting one, not both.
For reference, PySpark's `select` takes `*cols`.

If in the future we implement `df.select(new_col=pd.col("old_col") * 2)`, the list version would have to become `df.select({"new_col": pd.col("old_col") * 2})`.

I'm personally not convinced by the reasons to use a list so far. Being consistent with `groupby` […]. I feel that we're making the API more complicated for users by using a list, and I fail to see the reason for it so far. It'd be great to know more details about the reasons. I'm fine with updating the PR if that's what everybody else thinks is best, but it seems like a mistake to me so far.
It's not just `groupby` […].

My point here is that the convention in the rest of the API, wherever a sequence of column names is required, is to pass a list, and not to use `*args`.
Thanks @Dr-Irv for putting together the list, I really appreciate it, and it's very helpful. I'm still not convinced, though, as all of them are different cases. You can do […]. We could also consider this for […]. To me […].
I'd be fine with having one column or a list of columns. But this also raises the following. Compare the following possibilities: […]

It seems to me there is an existing way of selecting a subset of columns that has a concise syntax, so why would I use `select`?
Maybe I'm missing something, but using […].
You're right; it's the inconsistency, with everything else being a method, that I don't think is great:

```python
import pandas

(pandas.read_csv("taxi_data/yellow_tripdata_2015-01.csv")
    .rename(columns={"trip_distance": "trip_miles"})
    [["pickup_longitude", "pickup_latitude", "trip_miles"]]
    .assign(distance_kms=lambda df: df.trip_miles * 1.60934)
    .drop("trip_miles", axis=1)
    .pipe(lambda df: df[df.distance_kms > 10.])
    .to_parquet("long_taxi_pickups.parquet"))
```

I would rather have this, so everything is a method and the filter is explicit:

```python
import pandas

(pandas.read_csv("taxi_data/yellow_tripdata_2015-01.csv")
    .rename(columns={"trip_distance": "trip_miles"})
    .select("pickup_longitude", "pickup_latitude", "trip_miles")
    .assign(distance_kms=lambda df: df.trip_miles * 1.60934)
    .drop("trip_miles", axis=1)
    .filter(lambda df: df.distance_kms > 10.)
    .to_parquet("long_taxi_pickups.parquet"))
```

To me the second option is significantly more intuitive and clear. And if in the future we decide to implement `pandas.col`, we could even write:

```python
import pandas

(pandas.read_csv("taxi_data/yellow_tripdata_2015-01.csv")
    .rename(columns={"trip_distance": "trip_miles"})
    .select("pickup_longitude", "pickup_latitude", distance_kms=pandas.col("trip_miles") * 1.60934)
    .filter(pandas.col("distance_kms") > 10.)
    .to_parquet("long_taxi_pickups.parquet"))
```

I think this is very similar to what PySpark and Polars do; they improved on the pandas syntax, and I think we can improve it in the same way too.
I'm not comfortable with having a named-keyword assignment in here. IIUC from the dev call there was another keyword from […].
So this is just about a syntax preference. I'd probably continue to use the current methodology of […]. But if you are trying to get people coming from Polars or PySpark to use pandas instead, then your proposal makes sense, although I think the API should accept either a […].
Thanks for the feedback. This is something that Polars allows, and I wanted to show it because it makes the example cleaner. But it's surely not part of this PR, nor of any plan for the short term. We should have […]. Sorry if I created more confusion with it. As said, select with keyword arguments is unrelated to this PR.
You are correct @Dr-Irv, and of course […].
So if you look at our docs at https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#astype, there is this example:

```python
dft[["a", "b"]] = dft[["a", "b"]].astype(np.uint8)
```

If you are encouraging readability, would you suggest we update the docs/tutorials that use a pattern like the above to:

```python
dft[["a", "b"]] = dft.select("a", "b").astype(np.uint8)
```

in order to encourage the use of `select`? I'm not sure encouraging that usage is a good idea in that particular example.

Or what about https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html#grouping, where we show:

```python
df.groupby("A")[["C", "D"]].sum()
```

If you think that […], should we show:

```python
df.groupby("A").select("A", "B").sum()
```

so that the "suggested" syntax is the same for both selecting columns in a `DataFrame` and in a groupby?

I understand your argument about consistency within method chains if […]. On the other hand, if this is just syntactic sugar, with no promotion of its usage as "best practice", it's not a big deal to introduce it, but I don't necessarily agree that using […].
It's just for method chaining that I think it's better; I don't think `select` is better in isolation. It's more explicit, but I don't think we should rewrite examples or encourage `select`. I think […].

Good point about `select` for groupby; it does seem like a good idea, as a pipeline with method chaining and a groupby will also be clearer and more consistent with `select`.
To me the real issue here is that in method chains you'd like one operation on each line. But when you format your code, you get this:

```python
(
    pandas.read_csv("taxi_data/yellow_tripdata_2015-01.csv")
    .rename(columns={"trip_distance": "trip_miles"})[["pickup_longitude", "pickup_latitude", "trip_miles"]]
    .assign(distance_kms=lambda df: df.trip_miles * 1.60934)
    .drop("trip_miles", axis=1)
    .pipe(lambda df: df[df.distance_kms > 10.])
    .to_parquet("long_taxi_pickups.parquet")
)
```

In addition, the expression syntax in Polars and PySpark of allowing both args and kwargs is quite powerful. Instead of having to break up selection and assignment into different calls, you can do everything all at once. I also think […].
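For comparison, a small sketch of the args-plus-kwargs style as it exists in Polars today (the data and column names here are made up):

```python
import polars as pl

df = pl.DataFrame({"trip_distance": [5.0, 12.0]})

# positional expressions and keyword-named expressions in one call:
# selection and assignment happen together
out = df.select(
    pl.col("trip_distance"),
    trip_kms=pl.col("trip_distance") * 1.60934,
)
```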
Based on the feedback in #61522 and on the last devs call, I implemented `DataFrame.select` in the simplest possible way. It does work with `MultiIndex`, but it does not directly support equivalents to `filter(regex=)` or `filter(like=)`. I added examples in the docs, so users can do that easily in Python (I can add one for regex if people think it's worth it). The examples in the docs and the tests should make the behavior quite clear; feedback welcome.

For context, this is added so we can make `DataFrame.filter` focus on filtering rows, for example: […] or […]
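(The two examples above were elided from this excerpt; a hypothetical illustration of row-focused filtering combined with `select`, with made-up data and the method names this PR proposes:)

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [22, 35], "city": ["NY", "LA"]})

# hypothetical, assuming select() from this PR and a row-focused filter():
# df.select("name", "age").filter(lambda df: df.age > 30)

# the same selection and row filter with today's API, using callable indexing:
df[["name", "age"]][lambda d: d.age > 30]
```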
CC: @pandas-dev/pandas-core