DEPR: pd.concat special cases DatetimeIndex to sort even when sort=False #57335

Open
lukemanley opened this issue Feb 10, 2024 · 8 comments
@lukemanley
Member

pd.concat has a specific behavior to always sort DatetimeIndex when join="outer" and the non-concatenation axes are not aligned. This was undocumented (prior to 2.2.1) and is inconsistent with all other index types and data types including other temporal types such as pyarrow timestamps/dates/times.
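
A minimal sketch for checking the behavior described above on a given installation. The frame contents and variable names are made up for illustration. Whether the `sort=False` call returns the dates sorted depends on the pandas version, so that result is printed rather than stated.

```python
import pandas as pd

# Two frames whose row labels overlap only partially and are not ascending.
dt_left = pd.DataFrame(
    {"x": [1, 2]}, index=pd.DatetimeIndex(["2024-01-03", "2024-01-01"])
)
dt_right = pd.DataFrame(
    {"y": [3, 4]}, index=pd.DatetimeIndex(["2024-01-02", "2024-01-01"])
)

# The same layout, but with plain string labels.
str_left = pd.DataFrame({"x": [1, 2]}, index=["c", "a"])
str_right = pd.DataFrame({"y": [3, 4]}, index=["b", "a"])

# Concatenating along columns makes the row index the non-concatenation axis.
dt_result = pd.concat([dt_left, dt_right], axis=1, join="outer", sort=False)
str_result = pd.concat([str_left, str_right], axis=1, join="outer", sort=False)

# String labels keep first-seen order: ['c', 'a', 'b'].
print(list(str_result.index))
# True if this pandas version sorted the DatetimeIndex despite sort=False.
print(dt_result.index.is_monotonic_increasing)
```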

An attempt to treat this as a bug fix highlighted that this has been long-standing behavior that users may be accustomed to. (xref #57006)

Here are two options that come to mind:

  1. Deprecate the existing behavior and do not sort by default for DatetimeIndex. This would simplify things by removing the carve out and treating all index types / data types consistently. This would require users to explicitly pass sort=True when concatenating two frames with monotonic indexes if they wanted to ensure a monotonic result.

  2. Always sort the non-concatenation axis when join="outer" (for all dtypes). This would be consistent with how pd.merge and DataFrame.join handle outer merges and in practice may be more useful behavior since concatenating two frames with monotonic indexes will return a frame with a monotonic index.
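
To make the two options concrete, here is a small sketch with made-up frames. It shows the explicit `sort=True` call that option 1 would require from users who want a sorted result, next to the result of `DataFrame.join` with `how="outer"`, which is the behavior option 2 compares against.

```python
import pandas as pd

left = pd.DataFrame({"x": [1, 2]}, index=["c", "a"])
right = pd.DataFrame({"y": [3, 4]}, index=["b", "a"])

# Option 1: sorting of the non-concatenation axis is requested explicitly.
explicit = pd.concat([left, right], axis=1, join="outer", sort=True)

# Option 2 would bring concat in line with an outer join,
# which returns a sorted union of the two indexes.
joined = left.join(right, how="outer")

print(list(explicit.index))
print(list(joined.index))
```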

lukemanley added the Deprecate and Needs Discussion labels on Feb 10, 2024
lukemanley added this to the 3.0 milestone on Feb 10, 2024
@rhshadrach
Member

+1 on (1) - different sorting defaults for different dtypes is very surprising.

For (2), I do prefer concat's way of maintaining order by starting with the first frame and tacking on any new labels to the end as it proceeds through the list. When this order is desired, I think this behavior is harder for the user to replicate.
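
The ordering described here can be illustrated with a short sketch (the frames and labels are invented for the example): with `sort=False` and non-datetime labels, the result starts with the first frame's labels and appends each new label as it is encountered.

```python
import pandas as pd

frames = [
    pd.DataFrame({"a": [1, 2]}, index=["m", "k"]),
    pd.DataFrame({"b": [3, 4]}, index=["z", "m"]),
    pd.DataFrame({"c": [5, 6]}, index=["k", "b"]),
]

# Rows are the non-concatenation axis; labels appear in first-seen order.
result = pd.concat(frames, axis=1, sort=False)
print(list(result.index))  # ['m', 'k', 'z', 'b']
```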

@torext
Contributor

torext commented Feb 15, 2024

My vote very much goes to option 2. Setting sort=False should then behave the way @rhshadrach describes his preferred concat behaviour. Though I really must say, I don't quite see how often that is useful in real-world examples. Given how much time series-related functionality is built into Pandas I'm surprised option 2 isn't considered the more reasonable use case. In all the applications I see Pandas used it's very rare to have a DataFrame with both "label-style" index and columns; usually one of the two is rather a time dimension. Maybe @rhshadrach can provide concrete real-world examples where the non-sorting behaviour is desirable on a true outer join? I personally can't think of any good ones...

I'm more than happy however to provide a slew of examples of concat usage where one of the axes represents a time dimension and, consequently, maintaining sorting on an outer join along the other axis is paramount. Maybe however I'm using the wrong tool? I.e. should I be using merge in those cases?

@rhshadrach
Member

> In all the applications I see Pandas used it's very rare to have a DataFrame with both "label-style" index and columns; usually one of the two is rather a time dimension.

I always use "label-style" index and columns.

> Maybe @rhshadrach can provide concrete real-world examples where the non-sorting behaviour is desirable on a true outer join?

Any DataFrame where the order of the rows/columns contains information, such as a sequence of events.

> I'm more than happy however to provide a slew of examples of concat usage where one of the axes represents a time dimension and, consequently, maintaining sorting on an outer join along the other axis is paramount.

And that is still quite easy to attain: df.sort_index(axis="index") and df.sort_index(axis="columns"). However, if we sorted the non-concatenating index by default, how would users get the alternative order back? That isn't so easy.
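
A rough sketch of the asymmetry, using invented frames: sorting a first-seen-order result is one call, while restoring first-seen order from a sorted result needs the original inputs. The `first_seen` helper list below is a hypothetical workaround, not a pandas feature.

```python
import pandas as pd

frames = [
    pd.DataFrame({"x": [1, 2]}, index=["c", "a"]),
    pd.DataFrame({"y": [3, 4]}, index=["b", "a"]),
]

# From first-seen order, a sorted result is a single call away.
unsorted = pd.concat(frames, axis=1, sort=False)
sorted_rows = unsorted.sort_index(axis="index")

# The reverse direction has to rebuild first-seen order by hand from the inputs:
# dict.fromkeys keeps insertion order and drops repeated labels.
first_seen = list(dict.fromkeys(label for f in frames for label in f.index))
restored = sorted_rows.reindex(first_seen)

print(list(sorted_rows.index))
print(list(restored.index))
```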

@torext
Contributor

torext commented Mar 3, 2024 via email

@rhshadrach
Member

rhshadrach commented Mar 7, 2024

> By using the sort=False argument in pd.concat, no?

Perhaps I'm misunderstanding. @lukemanley opened this issue with the line:

> pd.concat has a specific behavior to always sort DatetimeIndex when join="outer" and the non-concatenation axes are not aligned.

Then presented two options to rectify the inconsistency. The first is to treat DatetimeIndex as all other dtypes are treated today. The second was:

> Always sort the non-concatenation axis when join="outer" (for all dtypes).

and this is the one you agreed with in #57335 (comment). Perhaps "always" doesn't mean what I think it means?

> Are you suggesting that otherwise anyone working almost exclusively with timeseries-like data will always have to invoke pd.concat with subsequent explicit sort calls?

No - sort=True will still sort.

@rhshadrach
Member

Rereading your previous comment, it seems to me you're advocating changing the default of sort in concat to True across the board for all dtypes. If that's the case, I think it should be discussed in a separate issue - this one is about rectifying a specific inconsistency with DatetimeIndex.

If, on the other hand, you're advocating for a dtype-dependent sorting default (sort=True for DatetimeIndex, False for other dtypes), then I think this is too complex and surprising of a default.

rhshadrach added the Reshaping and Sorting labels on Mar 7, 2024
@torext
Contributor

torext commented Mar 18, 2024

> Perhaps "always" doesn't mean what I think it means?

Yeah, I was maybe not being clear here: by "always" I really meant when sort=True, and to have that as the default, at least in the DatetimeIndex case.

> If, on the other hand, you're advocating for a dtype-dependent sorting default (sort=True for DatetimeIndex, False for other dtypes), then I think this is too complex and surprising of a default.

Yeah I can see your point re. surprising behaviour and complexity here. I'd be quite content with sort=True by default across the board indeed. As I mentioned, .groupby() already has sort=True by default.

> No - sort=True will still sort.

Sure, but if one forgets to pass it, the result will end up unsorted, likely leading to failures that give erroneous results without throwing an error, i.e. the worst kind of bug (again, coming from the timeseries perspective here). Putting what I'm saying differently: don't you think that sorting by default is safer than not sorting by default? Sure, the former might have unnecessary overhead in some cases, but those can be solved by then explicitly setting sort=False; the same goes for the cases where one wants to "preserve" order (i.e. the appending behaviour).

Come to think of it, I would actually also argue that the behaviour of appending indices on the non-concatenation axis is not 100% transparent either: if concat is called on e.g. a dictionary, then one tends to think of that as an unsorted collection of elements to be concatenated, yet their implicit order will influence the outcome of the non-concatenation axis when sort=False. You see what I mean?
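
A small sketch of the dictionary case, with invented frames and keys: the same two frames passed in a different insertion order give a different label order on the non-concatenation axis when sort=False.

```python
import pandas as pd

first = pd.DataFrame({"v": [1, 2]}, index=["c", "a"])
second = pd.DataFrame({"v": [3, 4]}, index=["b", "a"])

# Same contents, different dict insertion order.
forward = pd.concat({"first": first, "second": second}, axis=1, sort=False)
backward = pd.concat({"second": second, "first": first}, axis=1, sort=False)

print(list(forward.index))   # ['c', 'a', 'b']
print(list(backward.index))  # ['b', 'a', 'c']
```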

@rhshadrach
Member

> I'd be quite content with sort=True by default across the board indeed. As I mentioned, .groupby() already has sort=True by default.

As mentioned above, I recommend opening a new issue if you'd like to pursue changing the default value of sort.
