DEPR: pd.concat special cases DatetimeIndex to sort even when sort=False #57335
Comments
+1 on (1) - different sorting defaults for different dtypes is very surprising. For (2), I do prefer concat's way of maintaining order by starting with the first frame and tacking on any new labels to the end as it proceeds through the list. When this order is desired, I think this behavior is harder for the user to replicate.
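A minimal sketch of the ordering described above, using string labels so the `DatetimeIndex` special case does not come into play. The frame names and values are made up for illustration.

```python
import pandas as pd

# Two frames whose row labels only partly overlap (illustrative data).
first = pd.DataFrame({"x": [1, 2]}, index=["b", "c"])
second = pd.DataFrame({"y": [3, 4]}, index=["a", "b"])

# sort=False: keep the first frame's labels, then append unseen labels.
appended = pd.concat([first, second], axis=1, sort=False)
print(list(appended.index))  # ['b', 'c', 'a']

# sort=True: the non-concatenation axis is sorted instead.
sorted_labels = pd.concat([first, second], axis=1, sort=True)
print(list(sorted_labels.index))  # ['a', 'b', 'c']
```

Going from the appended order to the sorted one is a single `sort_index()` call, whereas rebuilding the appended order from a sorted result needs the original frames.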
My vote very much goes to option 2. Setting I'm more than happy however to provide a slew of examples of |
I always use "label-style" index and columns.
Any DataFrame where the order of the rows/columns contains information, such as a sequence of events.
And that is still quite easy to attain: |
However, if we sorted the non-concatenating index by default, how would users get the alternative order back? That isn't so easy.
By using the `sort=False` argument in `pd.concat`, no? Indeed I'm not saying the sort kwarg should be ignored as it used to be, I'm just arguing that the default of sorting seems very reasonable to me in order to keep a vast majority of code short and sweet.
The label-only cases surely are out there, but given how much timeseries functionality has been built into pandas I tend to be convinced that many use cases these days have a time dimension on one of the axes. For what it's worth, essentially all of mine do.
Are you suggesting that otherwise anyone working almost exclusively with timeseries-like data will always have to invoke pd.concat with subsequent explicit sort calls? That seems rather verbose and error-prone to me. There surely then must be a better way to safely assemble DataFrames in such a setting; maybe concat is not the right thing to use?
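For reference, the "concat plus explicit sort" pattern in question could look roughly like the sketch below. `concat_then_sort` is a hypothetical helper name, not a pandas function, and the dates are made up.

```python
import pandas as pd


def concat_then_sort(frames):
    """Hypothetical helper: outer-concat along columns, then sort the row index."""
    # The explicit sort_index() makes the result independent of concat's
    # sorting default, whatever that default is for the index dtype.
    return pd.concat(frames, axis=1, join="outer").sort_index()


# Two monotonic time series with interleaved timestamps (illustrative data).
a = pd.DataFrame({"a": [1.0, 2.0]}, index=pd.to_datetime(["2024-01-02", "2024-01-04"]))
b = pd.DataFrame({"b": [3.0, 4.0]}, index=pd.to_datetime(["2024-01-01", "2024-01-03"]))

combined = concat_then_sort([a, b])
print(combined.index.is_monotonic_increasing)  # True
```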
Perhaps I'm misunderstanding. @lukemanley opened this issue with the line:
Then presented two options to rectify the inconsistency. The first is to treat
and this is the one you agreed with in #57335 (comment). Perhaps "always" doesn't mean what I think it means?
No - |
Rereading your previous comment, it seems to me you're advocating changing the default of If, on the other hand, you're advocating for a dtype-dependent sorting default ( |
Yeah was maybe not being clear here: by "always" I really meant when
Yeah I can see your point re. surprising behaviour and complexity here. I'd be quite content with
Sure, but if one forgets to call that, the result will end up unsorted, likely leading to failures that give erroneous results without throwing an error, i.e. the worst kind of bug (again, coming from the timeseries perspective here). Putting what I'm saying differently: don't you think that sorting by default is safer than not sorting by default? Sure, the former might have unnecessary overhead in some cases, but those can be solved by then explicitly setting Come to think of it, I would actually also argue that the behaviour of appending indices on the non-concatenation axis is not 100% transparent either: if |
As mentioned above, I recommend opening a new issue if you'd like to pursue changing the default value of sort.
`pd.concat` has a specific behavior to always sort `DatetimeIndex` when `join="outer"` and the non-concatenation axes are not aligned. This was undocumented (prior to 2.2.1) and is inconsistent with all other index types and data types, including other temporal types such as pyarrow timestamps/dates/times.

An attempt to treat this as a bug fix highlighted that this has been long-standing behavior that users may be accustomed to. (xref #57006)
Here are two options that come to mind:

1. Deprecate the existing behavior and do not sort by default for `DatetimeIndex`. This would simplify things by removing the carve-out and treating all index types / data types consistently. This would require users to explicitly pass `sort=True` when concatenating two frames with monotonic indexes if they wanted to ensure a monotonic result.
2. Always sort the non-concatenation axis when `join="outer"` (for all dtypes). This would be consistent with how `pd.merge` and `DataFrame.join` handle outer merges and in practice may be more useful behavior since concatenating two frames with monotonic indexes will return a frame with a monotonic index.
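A rough illustration of the two behaviors the options refer to, assuming two small frames with interleaved, individually monotonic `DatetimeIndex` values (the data is made up). It shows only calls whose result does not depend on the special case: an explicit `sort=True`, and an outer `DataFrame.join`. The result of `sort=False` with a `DatetimeIndex` is exactly what this issue is about and may differ between pandas versions, so it is not shown.

```python
import pandas as pd

# Each frame has a monotonic DatetimeIndex; the two are not aligned.
left = pd.DataFrame({"x": [1, 2]}, index=pd.to_datetime(["2024-01-02", "2024-01-04"]))
right = pd.DataFrame({"y": [3, 4]}, index=pd.to_datetime(["2024-01-01", "2024-01-03"]))

# Option 1 would require this explicit sort=True to guarantee a monotonic result.
explicit = pd.concat([left, right], axis=1, join="outer", sort=True)
print(explicit.index.is_monotonic_increasing)  # True

# Option 2 would match an outer join, which sorts the union of the indexes.
joined = left.join(right, how="outer")
print(joined.index.is_monotonic_increasing)  # True
```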