Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: timestamp resolution inference - default to one unit (if possible) instead of being data-dependent? #58989

Open
jorisvandenbossche opened this issue Jun 12, 2024 · 12 comments
Labels
Datetime Datetime data dtype Non-Nano datetime64/timedelta64 with non-nanosecond resolution Timestamp pd.Timestamp and associated methods
Milestone

Comments

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jun 12, 2024

After #55901, we now do inference of the best resolution, and so allow to create non-nanosecond data by default (instead of raising for out of bounds data).

To be clear, it is a very nice improvement to stop raising those OutOfBounds errors while the timestamp would perfectly fit in another resolution. But I do think we could maybe reconsider the exact logic of how to determine the resolution.

With the latest changes you get the following:

>>> pd.to_datetime(["2024-03-22 11:43:01"]).dtype
dtype('<M8[s]')
>>> pd.to_datetime(["2024-03-22 11:43:01.002"]).dtype
dtype('<M8[ms]')
>>> pd.to_datetime(["2024-03-22 11:43:01.002003"]).dtype
dtype('<M8[us]')

The resulting dtype instance depends on the exact input value (not type). I do think this has some downsides:

  • The result dtype becomes very data dependent (while in general we want to avoid value dependent behavior)
  • You can very easily get multiple datetime dtypes in a workflow, causing more casting (to different unit) than necessary

The fact that pandas by default truncates the string repr of datetimes (i.e. we don't show the subsecond parts if they are all zero, regardless of the actual resolution), in contrast to numpy, also means that round-tripping through a text representation (eg CSV) will very often lead to a change in dtype.

As a potential alternative, we could also decide to have a fixed default resolution (e.g. microseconds), and then the logic for inferring the resolution could be: try to use the default resolution, and only if that does not work (either out of bounds or too much precision, i.e. nanoseconds present), use the inferred resolution from the data.

That still gives some values dependent behaviour, but I think this would make it a lot less common to see. And using a resolution like microseconds is sufficient for by far most use cases (in terms of bounds it supports: [290301 BC, 294241 AD])

@jorisvandenbossche jorisvandenbossche added Datetime Datetime data dtype Non-Nano datetime64/timedelta64 with non-nanosecond resolution Timestamp pd.Timestamp and associated methods labels Jun 12, 2024
@jorisvandenbossche
Copy link
Member Author

cc @pandas-dev/pandas-core

@WillAyd
Copy link
Member

WillAyd commented Jun 17, 2024

This sounds reasonable and I think could help simplify the implementation

@bashtage
Copy link
Contributor

I think this would be an improvement. Seems like a good idea that anyone working with human times (say down to second precision) with a range of the modern era would get the same basis for the timestamp.

@jorisvandenbossche
Copy link
Member Author

I think could help simplify the implementation

I don't think it will simplify things generally, because we still need the current inference logic when the default unit does not fit, but from looking a bit into it, I also don't think it should make the code much more complex.

@Pranav-Wadhwa
Copy link
Contributor

take

@Pranav-Wadhwa
Copy link
Contributor

Based on discussions, I will update to_datetime to always use nanoseconds in the given scenarios.

@Pranav-Wadhwa
Copy link
Contributor

@jorisvandenbossche would updating the to_datetime function to accept a default unit='ns' be a feasible solution for this? Or are there cases where it wouldn't make sense to default to nanoseconds?

@WillAyd
Copy link
Member

WillAyd commented Aug 3, 2024

@Pranav-Wadhwa nanoseconds is what we used previously, so I don't think we want to go back to that. The OP suggests microseconds as a default resolution, although I'm not sure its as simple as changing the to_datetime signature either.

Before diving into the details I think should get some more agreement from the pandas core team. @jbrockmendel is our datetime guru so let's see if he has any thoughts first

@jbrockmendel
Copy link
Member

I’m fine with OP suggestion as long as we are internally consistent, I.e. Timestamp constructor

@Pranav-Wadhwa
Copy link
Contributor

@jbrockmendel what do you mean by timestamp constructor? If we set the default value of unit='us' in to_datetime, it resolves the example that the OP mentioned but would require changing many test cases in tests/tools/test_to_datetime.py.

@rhshadrach
Copy link
Member

I would prefer going even further: don't automatically fallback if us is not suitable - require the user to manually override if they are dealing with dates beyond that range. That said, I'm still in favor of the OP as-is being what I see as a step in the right direction.

@Pranav-Wadhwa - I believe there are many places this would need to change in pandas to be consistent beyond just to_datetime.

@Pranav-Wadhwa
Copy link
Contributor

@Pranav-Wadhwa - I believe there are many places this would need to change in pandas to be consistent beyond just to_datetime.

I believe the scope of this work is outside my knowledge as this is my first issues with pandas. If it's helpful to future assignees, I found that the to_datetime function would default to the appropriate unit if no unit was passed in, but if us was passed in as the unit, it would create an object with dtype ns inside the _to_datetime_with_unit function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Non-Nano datetime64/timedelta64 with non-nanosecond resolution Timestamp pd.Timestamp and associated methods
Projects
None yet
Development

No branches or pull requests

6 participants