API/PERF: when to check for mismatched tzs/awareness in array_to_datetime #55779

Open
jbrockmendel opened this issue Oct 31, 2023 · 2 comments
Labels
API Design, Constructors (Series/DataFrame/Index/pd.array Constructors), Datetime (Datetime data dtype), Performance (Memory or execution speed performance)

Comments

@jbrockmendel
Member

jbrockmendel commented Oct 31, 2023

import pandas as pd

ts = pd.Timestamp("2016-01-01", tz="UTC")
ts2 = ts.tz_convert("US/Pacific")

>>> pd.to_datetime([ts, ts2])
ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True, at position 1

>>> pd.to_datetime([ts.isoformat(), ts2.isoformat()], format="mixed")
<stdin>:1: FutureWarning: In a future version of pandas, parsing datetimes with mixed time zones will raise an error unless `utc=True`. Please specify `utc=True` to opt in to the new behaviour and silence this warning. To create a `Series` with mixed offsets and `object` dtype, please use `apply` and `datetime.datetime.strptime`
Index([2016-01-01 00:00:00+00:00, 2015-12-31 16:00:00-08:00], dtype='object')

If we pass mixed-tz datetime objects, we do a tz-match check at each step of the loop inside array_to_datetime/array_strptime (specifically in state.process_datetime). If we pass mixed-tz strings, the analogous check happens outside the loop. (Per #55693, we currently don't have mixed-type checks.)

Eventually these checks should be shared, which means we need to decide between the in-loop and after-loop versions. Three differences for users are:

  1. the in-loop version adds f"at position {i}" to the exception message.
  2. the in-loop version can be handled differently based on errors=coerce/ignore.
  3. the in-loop version runs on every iteration and so presumably incurs a performance penalty.
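To make the two placements concrete, here is a minimal pure-Python sketch. It is not the pandas implementation: `check_in_loop` and `check_after_loop` are hypothetical names, the error messages are invented, and plain `datetime` objects stand in for whatever `array_to_datetime` actually iterates over.

```python
from datetime import datetime, timedelta, timezone


def check_in_loop(values):
    """Hypothetical in-loop check: compare each offset as the item is visited."""
    out = []
    first_offset = None
    for i, val in enumerate(values):
        offset = val.utcoffset()  # None for tz-naive datetimes
        if i == 0:
            first_offset = offset
        elif offset != first_offset:
            # The loop index is available here, so the position can be reported.
            raise ValueError(f"mismatched tz/awareness, at position {i}")
        out.append(val)
    return out


def check_after_loop(values):
    """Hypothetical after-loop check: record offsets, compare once at the end."""
    out = []
    offsets = set()
    for val in values:
        offsets.add(val.utcoffset())
        out.append(val)
    if len(offsets) > 1:
        # No single offending position is known at this point.
        raise ValueError("mismatched tz/awareness")
    return out


utc = datetime(2016, 1, 1, tzinfo=timezone.utc)
pacific = utc.astimezone(timezone(timedelta(hours=-8)))
```

With `[utc, pacific]` as input, both functions raise `ValueError`, but only the in-loop version can name the offending position.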

The errors=coerce/ignore part is the API part of the issue (though xref #54467 for deprecating ignore). I think it is very likely that the original intent of coerce was to handle invalid individual items, not invalid combinations of items, so I would be OK with the API change that would come with moving this outside the loop.
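One possible reading of that API change, sketched in pure Python: per-item parse failures still respect `errors="coerce"`, while the combination check runs once after the loop and is not coerced. The function `parse_with_coerce` is hypothetical, `None` stands in for `NaT`, and the issue itself does not specify what the after-loop check should raise.

```python
from datetime import datetime


def parse_with_coerce(strings, errors="raise"):
    """Hypothetical parser: coerce bad items, but not bad combinations."""
    out = []
    offsets = set()
    for i, s in enumerate(strings):
        try:
            val = datetime.fromisoformat(s)
        except ValueError:
            if errors == "coerce":
                out.append(None)  # stand-in for NaT: an invalid individual item
                continue
            raise ValueError(f"could not parse {s!r}, at position {i}") from None
        offsets.add(val.utcoffset())
        out.append(val)
    # Assumption: an invalid combination of items raises regardless of `errors`.
    if len(offsets) > 1:
        raise ValueError("mixed offsets/awareness")
    return out
```

Under this sketch, `parse_with_coerce(["2016-01-01T00:00:00+00:00", "garbage"], errors="coerce")` returns one datetime and one `None`, whereas two valid strings with different offsets raise even with `errors="coerce"`.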

@jbrockmendel added the Bug, Needs Triage, Datetime, Performance, API Design, and Constructors labels and removed the Bug and Needs Triage labels on Oct 31, 2023
@MarcoGorelli
Member

outside the loop sounds fine

@jbrockmendel
Member Author

Discussed on yesterday's dev call; there was consensus for do-it-at-the-end.
