
ENH/API: resolution inference in vectorized datetime parsing #55564

Closed
5 of 6 tasks
jbrockmendel opened this issue Oct 17, 2023 · 10 comments · Fixed by #55901
Labels
Constructors (Series/DataFrame/Index/pd.array Constructors), Datetime (Datetime data dtype), Enhancement, Non-Nano (datetime64/timedelta64 with non-nanosecond resolution)

Comments

@jbrockmendel
Member

jbrockmendel commented Oct 17, 2023

I'm in the process of implementing resolution inference for vectorized datetime parsing in array_to_datetime. This issue is to track and discuss design issues.

  1. Should we implement this as a breaking change in 3.0 or bugfix in 2.x? We changed the scalar behavior in 2.0.
  2. What to do when mixed resolutions are detected? e.g. pd.to_datetime(["2016-01-01", "2016-01-01T02:03:04.050607"])? ATM in the branch I have going, I take the resolution from the first non-NaT entry and apply it everywhere (see the sketch after this list).
    i) Are we OK with that value-dependent behavior?
    ii) What if we see a np.datetime64("nat", "s"), i.e. a NaT with a reso attached? Should we infer "s" from that?
  3. What resolution to infer for "today" or "now"? ATM the Timestamp constructor gives "ns", but because it goes through the stdlib datetime.now, I'm thinking "us" might be more appropriate. (I also don't care that much.)
  4. ATM I only handle array_to_datetime; array_strptime also needs to be handled.
  5. Need to do the same for the timedelta64 paths.
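
For item 2, a minimal sketch of what the first-non-NaT rule means in practice. The scalar lines reflect pandas >= 2.0 behavior; the vectorized comments describe the in-progress branch and are an assumption, not released behavior:

```python
import pandas as pd

# Scalar parsing already infers resolution (pandas >= 2.0):
pd.Timestamp("2016-01-01").unit                   # "s"
pd.Timestamp("2016-01-01T02:03:04.050607").unit   # "us"

# Vectorized parsing on main still always gives nanoseconds:
pd.to_datetime(["2016-01-01", "2016-01-01T02:03:04.050607"]).dtype  # datetime64[ns]

# Under the first-non-NaT rule described above, the same call would presumably
# return datetime64[s] because the first entry is date-only; what happens to the
# sub-second part of the second entry is exactly the open question in 2i.
```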

ATM 1455 tests are failing locally. Hopefully these are mostly because they hard-code ns in "expected".

cc @MarcoGorelli

Issues that I think implementing this will address

@jbrockmendel added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels on Oct 17, 2023
@mroeschke
Member

mroeschke commented Oct 17, 2023

Initial responses:

  1. I would say bug fix in 2.x. I think a few bug reports are due to the discrepancy in behavior between the scalar and vectorized cases.

  2. I would assume the least breaking solution is to apply the highest inferred resolution value to all values.
    2ii. I would assume "s" would be picked up from np.datetime64("nat", "s")

  3. Agree with "us"
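
For 2ii, the unit really is recoverable from the NaT value itself, e.g.:

```python
import numpy as np

nat = np.datetime64("nat", "s")
nat.dtype       # dtype('<M8[s]') -- the "s" reso travels with the NaT
np.isnat(nat)   # True
```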

@jbrockmendel
Member Author

highest inferred resolution

This is reasonable, but has some downsides that we need to be clear on:

  1. It can require two-pass parsing: iterate over the values once to find the resolution, then again to parse with that resolution. In cases with uniform resolution we can avoid the second pass.
  2. In cases with e.g. ["2999-01-01", "1970-01-01T02:03:04.000000000"], the reso inferred from the second entry will be "ns", which will cause us to raise OutOfBoundsDatetime on the first entry. But since the tail of the second entry is all zeros, we could use "us" and avoid either raising or rounding. This is a pretty weird corner case though (illustrated below).
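
A quick illustration of that corner case using the scalar Timestamp API (assuming pandas >= 2.0 behavior; this only shows the bounds issue, not the branch's code):

```python
import pandas as pd

ts = pd.Timestamp("2999-01-01")   # parses fine; inferred unit is "s"
ts.as_unit("us")                  # fine: "us" comfortably covers year 2999
# ts.as_unit("ns")                # would raise OutOfBoundsDatetime (ns range ends 2262-04-11)

# The second entry has nine fractional digits but they are all zeros,
# so "us" represents it exactly as well:
pd.Timestamp("1970-01-01T02:03:04.000000000").as_unit("us")   # no information lost
```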

@jbrockmendel
Member Author

The current branch uses the highest encountered reso. Two options for how to do that (the branch currently uses the second):

  1. Iterate once to find the highest resolution, then again to store the result with that resolution
  2. Parse with the first encountered resolution. If a higher one is encountered, start over with that resolution. The downside here is we could potentially have to parse up to 4 times if we encounter s, ms, us, ns in that order. The upside is that in many cases we will only have to iterate once.
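
A rough pure-Python sketch of the two strategies (illustrative only; the real code is cython-level, and the infer_reso / parse_one helpers here are made-up stand-ins; NaT and empty-input handling omitted):

```python
import numpy as np

# Rank resolutions so "higher" means finer-grained.
RESO_RANK = {"s": 0, "ms": 1, "us": 2, "ns": 3}

def parse_two_pass(values, infer_reso, parse_one):
    # Option 1: one pass to find the finest resolution, a second pass to parse with it.
    reso = max((infer_reso(v) for v in values), key=RESO_RANK.__getitem__)
    return np.array([parse_one(v, reso) for v in values], dtype=f"M8[{reso}]")

def parse_with_restart(values, infer_reso, parse_one):
    # Option 2: parse with the first resolution seen; start over whenever a finer
    # one shows up (worst case s, ms, us, ns in that order -> four passes, but
    # uniform inputs are handled in a single pass).
    reso = None
    while True:
        result, restarted = [], False
        for val in values:
            cur = infer_reso(val)
            if reso is None:
                reso = cur
            elif RESO_RANK[cur] > RESO_RANK[reso]:
                reso = cur
                restarted = True
                break
            result.append(parse_one(val, reso))
        if not restarted:
            return np.array(result, dtype=f"M8[{reso}]")
```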

@mroeschke
Member

Is resolution inference expensive? While option 2 gives the opportunity for a single parse, the predictable performance profile of option 1 is appealing.

@jbrockmendel
Member Author

Is resolution inference expensive?

Depends on what kind of objects we're dealing with. For strings we have to parse them in order to get the resolution, so we'd be looking at pretty much doubling the current cost (could probably get single-pass for homogeneous cases).

ATM I still have 230 tests failing locally. Most of these boil down to updating expected, which I'm slowly whittling down, but some of them are surfacing real bugs. Then we'll need to make some API decisions, e.g. ATM on the branch read_excel is producing dt64[s] for .ods files and dt64[us] for everything else (vs dt64[ns] in all cases on main); are we OK with that?
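
To make the string case concrete: inferring the resolution essentially means looking at the fractional-seconds field, which already costs most of a parse. A toy version (not pandas' actual cython parser):

```python
def infer_string_reso(val: str) -> str:
    # Toy inference: date-only and whole-second strings -> "s",
    # otherwise pick the coarsest unit that keeps every fractional digit.
    if "." not in val:
        return "s"
    frac = val.split(".", 1)[1].rstrip("Z")
    if len(frac) <= 3:
        return "ms"
    if len(frac) <= 6:
        return "us"
    return "ns"

infer_string_reso("2016-01-01")                   # "s"
infer_string_reso("2016-01-01T02:03:04.050607")   # "us"
```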

@mroeschke
Member

Then we'll need to make some API decisions, e.g. ATM on the branch read_excel is producing dt64[s] for .ods files and dt64[us] for everything else (vs dt64[ns] in all cases on main); are we OK with that?

I'm OK with those differences if there's some inherent resolution-specific limitation to an Excel format.

@MarcoGorelli
Member

Should we implement this as a breaking change in 3.0 or bugfix in 2.x? We changed the scalar behavior in 2.0.

3.0 break sounds fine

What to do when mixed resolutions are detected? e.g. pd.to_datetime(["2016-01-01", "2016-01-01T02:03:04.050607"])? ATM in the branch I have going I get the resolution from the first non-NaT entry and apply that everywhere.

I think either of the following should be fine:

  • Take the resolution from the first non-NaT element: for stdlib datetimes interpret it as 'us', for np.datetime64 take the element's own resolution. Let the user override the resolution with some argument (unit?).
  • Regardless of values, default to 'us', but let the user override it with unit.

Regarding the second one, you said there were some implementation details getting in the way of this; reckon they're surmountable?

If so, then speaking totally theoretically, I think this is what I'd prefer: auto-infer for scalars, but don't infer for arrays.

But no strong preference really. Even a double pass wouldn't be terrible; I'd be OK with option 1 as you've described it.
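
For what it's worth, whichever inference rule lands, callers can already pin an explicit resolution after parsing with DatetimeIndex.as_unit (pandas >= 2.0); the unit-style override discussed above is hypothetical and would just make that a single step:

```python
import pandas as pd

idx = pd.to_datetime(["2016-01-01", "2016-01-01T02:03:04.050607"])
idx = idx.as_unit("us")   # pin the resolution explicitly, regardless of what was inferred
idx.dtype                 # datetime64[us]
```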

@jbrockmendel
Member Author

@bashtage the branch I have for this currently breaks a bunch of stata tests where resolution no longer round-trips. Do you know what resolution(s) Stata stores datetimes in? Thoughts on the desired behavior?

@bashtage
Contributor

Stata uses a wide variety of resolutions.

This is the key function:

def _stata_elapsed_date_to_datetime_vec(dates: Series, fmt: str) -> Series:

There is a note that once resolutions other than ns are implemented, it would make sense to use the same format as Stata does where possible, e.g., a yearly Stata date series would be returned as a yearly pandas datetime.

It has

  • subsecond (millisecond?): code tc
  • days: code td
  • week of the year: code tw
  • month: code tm
  • quarter: code tq
  • half year: code th
  • year: code ty
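
For reference, a speculative sketch of how those codes might map onto pandas types; only the codes themselves come from the list above, the right-hand side is guesswork:

```python
# Stata date/time format codes and plausible pandas-side targets (speculative).
STATA_FMT_NOTES = {
    "tc": "clock time        -> datetime64[ms]",
    "td": "daily date        -> datetime64[s] or Period[D]",
    "tw": "week of the year  -> Period[W]",
    "tm": "month             -> Period[M]",
    "tq": "quarter           -> Period[Q]",
    "th": "half year         -> no exact pandas frequency",
    "ty": "year              -> Period[Y]",
}
```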

@jbrockmendel added the Constructors, Non-Nano, and Enhancement labels and removed the Bug and Needs Triage labels on Nov 1, 2023
@jbrockmendel
Member Author

Status Update:

I have a branch that implements resolution inference in DatetimeIndex and to_datetime. ATM 63 tests are failing, down from an initial estimate of "zillions".

Many tests that currently hard-code "ns" in expected have been updated to hard-code a new resolution. In some cases this is fine; in others a better solution would be to parametrize over units.

Still surfacing new bugs which I'm addressing as they come up.

Most of the remaining failing tests are in the io tests. This includes 11 for test_stata that may be pre-empted by #55642, and a handful for SAS that can potentially be pre-empted by changing return types to non-nano to better match the SAS format (though see #28047 (comment)). ATM the JSON cases look most liable to be problematic for users; will put an example up soonish.

POCs for the cython-level implementation: #55741 and #55778.
