Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEPR offsets: rename 'M' to 'ME' #52064

Conversation

natmokval
Copy link
Contributor

@natmokval natmokval commented Mar 18, 2023

This pr is related to issue #9586.

Added warning to to_offset(freq) in case freq='M'. This is the first step to deprecate 'M' for MonthEnd.
TODO: let 'M' keep working for Period

example to reproduce:

for orig_freq, target_freq in [('10D', '1M'), ('3M', 'Y')]:
    print(f"{orig_freq}->{target_freq}")
    ind = pd.date_range('2022-01-01', freq=orig_freq, periods=5)
    s = pd.Series(np.arange(5), ind)
    print(s) 
    print(s.resample(target_freq, closed='right').asfreq())

both date_range and resample produce warnings.

pandas/core/resample.py:1604: UserWarning: 'M' will be deprecated, please use 'ME' for 'month end'
  freq = to_offset(freq)

@natmokval
Copy link
Contributor Author

The _prefix in the class MonthEnd is redefined from “M” to "ME”.
In to_offset we check, if “M” is passed, then raise a warning and change “M” to “ME”.

Now we can apply both, “M” and “ME”. The example below raises one warning for freq=‘3M’

for orig_freq, target_freq in [('10D', '1ME'), ('3M', 'Y')]:
    print(f"{orig_freq}->{target_freq}")
    ind = pd.date_range('2022-01-01', freq=orig_freq, periods=5)
    s = pd.Series(np.arange(5), ind)
    print(s) 
    print(s.resample(target_freq, closed='right').asfreq())

In order to reduce the number of test errors I also changed “M” to “ME” in pandas/tests/arrays/test_datetimelike.py:35

@pytest.fixture(params=["D", "B", "W", "ME", "Q", "Y"])

This reduced number of test errors from 1400 to 68.

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks for having started this!

So, this correctly throws a warning for date_range and resample with 'ME'

However, it also throws a warning for PeriodIndex(..., freq='M'), which as per #9586 (comment) we need to keep working

So, we need to figure out how to pass 'M' to PeriodIndex without it warning, but such that passing 'M' to date_range would warn

@MarcoGorelli
Copy link
Member

Maybe to_offset could have a is_period boolean argument, which would be passed as True when called when instantiating a Period, i.e.

freq = to_offset(freq) # type: ignore[assignment]

And then, instead of warning if name == 'M', if you warn if name == 'M' and not is_period

@natmokval
Copy link
Contributor Author

However, it also throws a warning for PeriodIndex(..., freq='M'), which as per #9586 (comment) we need to keep working

So, we need to figure out how to pass 'M' to PeriodIndex without it warning, but such that passing 'M' to date_range would warn

Thanks for helping me with the issue.

I noticed the same inconsistency for Period. It also throws a warning for Period(..., freq='M'):

>>> pd.Period('2017Q3', freq='M')
Period('2017-07', 'ME')

I will work on fixing it.

@MarcoGorelli
Copy link
Member

we might need to update

def __repr__(self) -> str:
base = self._dtype._dtype_code
formatted = period_format(self.ordinal, base)
return f"Period('{formatted}', '{self.freqstr}')"

so that, when it's printed, it still shows 'M'

pandas/_libs/tslibs/offsets.pyx Outdated Show resolved Hide resolved
pandas/conftest.py Outdated Show resolved Hide resolved
pandas/core/indexes/period.py Outdated Show resolved Hide resolved
@natmokval natmokval force-pushed the 9586-inconsistent-labeling-sub-daily-super-daily-frequencies branch from b07c569 to e1d82cb Compare March 30, 2023 21:44
@MarcoGorelli MarcoGorelli changed the title Frequency: raise warnings when using ‘M’ frequency DEPR offsets: rename 'M' to 'ME' Mar 31, 2023
@MarcoGorelli
Copy link
Member

Looks like this one might need updating

# Map attribute-name resolutions to resolution abbreviations
_attrname_to_abbrevs = {
"year": "A",
"quarter": "Q",
"month": "M",
"day": "D",
"hour": "H",
"minute": "T",
"second": "S",
"millisecond": "L",
"microsecond": "U",
"nanosecond": "N",
}

@natmokval
Copy link
Contributor Author

Looks like this one might need updating

# Map attribute-name resolutions to resolution abbreviations
_attrname_to_abbrevs = {
"year": "A",
"quarter": "Q",
"month": "M",
"day": "D",
"hour": "H",
"minute": "T",
"second": "S",
"millisecond": "L",
"microsecond": "U",
"nanosecond": "N",
}

Thank you, I updated _attrname_to_abbrevs as you suggested and

"M": PeriodDtypeCode.M, # Monthly

and
cpdef str npy_unit_to_abbrev(NPY_DATETIMEUNIT unit):
if unit == NPY_DATETIMEUNIT.NPY_FR_ns or unit == NPY_DATETIMEUNIT.NPY_FR_GENERIC:
# generic -> default to nanoseconds
return "ns"
elif unit == NPY_DATETIMEUNIT.NPY_FR_us:
return "us"
elif unit == NPY_DATETIMEUNIT.NPY_FR_ms:
return "ms"
elif unit == NPY_DATETIMEUNIT.NPY_FR_s:
return "s"
elif unit == NPY_DATETIMEUNIT.NPY_FR_m:
return "m"
elif unit == NPY_DATETIMEUNIT.NPY_FR_h:
return "h"
elif unit == NPY_DATETIMEUNIT.NPY_FR_D:
return "D"
elif unit == NPY_DATETIMEUNIT.NPY_FR_W:
return "W"
elif unit == NPY_DATETIMEUNIT.NPY_FR_M:
return "M"

and
cpdef NPY_DATETIMEUNIT abbrev_to_npy_unit(str abbrev):
if abbrev == "Y":
return NPY_DATETIMEUNIT.NPY_FR_Y
elif abbrev == "M":

as well
and it fixed my tests.

but I still have failures in some other tests and I keep working on fixing them.

@MarcoGorelli
Copy link
Member

thanks @natmokval - looks like CI didn't run for some reason, could you fetch and merge upstream/main please so I can take a look at any test failures? or just let me know which tests are failing for you and I'll try running them

@natmokval natmokval marked this pull request as ready for review April 5, 2023 20:34
@natmokval
Copy link
Contributor Author

thanks @natmokval - looks like CI didn't run for some reason, could you fetch and merge upstream/main please so I can take a look at any test failures? or just let me know which tests are failing for you and I'll try running them

Because I didn't want to run CI, I added '[skip ci]' to the commit message. Maybe it's better to run CI for every commit.

I did fetch, merge main, and made a new commit. Could you please take a look at the tests:
pandas/tests/indexes/period/test_indexing.py
and
pandas/tests/groupby/test_timegrouper.py

@MarcoGorelli
Copy link
Member

You'll need to update

end_types = {"M", "A", "Q", "BM", "BA", "BQ", "W"}

too

For how I found this: the failing test has df.groupby(Grouper(freq="6M")). So I looked at the definition of Grouper, which led to TimeGrouper, which led to that line

warnings.warn(
"\'M\' will be deprecated, please use \'ME\' "
"for \'month end\'",
UserWarning,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry I'd missed this when reviewing, this should have been FutureWarning

@rhshadrach
Copy link
Member

This patch may have induced a performance regression. If it was a necessary behavior change, this may have been expected and everything is okay.

Please check the links below. If any ASVs are parameterized, the combinations of parameters that a regression has been detected for appear as subbullets.

Subsequent benchmarks may have skipped some commits. The link below lists the commits that are between the two benchmark runs where the regression was identified.

61a6335...a98be06

@spencerkclark
Copy link
Contributor

Oh nice, thanks, I hadn't noticed that! Yes, that fixes most of the failures. The remaining ones seem to be a little more subtle; I will try and isolate how they relate to pandas when I get a chance.

@MarcoGorelli sorry for taking a while to get back on this — I have one more kind of example that seems to have been tripped up by this PR (note the presence of the offset parameter in this example):

Prior to this PR

>>> import numpy as np; import pandas as pd
>>> index = pd.date_range("2000", periods=5, freq="5D")
>>> series = pd.Series(np.arange(index.size), index=index)
>>> series.resample("2D", closed="right", label="right", offset="1s").mean()
2000-01-01 00:00:01    0.0
2000-01-03 00:00:01    NaN
2000-01-05 00:00:01    1.0
2000-01-07 00:00:01    NaN
2000-01-09 00:00:01    NaN
2000-01-11 00:00:01    2.0
2000-01-13 00:00:01    NaN
2000-01-15 00:00:01    3.0
2000-01-17 00:00:01    NaN
2000-01-19 00:00:01    NaN
2000-01-21 00:00:01    4.0
Freq: 2D, dtype: float64

On main as of 2023-10-29

>>> import numpy as np; import pandas as pd
>>> index = pd.date_range("2000", periods=5, freq="5D")
>>> series = pd.Series(np.arange(index.size), index=index)
>>> series.resample("2D", closed="right", label="right", offset="1s").mean()
2000-01-01 00:00:01    0.0
2000-01-03 00:00:01    NaN
2000-01-05 00:00:01    NaN
2000-01-07 00:00:01    1.0
2000-01-09 00:00:01    NaN
2000-01-11 00:00:01    2.0
2000-01-13 00:00:01    NaN
2000-01-15 00:00:01    NaN
2000-01-17 00:00:01    3.0
2000-01-19 00:00:01    NaN
2000-01-21 00:00:01    4.0
Freq: 2D, dtype: float64

Thanks again for your help!

@MarcoGorelli
Copy link
Member

Hey @spencerkclark

thanks again for the report - you trying things out on main so we can fix them before they reach users really makes my day!

So, let's have a look at what's going on. We've put closed='right' and label='right', so I'd expect the windows to look like this:

  • 2000-01-01 00:00:01: (1999-12-30 00:00:01, 2000-01-01 00:00:01]. [0]
  • 2000-01-03 00:00:01: (2000-01-01 00:00:01, 2000-01-03 00:00:01]. []
  • 2000-01-05 00:00:01: (2000-01-03 00:00:01, 2000-01-05 00:00:01]. []
  • 2000-01-07 00:00:01: (2000-01-05 00:00:01, 2000-01-07 00:00:01]. [1]
  • 2000-01-09 00:00:01: (2000-01-07 00:00:01, 2000-01-09 00:00:01]. []
  • 2000-01-11 00:00:01: (2000-01-09 00:00:01, 2000-01-11 00:00:01]. [2]
  • 2000-01-13 00:00:01: (2000-01-11 00:00:01, 2000-01-13 00:00:01]. []
  • ...

Looks like this matches the current output, and that the previous one was incorrect?

Furthermore, Polars agrees with the current pandas output:

In [31]: dfpl = pl.from_pandas(series.to_frame().reset_index().rename(columns={0: 'value'}))

In [32]: dfpl.sort('index').group_by_dynamic(
    ...:     'index',
    ...:     every='2d',
    ...:     offset=timedelta(seconds=1)-timedelta(days=1),
    ...:     closed='right',
    ...:     label='right').agg(pl.col('value'))
Out[32]:
shape: (5, 2)
┌─────────────────────┬───────────┐
│ indexvalue     │
│ ------       │
│ datetime[ns]        ┆ list[i64] │
╞═════════════════════╪═══════════╡
│ 2000-01-01 00:00:01 ┆ [0]       │
│ 2000-01-07 00:00:01 ┆ [1]       │
│ 2000-01-11 00:00:01 ┆ [2]       │
│ 2000-01-17 00:00:01 ┆ [3]       │
│ 2000-01-21 00:00:01 ┆ [4]       │
└─────────────────────┴───────────┘

Anyway, I think was most likely fixed by #55283, not by this PR

@spencerkclark
Copy link
Contributor

Ah this all makes sense now — I hadn't fully appreciated that #55283 also provided a bug fix and didn't think to question the original output. Thanks for that detailed explanation. Indeed #55283 is what I should be looking at and provides an illustration for how we should fix this in our implementation in xarray — I can confirm locally that the previously failing tests pass!

@@ -2729,6 +2743,9 @@ def asfreq(
if how is None:
how = "E"

if isinstance(freq, BaseOffset):
freq = freq_to_period_freqstr(freq.n, freq.name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think if a user tries to use a period-invalid freq with a PeriodIndex we would want to raise/warn here wouldn't we?

Copy link
Member

@MarcoGorelli MarcoGorelli Dec 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it already raises doesn't it?

In [1]: In [10]: df = pd.DataFrame({'ts': pd.period_range('2000', periods=3), 'val': [1,2,3]})
   ...:
   ...: In [11]: df.set_index('ts').resample('MS').mean()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
File period.pyx:2854, in pandas._libs.tslibs.period.freq_to_dtype_code()

AttributeError: 'pandas._libs.tslibs.offsets.MonthBegin' object has no attribute '_period_dtype_code'

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
Cell In[1], line 3
      1 df = pd.DataFrame({'ts': pd.period_range('2000', periods=3), 'val': [1,2,3]})
----> 3 df.set_index('ts').resample('MS').mean()

File ~/tmp/.venv/lib/python3.10/site-packages/pandas/core/generic.py:9788, in NDFrame.resample(self, rule, axis, closed, label, convention, kind, on, level, origin, offset, group_keys)
   9785 else:
   9786     kind = None
-> 9788 return get_resampler(
   9789     cast("Series | DataFrame", self),
   9790     freq=rule,
   9791     label=label,
   9792     closed=closed,
   9793     axis=axis,
   9794     kind=kind,
   9795     convention=convention,
   9796     key=on,
   9797     level=level,
   9798     origin=origin,
   9799     offset=offset,
   9800     group_keys=group_keys,
   9801 )

File ~/tmp/.venv/lib/python3.10/site-packages/pandas/core/resample.py:2038, in get_resampler(obj, kind, **kwds)
   2034 """
   2035 Create a TimeGrouper and return our resampler.
   2036 """
   2037 tg = TimeGrouper(obj, **kwds)  # type: ignore[arg-type]
-> 2038 return tg._get_resampler(obj, kind=kind)

File ~/tmp/.venv/lib/python3.10/site-packages/pandas/core/resample.py:2227, in TimeGrouper._get_resampler(self, obj, kind)
   2218     return DatetimeIndexResampler(
   2219         obj,
   2220         timegrouper=self,
   (...)
   2224         gpr_index=ax,
   2225     )
   2226 elif isinstance(ax, PeriodIndex) or kind == "period":
-> 2227     return PeriodIndexResampler(
   2228         obj,
   2229         timegrouper=self,
   2230         kind=kind,
   2231         axis=self.axis,
   2232         group_keys=self.group_keys,
   2233         gpr_index=ax,
   2234     )
   2235 elif isinstance(ax, TimedeltaIndex):
   2236     return TimedeltaIndexResampler(
   2237         obj,
   2238         timegrouper=self,
   (...)
   2241         gpr_index=ax,
   2242     )

File ~/tmp/.venv/lib/python3.10/site-packages/pandas/core/resample.py:185, in Resampler.__init__(self, obj, timegrouper, axis, kind, gpr_index, group_keys, selection, include_groups)
    180 self.include_groups = include_groups
    182 self.obj, self.ax, self._indexer = self._timegrouper._set_grouper(
    183     self._convert_obj(obj), sort=True, gpr_index=gpr_index
    184 )
--> 185 self.binner, self.grouper = self._get_binner()
    186 self._selection = selection
    187 if self._timegrouper.key is not None:

File ~/tmp/.venv/lib/python3.10/site-packages/pandas/core/resample.py:250, in Resampler._get_binner(self)
    244 @final
    245 def _get_binner(self):
    246     """
    247     Create the BinGrouper, assume that self.set_grouper(obj)
    248     has already been called.
    249     """
--> 250     binner, bins, binlabels = self._get_binner_for_time()
    251     assert len(bins) == len(binlabels)
    252     bin_grouper = BinGrouper(bins, binlabels, indexer=self._indexer)

File ~/tmp/.venv/lib/python3.10/site-packages/pandas/core/resample.py:1884, in PeriodIndexResampler._get_binner_for_time(self)
   1882 if self.kind == "timestamp":
   1883     return super()._get_binner_for_time()
-> 1884 return self._timegrouper._get_period_bins(self.ax)

File ~/tmp/.venv/lib/python3.10/site-packages/pandas/core/resample.py:2451, in TimeGrouper._get_period_bins(self, ax)
   2447     return binner, bins, labels
   2449 freq_mult = self.freq.n
-> 2451 start = ax.min().asfreq(self.freq, how=self.convention)
   2452 end = ax.max().asfreq(self.freq, how="end")
   2453 bin_shift = 0

File period.pyx:1945, in pandas._libs.tslibs.period._Period.asfreq()

File period.pyx:2856, in pandas._libs.tslibs.period.freq_to_dtype_code()

ValueError: Invalid frequency: <MonthBegin>


if isinstance(index, PeriodIndex):
orig_freq = to_offset(index.freq)
if freq != orig_freq:
assert orig_freq is not None # for mypy
raise ValueError(
f"Given freq {freq.rule_code} does not match "
f"PeriodIndex freq {orig_freq.rule_code}"
f"Given freq {freq_to_period_freqstr(freq.n, freq.name)} "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we avoid the freq_to_period_freqstr by using is_period=True in to_offset a few lines up?

}
cdef dict c_OFFSET_TO_PERIOD_FREQSTR = OFFSET_TO_PERIOD_FREQSTR

cpdef freq_to_period_freqstr(freq_n, freq_name):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be viable to get away from this entirely and use PeriodDtype(Base)._freqstr?

tacaswell added a commit to tacaswell/pandas that referenced this pull request May 29, 2024
The PRs pandas-dev#55792 (Y), pandas-dev#52064 (Q), and pandas-dev#55553 (M) deprecated the single letter
version of the aliases in favour of the -end version of them.

This table was missed as part of that work.
tacaswell added a commit to tacaswell/pandas that referenced this pull request May 29, 2024
The PRs pandas-dev#55792 (Y), pandas-dev#52064 (Q), and pandas-dev#55553 (M) deprecated the single letter
version of the aliases in favour of the -end version of them.

Add a note to the offset table about deprecations.
MarcoGorelli pushed a commit that referenced this pull request May 31, 2024
DOC: Add note about alias deprecations

The PRs #55792 (Y), #52064 (Q), and #55553 (M) deprecated the single letter
version of the aliases in favour of the -end version of them.

Add a note to the offset table about deprecations.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas Frequency DateOffsets
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants