Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: freq set to NONE when resampling pd.Multiindex (introduced in v1.1.0) #35563

Closed
2 of 3 tasks
andreas-vester opened this issue Aug 5, 2020 · 21 comments · Fixed by #38120
Closed
2 of 3 tasks

BUG: freq set to NONE when resampling pd.Multiindex (introduced in v1.1.0) #35563

andreas-vester opened this issue Aug 5, 2020 · 21 comments · Fixed by #38120
Labels
Frequency DateOffsets Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@andreas-vester
Copy link

andreas-vester commented Aug 5, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Update/Summary

When dealing with a pd.multiindex the frequency df.loc[i].index.freq has not been set. This behavior has been introduced in v1.1.0

idx = pd.Index(range(2), name="A")
dti = pd.date_range("2020-01-01", periods=7, freq="D", name="B")
mi = pd.MultiIndex.from_product([idx, dti])

df = pd.DataFrame(np.random.randn(14, 2), index=mi)

>>> df.loc[0].index

df.loc[0].index matches dti, so it would be nice to get the freq="D" back on that.

using code sample in #35563 (comment) points to #31315

c988567 is the first bad commit
commit c988567
Author: jbrockmendel jbrockmendel@gmail.com
Date: Sun Feb 9 07:01:20 2020 -0800

REF: tighten what we accept in TimedeltaIndex._simple_new (#31315)

Original post

Code Sample, a copy-pastable example

import numpy as np
import pandas as pd

idx_level_0 = np.repeat([1, 2], 5)
dates = np.tile(
    ["2020-01-01", "2020-01-02", "2020-01-04", "2020-01-06", "2020-01-07"], 2
)
values1 = [1, 2, 3, 4, 5]
values2 = [6, 7, 8, 9, 10]

df = pd.DataFrame(
    {"idx_level_0": idx_level_0, "dates": dates, "values": [*values1, *values2]}
)
df["dates"] = pd.to_datetime(df["dates"])
df = df.set_index(["idx_level_0", "dates"], drop=True)

df = df.groupby("idx_level_0").resample("D", level="dates").last()

# The following assertion is working properly in pandas v1.0.5
# It throws an error in pandas v1.1.0
assert df.index.get_level_values(1).freq == "D"

Problem description

When resampling a groupby-object, the frequency will incorrectly be set to None.

Expected Output

The frequency should be set according to the resampled frequency.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : d9fff27
python : 3.8.2.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : de_DE.cp1252
pandas : 1.1.0
numpy : 1.18.4
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 45.3.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.2.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@andreas-vester andreas-vester added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 5, 2020
@TomAugspurger
Copy link
Contributor

I think the output on pandas 1.1.0 (no freq) is correct. This can't have a freq since the values are repeated.

In [14]: df.index.get_level_values(1)
Out[14]:
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-01',
               '2020-01-02', '2020-01-03', '2020-01-04', '2020-01-05',
               '2020-01-06', '2020-01-07'],
              dtype='datetime64[ns]', name='dates', freq=None)

OTOH, df.index.levels[1].freq is correctly <Day>.

@TomAugspurger TomAugspurger added Needs Info Clarification about behavior needed to assess issue and removed Needs Triage Issue that has not been reviewed by a pandas team member Bug labels Aug 5, 2020
@andreas-vester
Copy link
Author

Well, you're correct in the sense that the values are repeated if you take a look at the whole secondary index. However, they are not repeated when taking into account the multiindex. The freq is D for every group with respect to the first index level.

Pandas v1.0.5 treated this differently (freq is D). I would prefer to see the previous behavior. This (intended) change in pandas breaks code on my existing projects.

@TomAugspurger
Copy link
Contributor

I think the previous behavior is incorrect, since the values don't represent a timeseries with daily frequency. (cc @jbrockmendel for a second opinion).

@andreas-vester
Copy link
Author

I disagree. Here's my real world example:

Consider index level 0 to be equity tickers, index level 1 are trading dates (days, weeks, months, etc.), and the values are closing prices. That is, you have multiple time series below each other. This is a tidy (long) dataframe. Applying the resample method to each groupby object should yield a specific frequency.

I am actually wondering on what grounds the different behavior has been introduced into v1.1.0?

@simonjayhawkins simonjayhawkins added this to the 1.1.1 milestone Aug 6, 2020
@TomAugspurger
Copy link
Contributor

@Kraxelhuber all your points correctly describe why .index.levels[1].freq is Day. But that doesn't apply to the index returned by .get_level_values() which doesn't have a well-defined frequency (not all the elements are equally spaced by some value).

I am actually wondering on what grounds the different behavior has been introduced into v1.1.0?

There were several PRs fixing cases where freq was incorrectly propagated (most or all by @jbrockmendel IIRC), e.g. #30511.

@jbrockmendel
Copy link
Member

@TomAugspurger is right about the distinction between get_level_values().freq and levels[1].freq.

That said, I think there is a behavior that can be improved:

idx = pd.Index(range(2), name="A")
dti = pd.date_range("2020-01-01", periods=7, freq="D", name="B")
mi = pd.MultiIndex.from_product([idx, dti])

df = pd.DataFrame(np.random.randn(14, 2), index=mi)

>>> df.loc[0].index

df.loc[0].index matches dti, so it would be nice to get the freq="D" back on that.

@simonjayhawkins simonjayhawkins added Frequency DateOffsets Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex and removed Needs Info Clarification about behavior needed to assess issue labels Aug 7, 2020
@simonjayhawkins
Copy link
Member

df.loc[0].index matches dti, so it would be nice to get the freq="D" back on that.

can confirm that was the behaviour in 1.0.5

>>> pd.__version__
'1.0.5'
>>>
>>> idx = pd.Index(range(2), name="A")
>>> dti = pd.date_range("2020-01-01", periods=7, freq="D", name="B")
>>> mi = pd.MultiIndex.from_product([idx, dti])
>>>
>>> df = pd.DataFrame(np.random.randn(14, 2), index=mi)
>>>
>>> df.loc[0].index
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07'],
              dtype='datetime64[ns]', name='B', freq='D')
>>>

@simonjayhawkins simonjayhawkins added the Regression Functionality that used to work in a prior pandas version label Aug 7, 2020
@andreas-vester
Copy link
Author

@Kraxelhuber all your points correctly describe why .index.levels[1].freq is Day. But that doesn't apply to the index returned by .get_level_values() which doesn't have a well-defined frequency (not all the elements are equally spaced by some value).

I again reflected on your reasoning regarding .get_level_values(). I agree with you.

Anyway, what I am really after is the fact that every group (in terms of a multiindex) can have its own frequency. Actually, these frequencies may be different or not existent for each individual group. I think this is related to the point made by @jbrockmendel

import numpy as np
import pandas as pd

idx_level_0 = np.repeat([1, 2], 5)
dates = [
    "2020-01-01",
    "2020-01-02",
    "2020-01-03",
    "2020-01-04",
    "2020-01-05",
    "2020-03-01",
    "2020-03-08",
    "2020-03-15",
    "2020-03-22",
    "2020-03-29",
]
values1 = [1, 2, 3, 4, 5]
values2 = [6, 7, 8, 9, 10]

df = pd.DataFrame(
    {"idx_level_0": idx_level_0, "dates": dates, "values": [*values1, *values2]}
)
df["dates"] = pd.to_datetime(df["dates"])
df = df.set_index(["idx_level_0", "dates"], drop=True)

# df.loc[1].index.freq should yield "D"
# df.loc[1].index.freq should yield "W"

@TomAugspurger
Copy link
Contributor

Great, I think we're all agreed on the point that df.loc[i].index.freq should be set, and is a regression. @Kraxelhuber can you edit the original post to reflect that, and that it's a regression from 1.0.5?

@simonjayhawkins
Copy link
Member

using code sample in #35563 (comment) points to #31315

c988567 is the first bad commit
commit c988567
Author: jbrockmendel jbrockmendel@gmail.com
Date: Sun Feb 9 07:01:20 2020 -0800

REF: tighten what we accept in TimedeltaIndex._simple_new (#31315)

@simonjayhawkins simonjayhawkins modified the milestones: 1.1.1, 1.1.2 Aug 20, 2020
@simonjayhawkins simonjayhawkins modified the milestones: 1.1.2, 1.1.3 Sep 7, 2020
@simonjayhawkins
Copy link
Member

moved off 1.1.2 milestone (scheduled for this week) as no PRs to fix in the pipeline

@simonjayhawkins simonjayhawkins modified the milestones: 1.1.3, 1.1.4 Oct 5, 2020
@simonjayhawkins
Copy link
Member

moved off 1.1.3 milestone (overdue) as no PRs to fix in the pipeline

@simonjayhawkins simonjayhawkins modified the milestones: 1.1.4, 1.1.5 Oct 29, 2020
@simonjayhawkins
Copy link
Member

moved off 1.1.4 milestone (scheduled for release tomorrow) as no PRs to fix in the pipeline

@simonjayhawkins
Copy link
Member

it appears that the frequency is lost is in the multiindex construction...

>>> import pandas as pd
>>> pd.__version__
'1.2.0.dev0+1320.g28634289c8'
>>> idx = pd.Index(range(2), name="A")
>>> dti = pd.date_range("2020-01-01", periods=7, freq="D", name="B")
>>> mi = pd.MultiIndex.from_product([idx, dti])
>>> mi.levels[1]
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07'],
              dtype='datetime64[ns]', name='B', freq=None)
>>> import pandas as pd
>>> pd.__version__
'1.0.5'
>>> idx = pd.Index(range(2), name="A")
>>> dti = pd.date_range("2020-01-01", periods=7, freq="D", name="B")
>>> mi = pd.MultiIndex.from_product([idx, dti])
>>> mi.levels[1]
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07'],
              dtype='datetime64[ns]', name='B', freq='D')

it is actually lost in the _shallow_copy where changes were made in #31315

the values passed come from factorization.

since there have been several refactors to _shallow_copy/_simple_new since, i'm not sure that we want to (or can easily) revert those changes

so it may be easier to ensure that the freq is retained in the factorisation step. @jbrockmendel thoughts?

> c:\users\simon\pandas\pandas\core\indexes\extension.py(221)_shallow_copy()
-> name = self.name if name is lib.no_default else name
(Pdb) a
self = DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07'],
              dtype='datetime64[ns]', name='B', freq='D')
values = <DatetimeArray>
['2020-01-01 00:00:00', '2020-01-02 00:00:00', '2020-01-03 00:00:00',
 '2020-01-04 00:00:00', '2020-01-05 00:00:00', '2020-01-06 00:00:00',
 '2020-01-07 00:00:00']
Length: 7, dtype: datetime64[ns]
name = None
(Pdb) values.freq
(Pdb)

@TomAugspurger
Copy link
Contributor

Retaining the freq through factorize makes sense to me... IIUC, if the input index had a freq, the output index will have an identical freq, right? Since for the input to have a freq it must be unique and monotonic.

@jbrockmendel
Copy link
Member

That's the crux of #33830. I think retaining freq makes sense.

@simonjayhawkins
Copy link
Member

@jbrockmendel do you have time to look into this? I won't be investigating further today.

@jbrockmendel
Copy link
Member

strong maybe

@phofl
Copy link
Member

phofl commented Nov 28, 2020

@jbrockmendel I have a question losely related to the ops question:

dti = date_range(start="1/1/2005", end="12/1/2005", freq="M")
dti2 = dti[[1, 3, 5]]
dti3 = dti[slice(1, 5, 2)]

The first one does not retain a freq while the second one does. This seems odd to me, because both are doing exactly the same from a user facing perspective. If we get a listlike indexer, which is essentially equal to a slice, should we retain the freq then? Could implement this relatively easy in the part where we are inferring the freq for slice like indexers

@jbrockmendel
Copy link
Member

If we get a listlike indexer, which is essentially equal to a slice

The difference is that when we do dti[some_slice] we get a view on the underlying data, whereas dti[listlike] always makes a copy. Not necessarily a major user-facing difference, but it is non-trivially more performant.

You're right it wouldn't be that hard to do a lib.maybe_indices_to_slice call in DatetimeLikeArrayMixin._get_getitem_freq, but the downside is it would hurt performance. I don't care that much, so if you feel strongly about it, go ahead and open an issue.

@phofl
Copy link
Member

phofl commented Nov 29, 2020

Thx for the explanation. Do not feel strongly about it, just stumbled across when working on #27180 and wondered about the different behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Frequency DateOffsets Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants