Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] Add support for monthly and yearly data in TimeGapSplit #190

Open
rubenvdg opened this issue Sep 19, 2019 · 9 comments
Open

[FEATURE] Add support for monthly and yearly data in TimeGapSplit #190

rubenvdg opened this issue Sep 19, 2019 · 9 comments
Labels
enhancement New feature or request

Comments

@rubenvdg
Copy link

First of all: TimeGapSplit is a super useful feature!

Unfortunately, if I'm not mistaken, TimeGapSplit currently does not support monthly or yearly data. This follows from the design choice to use timedelta to construct the train and validation sets (timedelta does not support months or years). For example:

df = (
    pd.DataFrame(
        data=np.random.randint(0, 30, size=(30, 4)),
        columns=list('ABCy')
    )
    .assign(
        date=pd.date_range(start='1/1/2018', end='1/30/2018')[::-1]
    )
)

cv = TimeGapSplit(
    df=df,
    date_col='date',
    train_duration=timedelta(months=1),
    valid_duration=timedelta(months=1),
)

raises TypeError: 'months' is an invalid keyword argument for __new__().
This could maybe be fixed by using pd.DateOffset over timedelta.

Tomorrow, I probably have time to look into this, so any guidance or feedback would be very much appreciated @kayhoogland @stephanecollot.

@rubenvdg rubenvdg added the enhancement New feature or request label Sep 19, 2019
@stephanecollot
Copy link
Contributor

stephanecollot commented Sep 19, 2019 via email

@pim-hoeven
Copy link
Contributor

@koaning I think I have something similarly laying around somewhere (quarters in that case), I can add it here if it's still relevant?

@koaning
Copy link
Owner

koaning commented Aug 12, 2020

I'd say so! If you add quarters though, feel free to also add months/years. I think pandas also has a special syntax for timeslots (that thing they use in .resample() that might be worth to check out.

@pim-hoeven
Copy link
Contributor

Alright! The current version does something like quarters = pd.PeriodIndex(dates, freq="Q"), so easily extensible to other frequencies.

In my version I decided to opt for passing the dates to .split instead of __init__, because it feels a bit obsolete to store the dates on the class. I'll first try to fix something that passes the current tests, but if you agree I'll deprecate the date_series in the __init__.

@koaning
Copy link
Owner

koaning commented Aug 12, 2020

I recall there being a good reason for the current design on that front. I also think this is a feature folks tend to use a lot so I wouldn't want to deprecate it without giving it a good pause and consideration.

Could you describe the api you've got in mind here? Using python-pseudo-code?

@pim-hoeven
Copy link
Contributor

Probably because that way it's easier to use in GridSearch?

I think the biggest difference is that the current approach can do:

GridSearchCV(
    ...
    cv=TimeGapSplit(dates, ...)
)

whereas my approach would be called like this:

GridSearchCV(
    ...
    cv=TimeGapSplit(...).split(dates)
)

where

class TimeCapSplit:
    def __init__(self, freq, periods_train, periods_gap, periods_validate):
        ...

    def split(dates):
        ...

In that sense the first approach is more similar to the sklearn API (where the split method of cross validator classes always has X, y=None, groups=None as arguments.

I would understand if you want to keep it as close to the sklearn API as possible, so let me know!

@pim-hoeven
Copy link
Contributor

On another note: I noticed that the current version only supports dataframes as input, so I'll fix that as well

@stephanecollot
Copy link
Contributor

In that sense the first approach is more similar to the sklearn API

Yes it is better to be able to give a CV splitter class. Plus there is get_n_splits().

@pim-hoeven
Copy link
Contributor

Dates in init vs split
Alright, I understand the consideration. It still feels a bit weird for a number of reasons, e.g. supplying X to split is a bit redundant (all information we need is in the dates, hypothetically we could even define the splits in the init) and we would still have a potentially very large object stored on the class. I agree with keeping it close to the sklearn API, but it might be that the sklearn API is a bit to strict for us in this case.

Another solution might be to force X to have date column (for which the name can be supplied in the init), that way we would still adhere to the sklearn API. However, then we would need to drop the date column in the pipeline, which would also be a bit messy.

Questions about design
@stephanecollot
Maybe you can help me understand these design decisions. First two are general interest :)

  1. What is the rationale for shifting the folds further in time is n_splits is specified?
# if the n_splits is smaller than what would usually be done for train val and gap duration,
# the next fold is slightly further in time than just valid_duration
if self.n_splits is not None:
    time_shift = self.valid_duration * n_split_max / self.n_splits
else:
    time_shift = self.valid_duration
  1. Is there a specific reason to start at the beginning instead of the end? If the splits don't exactly cover all dates, then the last observations are not used (instead of the first)
  2. I haven't checked thoroughly, but I think that that all splits are currently made relative to the time of the first observation. I.e. if the time of the first date is 13.00, all splits will start at 13.00. I'm not sure whether users expect this behavior, especially in context of months, quarters or years (but I'm biased because I implemented it the other way, where observations on the same day are always in the same split).
  3. In one of the tests, the assertion is that the last train date is greater or equal to the first validation date. I would argue that a strict inequality makes more sense.

Maybe a new splitter is better?
@koaning maybe a better approach to this issue (especially given 3 and 4) is to have a different splitter (DateTimeSplit for example) where

  • The user can specify a frequency (e.g. years, months, days, similar to the pandas API)
  • The user can specify a train, gap and validation size in ints
  • Splits are made on 'frequency-value', i.e. if user specifies months, than month-end / begin is the place where the split is made.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants