Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

frequencies: Number of pivot points differs for same date range when both start/end date are inclusive #963

Closed
victorlin opened this issue Jun 1, 2022 · 3 comments · Fixed by #1121 or #1150
Assignees
Labels
bug Something isn't working

Comments

@victorlin
Copy link
Member

Current Behavior

Setup:

from augur.frequency_estimators import get_pivots
from augur.dates import numeric_date

Getting monthly pivots between 2022-01-02 and 2022-03-02 gives 2 pivots:

start_date = numeric_date('2022-01-02')
end_date = numeric_date('2022-03-02')
get_pivots(observations=[], start_date=start_date, end_date=end_date, pivot_interval=1, pivot_interval_units="months")
# out: array([2022.0833, 2022.1667])
# == [2022-02-01, 2022-03-01]

However, shifting back one day to 2022-01-01 and 2022-03-01 gives 3 pivots:

start_date = numeric_date('2022-01-01')
end_date = numeric_date('2022-03-01')
get_pivots(observations=[], start_date=start_date, end_date=end_date, pivot_interval=1, pivot_interval_units="months")
# out: array([2022.    , 2022.0833, 2022.1667])
# == [2022-01-01, 2022-02-01, 2022-03-01]

Expected behavior

The same date range should produce the same number of pivots (i.e. 2022-01-01 and 2022-03-01 should still give 2 pivots).

Explanation

This is due to 2 behaviors in current usage of pd.date_range():

  1. The pivot points are fixed at the start of every month:

    if pivot_interval_units == "months":
    offset = "%sMS" % pivot_interval

  2. The default parameter of closed=None to pd.date_range() means the interval is inclusive of both start_date and end_date.

It seems like this behavior is intentional, given the test here:

def test_get_pivots_by_months():
"""Get pivots where intervals are defined by months.
"""
pivots = get_pivots(observations=[], pivot_interval=1, start_date=2015.0, end_date=2016.0, pivot_interval_units="months")
# Pivots should include all 12 months of the year plus the month represented
# by the end date, since the pandas month interval uses "month starts". See
# pandas date offsets documentation for more details:
# https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
assert len(pivots) == 13

Possible solution

In the call to pd.date_range()

datetime_pivots = pd.date_range(
float_to_datestring(pivot_start),
float_to_datestring(pivot_end),
freq = offset

specify either closed='left' or closed='right' so the range is inclusive on only one side. However, I'm not sure which is more meaningful for the application here. It also breaks a few existing tests, see failing tests on victorlin@1487790.

@huddlej what do you think of this approach?

Your environment: if running Nextstrain locally

  • Version (e.g. auspice 2.7.0): Augur 15.0.2
@victorlin victorlin added the bug Something isn't working label Jun 1, 2022
huddlej added a commit that referenced this issue Jan 8, 2023
Updates the specific values of expected frequencies for functional
tests, since the corrected pivot calculation logic produces new pivots
and correspondingly new frequency values per pivot. One nice side effect
of the new pivot calculations is that they produce the correct number of
pivots in our functional test of relative min and max date values that
had previously failed in a flaky manner. The corrected pivot logic
reveals that the "flaky" behavior of that test was actually the correct
behavior: we should always get 3 pivots when the max date is 6 months
ago, the min date is 12 months ago, and the pivot interval is 3 months.
That we previously only got 2 pivots from these relative dates most
times of the year reflects the invalid pivot calculation logic from a
different perspective. This commit reactivates that previously
commented-out test.

Fixes #963
huddlej added a commit that referenced this issue Jan 9, 2023
Updates the specific values of expected frequencies for functional
tests, since the corrected pivot calculation logic produces new pivots
and correspondingly new frequency values per pivot. One nice side effect
of the new pivot calculations is that they produce the correct number of
pivots in our functional test of relative min and max date values that
had previously failed in a flaky manner. The corrected pivot logic
reveals that the "flaky" behavior of that test was actually the correct
behavior: we should always get 3 pivots when the max date is 6 months
ago, the min date is 12 months ago, and the pivot interval is 3 months.
That we previously only got 2 pivots from these relative dates most
times of the year reflects the invalid pivot calculation logic from a
different perspective. This commit reactivates that previously
commented-out test.

Fixes #963
@victorlin
Copy link
Member Author

@huddlej a similar problem is causing CI to fail today, this time being the last day of the month. A small snippet to repro:

from augur.frequency_estimators import get_pivots
from augur.dates import numeric_date
from treetime.utils import datestring_from_numeric

start_date = numeric_date('2022-01-01')
end_date = numeric_date('2022-03-01')
pivots = get_pivots(observations=[], start_date=start_date, end_date=end_date, pivot_interval=1, pivot_interval_units="months")
[datestring_from_numeric(date) for date in pivots]
# ['2022-01-01', '2022-02-01', '2022-03-01']

start_date = numeric_date('2022-01-31')
end_date = numeric_date('2022-03-31')
pivots = get_pivots(observations=[], start_date=start_date, end_date=end_date, pivot_interval=1, pivot_interval_units="months")
[datestring_from_numeric(date) for date in pivots]
# ['2022-02-28', '2022-03-31']

@victorlin victorlin reopened this Jan 31, 2023
@victorlin
Copy link
Member Author

Update: we discussed this on Slack and it turns out this is a side effect of the way the duration is applied on incrementally updated pivot points.

pivot = end
while pivot >= start:
pivots.appendleft(pivot)
pivot = pivot - delta

The proper solution is to determine pivots based on the original end instead of intermediate pivot points. This should be easy with the multiplier operator on isodate.Duration.

@victorlin
Copy link
Member Author

victorlin commented Mar 31, 2023

Well something related cropped up again on today, March 31st. Converted the failing test into an example here:

from augur.frequency_estimators import get_pivots
from augur.dates import numeric_date
from treetime.utils import datestring_from_numeric

start_date = numeric_date('2022-03-30')
end_date = numeric_date('2022-09-30')
pivots = get_pivots(observations=[], start_date=start_date, end_date=end_date, pivot_interval=3, pivot_interval_units="months")
[datestring_from_numeric(date) for date in pivots]
# ['2022-03-30', '2022-06-30', '2022-09-30']

start_date = numeric_date('2022-03-31')
end_date = numeric_date('2022-09-30')
pivots = get_pivots(observations=[], start_date=start_date, end_date=end_date, pivot_interval=3, pivot_interval_units="months")
[datestring_from_numeric(date) for date in pivots]
# ['2022-06-30', '2022-09-30']

The reason is that subtracting monthly units from 2022-09-30 won't produce any dates with the 31 as the day part.


I think this is reasonable behavior. Given the purpose of the failing test:

this test checks the logical consistency of the requested relative dates and pivot interval

and that we can't expect logical consistency, I'd propose to remove this test since relative dates and pivot intervals are already covered by pytests (1, 2).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
No open projects
2 participants