frequencies: Number of pivot points differs for same date range when both start/end date are inclusive #963

victorlin · 2022-06-01T18:34:04Z

Current Behavior

Setup:

from augur.frequency_estimators import get_pivots
from augur.dates import numeric_date

Getting monthly pivots between 2022-01-02 and 2022-03-02 gives 2 pivots:

start_date = numeric_date('2022-01-02')
end_date = numeric_date('2022-03-02')
get_pivots(observations=[], start_date=start_date, end_date=end_date, pivot_interval=1, pivot_interval_units="months")
# out: array([2022.0833, 2022.1667])
# == [2022-02-01, 2022-03-01]

However, shifting back one day to 2022-01-01 and 2022-03-01 gives 3 pivots:

start_date = numeric_date('2022-01-01')
end_date = numeric_date('2022-03-01')
get_pivots(observations=[], start_date=start_date, end_date=end_date, pivot_interval=1, pivot_interval_units="months")
# out: array([2022.    , 2022.0833, 2022.1667])
# == [2022-01-01, 2022-02-01, 2022-03-01]

Expected behavior

The same date range should produce the same number of pivots (i.e. 2022-01-01 and 2022-03-01 should still give 2 pivots).

Explanation

This is due to 2 behaviors in current usage of pd.date_range():

The pivot points are fixed at the start of every month:

augur/augur/frequency_estimators.py

Lines 56 to 57 in f68c448

if pivot_interval_units == "months":

offset = "%sMS" % pivot_interval
The default parameter of closed=None to pd.date_range() means the interval is inclusive of both start_date and end_date.

It seems like this behavior is intentional, given the test here:

augur/tests/test_frequencies.py

Lines 85 to 93 in f68c448

    
           def test_get_pivots_by_months(): 
        
               """Get pivots where intervals are defined by months. 
        
               """ 
        
               pivots = get_pivots(observations=[], pivot_interval=1, start_date=2015.0, end_date=2016.0, pivot_interval_units="months") 
        
               # Pivots should include all 12 months of the year plus the month represented 
        
               # by the end date, since the pandas month interval uses "month starts". See 
        
               # pandas date offsets documentation for more details: 
        
               # https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases 
        
               assert len(pivots) == 13

Possible solution

In the call to pd.date_range()

augur/augur/frequency_estimators.py

Lines 63 to 66 in f68c448

    
           datetime_pivots = pd.date_range( 
        
               float_to_datestring(pivot_start), 
        
               float_to_datestring(pivot_end), 
        
               freq = offset

specify either closed='left' or closed='right' so the range is inclusive on only one side. However, I'm not sure which is more meaningful for the application here. It also breaks a few existing tests, see failing tests on victorlin@1487790.

@huddlej what do you think of this approach?

Your environment: if running Nextstrain locally

Version (e.g. auspice 2.7.0): Augur 15.0.2

The text was updated successfully, but these errors were encountered:

Updates the specific values of expected frequencies for functional tests, since the corrected pivot calculation logic produces new pivots and correspondingly new frequency values per pivot. One nice side effect of the new pivot calculations is that they produce the correct number of pivots in our functional test of relative min and max date values that had previously failed in a flaky manner. The corrected pivot logic reveals that the "flaky" behavior of that test was actually the correct behavior: we should always get 3 pivots when the max date is 6 months ago, the min date is 12 months ago, and the pivot interval is 3 months. That we previously only got 2 pivots from these relative dates most times of the year reflects the invalid pivot calculation logic from a different perspective. This commit reactivates that previously commented-out test. Fixes #963

victorlin · 2023-01-31T22:45:09Z

@huddlej a similar problem is causing CI to fail today, this time being the last day of the month. A small snippet to repro:

from augur.frequency_estimators import get_pivots
from augur.dates import numeric_date
from treetime.utils import datestring_from_numeric

start_date = numeric_date('2022-01-01')
end_date = numeric_date('2022-03-01')
pivots = get_pivots(observations=[], start_date=start_date, end_date=end_date, pivot_interval=1, pivot_interval_units="months")
[datestring_from_numeric(date) for date in pivots]
# ['2022-01-01', '2022-02-01', '2022-03-01']

start_date = numeric_date('2022-01-31')
end_date = numeric_date('2022-03-31')
pivots = get_pivots(observations=[], start_date=start_date, end_date=end_date, pivot_interval=1, pivot_interval_units="months")
[datestring_from_numeric(date) for date in pivots]
# ['2022-02-28', '2022-03-31']

victorlin · 2023-02-01T23:41:59Z

Update: we discussed this on Slack and it turns out this is a side effect of the way the duration is applied on incrementally updated pivot points.

augur/augur/frequency_estimators.py

Lines 79 to 82 in e12dbe1

    
           pivot = end 
        
           while pivot >= start: 
        
               pivots.appendleft(pivot) 
        
               pivot = pivot - delta

The proper solution is to determine pivots based on the original end instead of intermediate pivot points. This should be easy with the multiplier operator on isodate.Duration.

victorlin · 2023-03-31T17:48:48Z

Well something related cropped up again on today, March 31st. Converted the failing test into an example here:

from augur.frequency_estimators import get_pivots
from augur.dates import numeric_date
from treetime.utils import datestring_from_numeric

start_date = numeric_date('2022-03-30')
end_date = numeric_date('2022-09-30')
pivots = get_pivots(observations=[], start_date=start_date, end_date=end_date, pivot_interval=3, pivot_interval_units="months")
[datestring_from_numeric(date) for date in pivots]
# ['2022-03-30', '2022-06-30', '2022-09-30']

start_date = numeric_date('2022-03-31')
end_date = numeric_date('2022-09-30')
pivots = get_pivots(observations=[], start_date=start_date, end_date=end_date, pivot_interval=3, pivot_interval_units="months")
[datestring_from_numeric(date) for date in pivots]
# ['2022-06-30', '2022-09-30']

The reason is that subtracting monthly units from 2022-09-30 won't produce any dates with the 31 as the day part.

I think this is reasonable behavior. Given the purpose of the failing test:

this test checks the logical consistency of the requested relative dates and pivot interval

and that we can't expect logical consistency, I'd propose to remove this test since relative dates and pivot intervals are already covered by pytests (1, 2).

victorlin added the bug Something isn't working label Jun 1, 2022

victorlin mentioned this issue Jun 1, 2022

Temporarily disable failing test #962

Merged

victorlin assigned huddlej Jun 8, 2022

huddlej mentioned this issue Jan 8, 2023

frequencies: Fix pivot logic to always include the end date #1121

Merged

4 tasks

huddlej closed this as completed in #1121 Jan 19, 2023

huddlej closed this as completed in beca8c4 Jan 19, 2023

victorlin reopened this Jan 31, 2023

victorlin mentioned this issue Feb 13, 2023

frequencies: Calculate all pivot points based on end pivot #1150

Merged

3 tasks

victorlin closed this as completed in #1150 Mar 8, 2023

victorlin reopened this Mar 31, 2023

victorlin mentioned this issue Mar 31, 2023

frequencies: Remove test that fails on a few days each year #1194

Merged

2 tasks

victorlin closed this as completed Mar 31, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

frequencies: Number of pivot points differs for same date range when both start/end date are inclusive #963

frequencies: Number of pivot points differs for same date range when both start/end date are inclusive #963

victorlin commented Jun 1, 2022

victorlin commented Jan 31, 2023

victorlin commented Feb 1, 2023

victorlin commented Mar 31, 2023 •

edited

Loading

frequencies: Number of pivot points differs for same date range when both start/end date are inclusive #963

frequencies: Number of pivot points differs for same date range when both start/end date are inclusive #963

Comments

victorlin commented Jun 1, 2022

Current Behavior

Expected behavior

Explanation

Possible solution

Your environment: if running Nextstrain locally

victorlin commented Jan 31, 2023

victorlin commented Feb 1, 2023

victorlin commented Mar 31, 2023 • edited Loading

victorlin commented Mar 31, 2023 •

edited

Loading