
frequencies: Fix pivot logic to always include the end date #1121

Merged: 7 commits, Jan 19, 2023
5 changes: 5 additions & 0 deletions CHANGES.md
@@ -2,6 +2,11 @@

## __NEXT__

### Major Changes

* frequencies: Changes the logic for calculating the time points when frequencies are estimated to ensure that the user-provided "end date" is always included. This change in the behavior of the frequencies command fixes a bug where large intervals between time points (e.g., 3 months) could cause recent data to be omitted from frequency calculations. See the pull request for more details, including the scientific implications of this bug. [#1121][] (@huddlej)

[#1121]: https://github.com/nextstrain/augur/pull/1121

## 19.3.0 (19 January 2023)

36 changes: 26 additions & 10 deletions augur/frequency_estimators.py
@@ -1,14 +1,17 @@
# estimates clade frequencies
from __future__ import division, print_function
from collections import defaultdict
from collections import defaultdict, deque
import datetime
import isodate
import numpy as np
import pandas as pd
from scipy.interpolate import interp1d
from scipy.stats import norm
import sys
import time

from .dates import numeric_date

debug = False
log_thres = 10.0

@@ -24,7 +27,9 @@ def get_pivots(observations, pivot_interval, start_date=None, end_date=None, piv
interval between pivots.

Start and end pivots will be based on the range of given observed dates,
unless a start or end date are provided to override these defaults.
unless a start or end date are provided to override these defaults. Pivots
include the end date and, where possible for the given interval, the start
date.

Parameters
----------
@@ -54,18 +59,29 @@ def get_pivots(observations, pivot_interval, start_date=None, end_date=None, piv
pivot_end = end_date if end_date else np.ceil(np.max(observations) / pivot_frequency) * pivot_frequency

if pivot_interval_units == "months":
offset = "%sMS" % pivot_interval
duration_str = f'P{pivot_interval}M'
elif pivot_interval_units == "weeks":
offset = "%sW" % pivot_interval
duration_str = f'P{pivot_interval}W'
else:
raise ValueError(f"The given interval unit '{pivot_interval_units}' is not supported.")

datetime_pivots = pd.date_range(
float_to_datestring(pivot_start),
float_to_datestring(pivot_end),
freq = offset
)
pivots = np.array([timestamp_to_float(pivot) for pivot in datetime_pivots])
# Construct standard Python date instances from numeric dates via ISO-8601
# dates and the corresponding delta time for the interval between pivots.
start = datetime.datetime.strptime(float_to_datestring(pivot_start), "%Y-%m-%d")
end = datetime.datetime.strptime(float_to_datestring(pivot_end), "%Y-%m-%d")
Member:

Above, pivot_end is defined as a floating point (not changed in this PR), but doesn't this defeat the entire purpose of doing month/week increments to maintain human-readable dates? And if the last data point is on Jan 2nd and the interval is 3 months, the pivot_end is 3 months into the year.

What if we instead did something like this:

last_data_point = pd.Timestamp(datetime_from_numeric(end_date or np.max(observations))).ceil(freq='D')

if pivot_interval_units == 'months':
    while not last_data_point.is_month_end:
        last_data_point += pd.Timedelta(days=1)
elif pivot_interval_units == 'weeks':
    # Advance to the next Sunday (weekday 6).
    while last_data_point.weekday() != 6:
        last_data_point += pd.Timedelta(days=1)

pivot_end = last_data_point

Member:

This would put the end point to the next full month or week after the last data point.

Contributor Author @huddlej, Jan 11, 2023:

Above, pivot_end is defined as a floating point (not changed in this PR), but doesn't this defeat the entire purpose of doing month/week increments to maintain human-readable dates?

The floating point representation of pivot_start and pivot_end is just an intermediate state as far as date calculations are concerned. The user can provide a human-readable (ISO-8601) date from the command line, it gets turned into a floating point value for historical reasons, but then it gets converted back to ISO-8601 before performing date math. pivot_end and pivot_start could be defined as human-readable (ISO-8601) dates from the start and only lead to floating point values at the end of the function. The floating point values returned by get_pivots can be converted back to the expected ISO-8601 dates with float_to_datestring. The benefit of the ISO-8601 representation is that we can perform standard date math like:

from datetime import date
from dateutil.relativedelta import relativedelta
from augur.frequency_estimators import get_pivots, float_to_datestring
from augur.dates import numeric_date

# Find one month prior to March 6, 2023 with standard date logic.
# This should be February 6, 2023.
>>> date(2023, 3, 6) - relativedelta(months=1)
datetime.date(2023, 2, 6)

# Find one month prior to March 6, 2023 with floating point logic.
# What is "one month" in floating point values? If it is 1 / 12, we get the following result.
>>> float_to_datestring(numeric_date("2023-03-06") - (1 / 12))
'2023-02-04'

This is obviously only a problem for months since they don't have the same number of days. Week intervals are consistent between standard date math and floating point math.

And if the last data point is on Jan 2nd and the interval is 3 months, the pivot_end is 3 months into the year.

Ah, that's right and not great. With a single observation from that date, you'd get:

>>> [float_to_datestring(pivot) for pivot in get_pivots([numeric_date("2023-01-02")], 3)]
['2023-01-02', '2023-04-02']

We usually run augur frequencies with an end date of the current day, though. In the current PR, if you had a single observation from Jan 2 and an end date of today, you'd get:

>>> [float_to_datestring(pivot) for pivot in get_pivots([numeric_date("2023-01-02")], 3, None, numeric_date("2023-01-11"))]
['2023-01-11']

But in the current implementation in master, you'd get:

>>> [float_to_datestring(pivot) for pivot in get_pivots([numeric_date("2023-01-02")], 3, None, numeric_date("2023-01-11"))]
['2023-01-01']

and throw out your single observation. For a monthly interval, your code example above would produce:

>>> import datetime
>>> import numpy as np
>>> import pandas as pd
>>> from treetime.utils import datetime_from_numeric
>>> from augur.dates import numeric_date
>>> end_date = None
>>> observations = [numeric_date("2023-01-02")]
>>> last_data_point = pd.Timestamp(datetime_from_numeric(end_date or np.max(observations))).ceil(freq='D')
>>> while not last_data_point.is_month_end:
...     last_data_point += pd.Timedelta(days=1)
>>> last_data_point
Timestamp('2023-01-31 00:00:00')

Since the pivots slice through a specific time in the KDE distributions instead of summing over a time interval, is there a benefit to moving the last pivot any later than we have data instead of using the latest observation date as the end date?

Contributor Author:

Also, I didn't realize TreeTime already had datetime_from_numeric and datestring_from_numeric. We could just use the latter instead of Augur's float_to_datestring.

Member:

Thanks, John, and you are right: the end_date solves this problem in most cases. But it also means that our monthly pivots are not aligned with the beginning or end of a month. If that's OK, we might just do month=30 days and week=7 days.

Member:

Pivots for KDE frequencies should never extend more than the narrow width beyond the last time window where there is more or less representative data. They would be too heavily influenced by fluctuations in that case. This is less of an issue for the diffusion frequencies.

Contributor Author:

But it also means that our monthly pivots are not aligned with the beginning or end of a month.

That alignment to the beginning of a month was the original goal of the pandas-based implementation, but it is a pretty arbitrary goal. Even representing intervals by month feels arbitrary when "month" doesn't have a fixed value. The simpler approach of using weeks is appealing without needing to cast "month" to a fixed number of days. I would be happy to switch flu pivots to every 4 weeks, for instance.

Pivots for KDE frequencies should never extend more than the narrow width beyond the last time window where there is more or less representative data.

This also makes a lot of sense. Estimating KDE frequencies at a date beyond the most recent data is an error in the opposite direction of the original bug addressed in this PR. By that logic, we should change the following logic from:

pivot_start = start_date if start_date else np.floor(np.min(observations) / pivot_frequency) * pivot_frequency
pivot_end = end_date if end_date else np.ceil(np.max(observations) / pivot_frequency) * pivot_frequency

to always use a min/max date based on the earliest and latest sample, instead of expanding the range of pivots:

pivot_start = start_date if start_date else np.min(observations)
pivot_end = end_date if end_date else np.max(observations)

Our tree can't plot any dates beyond the most recent tip, so it makes sense that frequencies shouldn't extend beyond the most recent tip either. The code I propose above doesn't account for our most common use case, where we do provide an "end date" that is usually today even though the most recent data were not collected today. Providing an "end date" only makes sense in the context of retrospective work like the flu forecasting project, where we want to estimate frequencies in well-defined windows across existing data.

Providing an earlier "start date" makes sense, though, since it allows the pivots to include all available data. The code above could produce pivots that don't include the earliest observation, depending on the pivot interval and end date. So, we probably want something like the following, which keeps a start date earlier than the minimum observation but uses the maximum observation as the end date.

pivot_start = start_date if start_date else np.floor(np.min(observations) / pivot_frequency) * pivot_frequency
pivot_end = end_date if end_date else np.max(observations)

The existing changes in this PR will ensure that the end date is always the last pivot.
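The difference between the two defaults discussed above can be sketched numerically. This is a minimal illustration, assuming (as in the snippets above) that numeric dates are fractions of a year and that `pivot_frequency` for a 3-month interval is `3 / 12`:

```python
import numpy as np

# Assumption: a 3-month interval expressed as a fraction of a year.
pivot_frequency = 3 / 12
observations = np.array([2023.003])  # roughly Jan 2, 2023 as a numeric date

# Old default: round the latest observation outward to a whole interval,
# which can push the last pivot months past the data.
old_end = np.ceil(observations.max() / pivot_frequency) * pivot_frequency

# Proposed default: stop at the latest observation itself.
new_end = observations.max()

print(old_end)  # 2023.25, i.e. three months past the single observation
print(new_end)  # 2023.003
```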

Contributor Author:

There is more work we could do in this area, but I think we should leave treatment of observed data (without start and end dates) to another PR. This PR fixes a bug that immediately affects our flu work, so I'd like to merge and release it soon. I documented this conversation as a new issue.

delta = isodate.parse_duration(duration_str)

# Start calculating pivots from the end date (inclusive), working backwards
# in time by a time delta that matches the user-requested interval. Include
# the start date in the final pivots when the interval spacing allows that
# date to be included.
pivots = deque([])
pivot = end
while pivot >= start:
pivots.appendleft(pivot)
pivot = pivot - delta

pivots = np.array([numeric_date(pivot) for pivot in pivots])

return np.around(pivots, 4)
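The backwards-iteration approach in the diff above can be sketched in isolation. The real implementation parses ISO-8601 durations with `isodate` so that month intervals use calendar math; this standalone sketch uses only the standard library and weekly intervals to show the key property, namely that the end date is always the last pivot:

```python
from collections import deque
from datetime import date, timedelta

def weekly_pivots(start, end, interval_weeks):
    """Return pivot dates every `interval_weeks` weeks, working backwards from `end`."""
    # Walking backwards guarantees the end date is always the last pivot;
    # the start date appears only when the interval divides the range evenly.
    delta = timedelta(weeks=interval_weeks)
    pivots = deque()
    pivot = end
    while pivot >= start:
        pivots.appendleft(pivot)
        pivot -= delta
    return list(pivots)

pivots = weekly_pivots(date(2023, 1, 2), date(2023, 3, 6), 2)
print(pivots[-1])  # 2023-03-06: the end date is always included
```

Here the start date (Jan 2) is not itself a pivot because the 2-week spacing does not land on it exactly; the first pivot is the earliest step that is still on or after the start date.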

90 changes: 45 additions & 45 deletions tests/builds/zika/auspice/zika_tip-frequencies.json
@@ -1,92 +1,92 @@
{
"BRA/2016/FC_6706": {
"frequencies": [
0.022579,
0.011312,
0.206983,
0.102282
0.022368,
0.011759,
0.207309,
0.102551
]
},
"COL/FLR_00008/2015": {
"frequencies": [
0.243453,
0.208443,
0.008645,
0.010659
0.244555,
0.204597,
0.008456,
0.010737
]
},
"Colombia/2016/ZC204Se": {
"frequencies": [
0.126056,
0.231597,
0.014167,
0.017099
0.12489,
0.234574,
0.013671,
0.017245
]
},
"DOM/2016/BB_0183": {
"frequencies": [
0.017888,
0.009342,
0.183671,
0.148759
0.017736,
0.009645,
0.185763,
0.148494
]
},
"EcEs062_16": {
"frequencies": [
0.018981,
0.009763,
0.190994,
0.134398
0.018817,
0.010092,
0.192704,
0.134289
]
},
"HND/2016/HU_ME59": {
"frequencies": [
0.009484,
0.006254,
0.090552,
0.457824
0.009425,
0.006444,
0.093589,
0.456546
]
},
"PAN/CDC_259359_V1_V3/2015": {
"frequencies": [
0.224832,
0.214575,
0.008963,
0.011176
0.225577,
0.211235,
0.008761,
0.011258
]
},
"PRVABC59": {
"frequencies": [
0.243453,
0.208443,
0.008645,
0.010659
0.244555,
0.204597,
0.008456,
0.010737
]
},
"VEN/UF_1/2016": {
"frequencies": [
0.030662,
0.016483,
0.206983,
0.070196
0.030336,
0.017436,
0.204461,
0.07079
]
},
"ZKC2/2016": {
"frequencies": [
0.062612,
0.083787,
0.080395,
0.036947
0.06174,
0.089622,
0.076833,
0.037353
]
},
"generated_by": {
"program": "augur",
"version": "7.0.2"
"version": "19.2.0"
},
"pivots": [
2015.75,
2016.0,
2016.25,
2016.5
2015.7521,
2016.0041,
2016.2527,
2016.5014
]
}
44 changes: 21 additions & 23 deletions tests/functional/frequencies.t
@@ -34,26 +34,24 @@ Pivots get calculated from the requested date range.
Calculate KDE-based tip frequencies for a time period with relative dates.
Testing relative dates deterministically from the shell is tricky.
To keep these tests simple and avoid freezing the system date to specific values, this test checks the logical consistency of the requested relative dates and pivot interval.
With a minimum date of 1 year (12 months) ago, a maximum date of 6 months ago, and a pivot interval of 3 months, we expect only 2 pivots in the final output.
Since the test data are much older than the time period requested, all strains will always have frequencies of 0 for the 2 requested pivots.
As long as we always calculate 2 pivots with frequencies of 0 for all strains, we can ignore the actual pivot values calculated for the relative dates in the diff below.

TODO: un-comment this test, which is failing on 2022-06-01 with 3 pivot points: https://github.com/nextstrain/augur/runs/6694014041

$ ${AUGUR} frequencies \
> --method kde \
> --tree "frequencies/tree.nwk" \
> --metadata "frequencies/metadata.tsv" \
> --pivot-interval 3 \
> --min-date 1Y \
> --max-date 6M \
> --output "$TMP/tip-frequencies.json" > /dev/null

$ python3 "$TESTDIR/../../scripts/diff_jsons.py" \
> --exclude-paths "root['generated_by']['version']" "root['pivots']" -- \
> "frequencies/zika_tip-frequencies_with_relative_dates.json" \
> "$TMP/tip-frequencies.json"
{}
$ rm -f "$TMP/tip-frequencies.json"

$ popd > /dev/null
With a minimum date of 12 months ago, a maximum date of 6 months ago, and a pivot interval of 3 months, we always expect 3 pivots in the final output corresponding to the end date (always included), the 3-month pivot before that end date, and the start date (also should always be included).
Since the test data are much older than the time period requested, all strains will always have frequencies of 0 for the 3 requested pivots.
As long as we always calculate 3 pivots with frequencies of 0 for all strains, we can ignore the actual pivot values calculated for the relative dates in the diff below.

$ ${AUGUR} frequencies \
> --method kde \
> --tree "frequencies/tree.nwk" \
> --metadata "frequencies/metadata.tsv" \
> --pivot-interval 3 \
> --min-date 12M \
> --max-date 6M \
> --output "$TMP/tip-frequencies.json" > /dev/null

$ python3 "$TESTDIR/../../scripts/diff_jsons.py" \
> --exclude-paths "root['generated_by']['version']" "root['pivots']" -- \
> "frequencies/zika_tip-frequencies_with_relative_dates.json" \
> "$TMP/tip-frequencies.json"
{}
$ rm -f "$TMP/tip-frequencies.json"

$ popd > /dev/null
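The 3-pivot expectation in the test above can be sanity-checked with standard date math. This is a sketch; `dateutil`'s `relativedelta` stands in for the ISO-8601 durations the implementation parses, and the "today" value is an arbitrary example:

```python
from datetime import date
from dateutil.relativedelta import relativedelta

# Assume an arbitrary "today"; min-date is 12 months ago, max-date 6 months ago.
today = date(2023, 1, 19)
start = today - relativedelta(months=12)  # 2022-01-19
end = today - relativedelta(months=6)     # 2022-07-19

# Walk backwards from the end date in 3-month steps, as get_pivots now does.
pivots = []
pivot = end
while pivot >= start:
    pivots.insert(0, pivot)
    pivot -= relativedelta(months=3)

print(pivots)  # three pivots: 2022-01-19, 2022-04-19, 2022-07-19
```

The end date and the 3-month pivot before it are always present, and here the start date lands exactly on a pivot as well, giving the 3 pivots the test asserts.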