| 1 | +Understanding and accessing versioned data |
| 2 | +========================================== |
| 3 | + |
| 4 | + |
| 5 | +The Epidata API records not just each signal's estimate for a given location |
| 6 | +on a given day, but also *when* that estimate was made, and all updates to that |
| 7 | +estimate. |
| 8 | + |
| 9 | +For example, let's look at the `doctor visits |
| 10 | +signal <https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/doctor-visits.html>`_ |
| 11 | +from the ``covidcast`` `endpoint <https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html>`_, |
| 12 | +which estimates the percentage of outpatient doctor visits that are |
| 13 | +COVID-related. |
| 14 | + |
| 15 | +Consider a result row with ``time_value = 2020-05-01`` for |
| 16 | +``geo_values = "pa"``. This is an estimate for Pennsylvania on |
| 17 | +May 1, 2020. That estimate was *issued* on May 5, 2020, the delay being due to |
| 18 | +the aggregation of data by our source and the time taken by the Epidata API to |
| 19 | +ingest the data provided. |
| 20 | + |
| 21 | +Later, the estimate for May 1st could be updated, |
| 22 | +perhaps because additional visit data from May 1st arrived at our source and was |
| 23 | +reported to us. This constitutes a new *issue* of the data. |
| 24 | + |
| 25 | + |
| 26 | +Data known "as of" a specific date |
| 27 | +---------------------------------- |
| 28 | + |
| 29 | +By default, endpoint functions fetch the most recent issue available. This |
| 30 | +is the best option for users who simply want to graph the latest data or |
| 31 | +construct dashboards. But if we are interested in knowing *when* data was |
| 32 | +reported, we can request specific data versions using the ``as_of``, ``issues``, or |
| 33 | +``lag`` arguments. |
| 34 | + |
| 35 | +**Note** that these are mutually exclusive; only one can be specified |
| 36 | +at a time. Also, not all endpoints support all three parameters, so please |
| 37 | +check the documentation for that specific endpoint. |
| 38 | + |
| 39 | +First, we can request the data that was available *as of* a specific date, using |
| 40 | +the ``as_of`` argument: |
| 41 | + |
| 42 | +>>> from epidatpy import EpiDataContext, EpiRange |
| 43 | +>>> epidata = EpiDataContext(use_cache=True, cache_max_age_days=1) |
| 44 | +>>> apicall = epidata.pub_covidcast( |
| 45 | +... data_source = "doctor-visits", |
| 46 | +... signals = "smoothed_cli", |
| 47 | +... time_type = "day", |
| 48 | +... time_values = EpiRange("2020-05-01", "2020-05-01"), |
| 49 | +... geo_type = "state", |
| 50 | +... geo_values = "pa", |
| 51 | +... as_of = "2020-05-07" |
| 52 | +... )
| 53 | +>>> apicall.df.head() |
| 54 | + source signal geo_type geo_value time_type time_value issue lag value stderr sample_size direction missing_value missing_stderr missing_sample_size |
| 55 | +0 doctor-visits smoothed_cli state pa day 2020-05-01 2020-05-07 6 2.32192 <NA> <NA> <NA> 0 5 5 |
| 56 | + |
| 57 | +This shows that an estimate of about 2.3% was issued on May 7. If we don't
| 58 | +specify ``as_of``, we get the most recent estimate available:
| 59 | + |
| 60 | +>>> apicall = epidata.pub_covidcast( |
| 61 | +... data_source = "doctor-visits", |
| 62 | +... signals = "smoothed_cli", |
| 63 | +... time_type = "day", |
| 64 | +... time_values = EpiRange("2020-05-01", "2020-05-01"), |
| 65 | +... geo_type = "state", |
| 66 | +... geo_values = "pa" |
| 67 | +... )
| 68 | +>>> apicall.df.head() |
| 69 | + source signal geo_type geo_value time_type time_value issue lag value stderr sample_size direction missing_value missing_stderr missing_sample_size |
| 70 | +0 doctor-visits smoothed_cli state pa day 2020-05-01 2020-07-04 64 5.075015 <NA> <NA> <NA> 0 5 5 |
| 71 | + |
| 72 | +Note the substantial change in the estimate, from less than 3% to over 5%, |
| 73 | +reflecting new data that became available after May 7 about visits *occurring on* |
| 74 | +May 1. This illustrates the importance of issue date tracking, particularly |
| 75 | +for forecasting tasks. To backtest a forecasting model on past data, it is |
| 76 | +important to use the data that would have been available *at the time* the model |
| 77 | +was or would have been fit, not data that arrived much later. |
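The selection rule behind ``as_of`` can be illustrated offline with a minimal sketch. The helper ``latest_issue_as_of`` is hypothetical (it is not part of ``epidatpy``), and the list holds only the two issues of the May 1 estimate shown above; the real version history may contain other issues in between.

```python
from datetime import date

# The two versions of the 2020-05-01 Pennsylvania estimate shown above,
# stored as (issue date, value) pairs.
versions = [
    (date(2020, 5, 7), 2.32192),
    (date(2020, 7, 4), 5.075015),
]


def latest_issue_as_of(versions, as_of):
    """Return the most recent (issue, value) pair issued on or before as_of.

    Hypothetical helper for illustration; returns None if nothing had been
    issued yet on that date.
    """
    known = [v for v in versions if v[0] <= as_of]
    return max(known, key=lambda v: v[0]) if known else None


# What a model fit on May 7 would have seen, versus the latest version.
backtest_view = latest_issue_as_of(versions, date(2020, 5, 7))
latest_view = latest_issue_as_of(versions, date(2020, 7, 4))
print(backtest_view[1], latest_view[1])
```

A backtest that used ``latest_view`` instead of ``backtest_view`` would be working with a value that was not yet known when the forecast would have been made.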
| 78 | + |
| 79 | +Multiple issues of observations |
| 80 | +------------------------------- |
| 81 | + |
| 82 | +By using the ``issues`` argument, we can request all issues in a certain time |
| 83 | +period: |
| 84 | + |
| 85 | +>>> apicall = epidata.pub_covidcast( |
| 86 | +... data_source = "doctor-visits", |
| 87 | +... signals = "smoothed_adj_cli", |
| 88 | +... time_type = "day", |
| 89 | +... time_values = EpiRange("2020-05-01", "2020-05-01"), |
| 90 | +... geo_type = "state", |
| 91 | +... geo_values = "pa", |
| 92 | +... issues = EpiRange("2020-05-01", "2020-05-15") |
| 93 | +... )
| 94 | +>>> apicall.df.head(7) |
| 95 | + source signal geo_type geo_value time_type time_value issue lag value stderr sample_size direction missing_value missing_stderr missing_sample_size |
| 96 | +0 doctor-visits smoothed_adj_cli state pa day 2020-05-01 2020-05-07 6 2.581509 <NA> <NA> <NA> 0 5 5 |
| 97 | +1 doctor-visits smoothed_adj_cli state pa day 2020-05-01 2020-05-08 7 3.278896 <NA> <NA> <NA> 0 5 5 |
| 98 | +2 doctor-visits smoothed_adj_cli state pa day 2020-05-01 2020-05-09 8 3.321781 <NA> <NA> <NA> 0 5 5 |
| 99 | +3 doctor-visits smoothed_adj_cli state pa day 2020-05-01 2020-05-12 11 3.588683 <NA> <NA> <NA> 0 5 5 |
| 100 | +4 doctor-visits smoothed_adj_cli state pa day 2020-05-01 2020-05-13 12 3.631978 <NA> <NA> <NA> 0 5 5 |
| 101 | +5 doctor-visits smoothed_adj_cli state pa day 2020-05-01 2020-05-14 13 3.658009 <NA> <NA> <NA> 0 5 5 |
| 102 | + |
| 103 | +This estimate was issued six times between May 7 and May 14 as new data for May 1st arrived, rising from about 2.6% to about 3.7%.
| 104 | + |
| 105 | +**Note** that these results include only data issued or updated between
| 106 | +2020-05-01 and 2020-05-15, inclusive. If a value was first reported on
| 107 | +2020-04-15 and never updated, a query for issues between 2020-05-01 and
| 108 | +2020-05-15 will not include that value among its results.
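The inclusive issue-range rule described in the note can be sketched locally. The function ``issued_between`` is a hypothetical stand-in for the server-side filter, and the last row is an invented placeholder for a value first reported on 2020-04-15 and never updated; the other six rows are the issues returned above.

```python
from datetime import date

# (time_value, issue) pairs. The first six are the rows returned above.
rows = [
    (date(2020, 5, 1), date(2020, 5, 7)),
    (date(2020, 5, 1), date(2020, 5, 8)),
    (date(2020, 5, 1), date(2020, 5, 9)),
    (date(2020, 5, 1), date(2020, 5, 12)),
    (date(2020, 5, 1), date(2020, 5, 13)),
    (date(2020, 5, 1), date(2020, 5, 14)),
    # Hypothetical row: first reported on 2020-04-15 and never updated.
    (date(2020, 4, 10), date(2020, 4, 15)),
]


def issued_between(rows, start, end):
    """Keep rows whose issue date lies in [start, end], both ends inclusive."""
    return [r for r in rows if start <= r[1] <= end]


selected = issued_between(rows, date(2020, 5, 1), date(2020, 5, 15))
print(len(selected))
```

The filter looks only at the issue date, so the placeholder row is dropped even though its value would still be the current one for its ``time_value``.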
| 109 | + |
| 110 | +Observations issued with a specific lag |
| 111 | +--------------------------------------- |
| 112 | + |
| 113 | +Finally, we can use the ``lag`` argument to request only data reported with a |
| 114 | +certain lag. For example, requesting a lag of 7 days fetches only data issued |
| 115 | +exactly 7 days after the corresponding ``time_value``: |
| 116 | + |
| 117 | +>>> apicall = epidata.pub_covidcast( |
| 118 | +... data_source = "doctor-visits", |
| 119 | +... signals = "smoothed_adj_cli", |
| 120 | +... time_type = "day", |
| 121 | +... time_values = EpiRange("2020-05-01", "2020-05-07"), |
| 122 | +... geo_type = "state", |
| 123 | +... geo_values = "pa", |
| 124 | +... lag = 7 |
| 125 | +... )
| 126 | +>>> apicall.df.head() |
| 127 | + source signal geo_type geo_value time_type time_value issue lag value stderr sample_size direction missing_value missing_stderr missing_sample_size |
| 128 | +0 doctor-visits smoothed_adj_cli state pa day 2020-05-01 2020-05-08 7 3.278896 <NA> <NA> <NA> 0 5 5 |
| 129 | +1 doctor-visits smoothed_adj_cli state pa day 2020-05-02 2020-05-09 7 3.225292 <NA> <NA> <NA> 0 5 5 |
| 130 | +2 doctor-visits smoothed_adj_cli state pa day 2020-05-05 2020-05-12 7 2.779908 <NA> <NA> <NA> 0 5 5 |
| 131 | +3 doctor-visits smoothed_adj_cli state pa day 2020-05-06 2020-05-13 7 2.557698 <NA> <NA> <NA> 0 5 5 |
| 132 | +4 doctor-visits smoothed_adj_cli state pa day 2020-05-07 2020-05-14 7 2.191677 <NA> <NA> <NA> 0 5 5 |
| 133 | + |
| 134 | +**Note** that although this query requested all values between 2020-05-01 and
| 135 | +2020-05-07, May 3rd and May 4th were *not* included in the result set. This is
| 136 | +because the query includes a result for May 3rd only if a value was issued
| 137 | +on May 10th (a 7-day lag), and in fact no value was issued on that day:
| 138 | + |
| 139 | +>>> apicall = epidata.pub_covidcast( |
| 140 | +... data_source = "doctor-visits", |
| 141 | +... signals = "smoothed_adj_cli", |
| 142 | +... time_type = "day", |
| 143 | +... time_values = EpiRange("2020-05-03", "2020-05-03"), |
| 144 | +... geo_type = "state", |
| 145 | +... geo_values = "pa", |
| 146 | +... issues = EpiRange("2020-05-09", "2020-05-15") |
| 147 | +... )
| 148 | +>>> apicall.df.head() |
| 149 | + source signal geo_type geo_value time_type time_value issue lag value stderr sample_size direction missing_value missing_stderr missing_sample_size |
| 150 | +0 doctor-visits smoothed_adj_cli state pa day 2020-05-03 2020-05-09 6 2.788618 <NA> <NA> <NA> 0 5 5 |
| 151 | +1 doctor-visits smoothed_adj_cli state pa day 2020-05-03 2020-05-12 9 3.015368 <NA> <NA> <NA> 0 5 5 |
| 152 | +2 doctor-visits smoothed_adj_cli state pa day 2020-05-03 2020-05-13 10 3.03931 <NA> <NA> <NA> 0 5 5 |
| 153 | +3 doctor-visits smoothed_adj_cli state pa day 2020-05-03 2020-05-14 11 3.021245 <NA> <NA> <NA> 0 5 5 |
| 154 | +4 doctor-visits smoothed_adj_cli state pa day 2020-05-03 2020-05-15 12 3.048725 <NA> <NA> <NA> 0 5 5 |
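In the output above, the ``lag`` column is the number of days between ``time_value`` and ``issue``. As a minimal sketch using only the standard library (the variable names here are illustrative, not part of ``epidatpy``), recomputing the lags for the May 3rd issues shows why a ``lag = 7`` query returns nothing for that day:

```python
from datetime import date

time_value = date(2020, 5, 3)

# Issue dates returned above for the 2020-05-03 estimate.
issues = [
    date(2020, 5, 9),
    date(2020, 5, 12),
    date(2020, 5, 13),
    date(2020, 5, 14),
    date(2020, 5, 15),
]

# Lag in days for each issue: issue date minus time_value.
lags = [(issue - time_value).days for issue in issues]
has_lag_7 = 7 in lags
print(lags, has_lag_7)
```

None of the computed lags equals 7, which matches the absence of a May 10th issue in the query result.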