Commit 99fee98

Port all initial vignettes
1 parent 4443f45 commit 99fee98

4 files changed

+330
-5
lines changed

docs/index.rst

Lines changed: 3 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -76,3 +76,6 @@ Contents
7676

7777
getting_started_with_epidatpy
7878

79+
signal_discovery
80+
81+
versioned_data

docs/signal_discovery.rst

Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
1+
2+
Finding data sources and signals of interest
3+
============================================
4+
5+
The Epidata API includes numerous data streams -- medical claims data, cases and deaths,
6+
mobility, and many others -- covering different geographic regions. This can make it a
7+
challenge to find the data stream that you are most interested in.
8+
9+
Example queries with all the endpoint functions available in this package are
10+
given below.
11+
12+
13+
Using the documentation
14+
-----------------------
15+
16+
The Epidata documentation lists all the data sources and signals available
17+
through the API for
18+
`COVID-19 <https://cmu-delphi.github.io/delphi-epidata/api/covidcast_signals.html>`_ and
19+
for `other diseases <https://cmu-delphi.github.io/delphi-epidata/api/README.html#source-specific-parameters>`_.
20+
The site also includes a search tool if you have a keyword (e.g. "Taiwan") in mind.
21+
22+
23+
Signal metadata
24+
---------------
25+
26+
The ``source_df`` property lets us obtain a Pandas DataFrame of metadata describing all
27+
data streams that are publicly accessible from the COVIDcast API. See the `data source
28+
and signals documentation <https://cmu-delphi.github.io/delphi-epidata/api/covidcast_signals.html>`_
29+
for descriptions of the available sources.
30+
31+
>>> from epidatpy import CovidcastEpidata
32+
>>> epidata = CovidcastEpidata()
33+
>>> sources = epidata.source_df
34+
>>> sources.head()
35+
source name description reference_signal license dua signals
36+
0 chng Change Healthcare Change Healthcare is a healthcare technology c... smoothed_outpatient_cli CC BY-NC https://cmu.box.com/s/cto4to822zecr3oyq1kkk9xm... smoothed_outpatient_cli,smoothed_adj_outpatien...
37+
1 covid-act-now Covid Act Now (CAN) COVID Act Now (CAN) tracks COVID-19 testing st... pcr_specimen_total_tests CC BY-NC None pcr_specimen_positivity_rate,pcr_specimen_tota...
38+
2 doctor-visits Doctor Visits From Claims Information about outpatient visits, provided ... smoothed_cli CC BY https://cmu.box.com/s/l2tz6kmiws6jyty2azwb43po... smoothed_cli,smoothed_adj_cli
39+
3 fb-survey Delphi US COVID-19 Trends and Impact Survey We conduct the Delphi US COVID-19 Trends and I... smoothed_cli CC BY https://cmu.box.com/s/qfxplcdrcn9retfzx4zniyug... raw_wcli,raw_cli,smoothed_cli,smoothed_wcli,ra...
40+
4 google-symptoms Google Symptoms Search Trends Google's [COVID-19 Search Trends symptoms data... s05_smoothed_search To download or use the data, you must agree to... None ageusia_raw_search,ageusia_smoothed_search,ano...
41+
42+
This DataFrame contains the following columns:
43+
44+
- ``source`` - Data source name.
45+
- ``name`` - Human-readable name of the data source.
46+
- ``description`` - Description of the data source.
47+
- ``reference_signal`` - Name of the signal that serves as the reference signal for this source (for example, ``smoothed_cli`` for ``doctor-visits``).
48+
- ``license`` - License under which the data is made available.
49+
- ``dua`` - Link to the Data Use Agreement.
50+
51+
The ``signal_df`` DataFrame can also be used to obtain information about the signals
52+
that are available - for example, what time range they are available for,
53+
and when they have been updated.
54+
55+
>>> signals = epidata.signal_df
56+
>>> signals.head()
57+
source signal name active short_description description time_type time_label value_label format category high_values_are is_smoothed is_weighted is_cumulative has_stderr has_sample_size geo_types
58+
0 chng smoothed_outpatient_cli COVID-Related Doctor Visits False Estimated percentage of outpatient doctor visi... Estimated percentage of outpatient doctor visi... day Date Value raw early bad True False False False False county,hhs,hrr,msa,nation,state
59+
1 chng smoothed_adj_outpatient_cli COVID-Related Doctor Visits (Day-adjusted) False Estimated percentage of outpatient doctor visi... Estimated percentage of outpatient doctor visi... day Date Value raw early bad True False False False False county,hhs,hrr,msa,nation,state
60+
2 chng smoothed_outpatient_covid COVID-Confirmed Doctor Visits False COVID-Confirmed Doctor Visits Estimated percentage of outpatient doctor visi... day Date Value raw early bad True False False False False county,hhs,hrr,msa,nation,state
61+
3 chng smoothed_adj_outpatient_covid COVID-Confirmed Doctor Visits (Day-adjusted) False COVID-Confirmed Doctor Visits Estimated percentage of outpatient doctor visi... day Date Value raw early bad True False False False False county,hhs,hrr,msa,nation,state
62+
4 chng smoothed_outpatient_flu Influenza-Confirmed Doctor Visits False Estimated percentage of outpatient doctor visi... Estimated percentage of outpatient doctor visi... day Day Value raw early bad True False False None None county,hhs,hrr,msa,nation,state
63+
64+
This DataFrame contains one row for each available signal, with the following columns:
65+
66+
- ``source`` - Data source name.
67+
- ``signal`` - Signal name.
68+
- ``name`` - Name of signal.
69+
- ``active`` - Whether the signal is currently being updated. Signals may be inactive because the sources have become unavailable, other sources have replaced them, or additional work is required for us to continue updating them.
70+
- ``short_description`` - Brief description of the signal.
71+
- ``description`` - Full description of the signal.
72+
- ``geo_types`` - Spatial resolution of the signal (e.g., `county`, `hrr`, `msa`, `dma`, `state`). More detail about all `geo_types` is given in the `geographic coding documentation <https://cmu-delphi.github.io/delphi-epidata/api/covidcast_geography.html>`_.
73+
- ``time_type`` - Temporal resolution of the signal (e.g., day, week; see `date coding details <https://cmu-delphi.github.io/delphi-epidata/api/covidcast_times.html>`_).
74+
- ``time_label`` - The time label ("Date", "Week").
75+
- ``value_label`` - The value label ("Value", "Percentage", "Visits", "Visits per 100,000 people").
76+
- ``format`` - The value format ("per100k", "percent", "fraction", "count", "raw").
77+
- ``category`` - The signal category ("early", "public", "late", "other").
78+
- ``high_values_are`` - What higher values of the signal indicate ("good", "bad", "neutral").
79+
- ``is_smoothed`` - Whether the signal is smoothed.
80+
- ``is_weighted`` - Whether the signal is weighted.
81+
- ``is_cumulative`` - Whether the signal is cumulative.
82+
- ``has_stderr`` - Whether the signal has a ``stderr`` statistic.
83+
- ``has_sample_size`` - Whether the signal has a ``sample_size`` statistic.
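The ``geo_types`` column packs several geographic levels into one comma-separated string, so finding the signals available at a given level means splitting that string. The following is a minimal sketch using plain dictionaries rather than a live API call. The first two rows copy values from the output above; the third row and the helper name ``available_at`` are hypothetical and only there for illustration. With a real ``signal_df``, rows of this shape could presumably be obtained with ``signals.to_dict("records")``.

```python
# Rows shaped like records of signal_df. The first two copy values shown in
# the output above; the third is a hypothetical row added for illustration.
rows = [
    {"source": "chng", "signal": "smoothed_outpatient_cli", "active": False,
     "geo_types": "county,hhs,hrr,msa,nation,state"},
    {"source": "chng", "signal": "smoothed_outpatient_flu", "active": False,
     "geo_types": "county,hhs,hrr,msa,nation,state"},
    {"source": "example-source", "signal": "example_signal", "active": True,
     "geo_types": "nation,state"},
]


def available_at(rows, geo_type):
    """Return (source, signal) pairs whose geo_types list contains geo_type."""
    return [
        (row["source"], row["signal"])
        for row in rows
        if geo_type in row["geo_types"].split(",")
    ]


county_signals = available_at(rows, "county")
state_signals = available_at(rows, "state")
print(county_signals)
print(state_signals)
```

Splitting on commas before testing membership avoids false matches on substrings, which a plain ``in`` check against the raw string would allow.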

docs/versioned_data.rst

Lines changed: 154 additions & 0 deletions
@@ -0,0 +1,154 @@
1+
Understanding and accessing versioned data
2+
==========================================
3+
4+
5+
The Epidata API records not just each signal's estimate for a given location
6+
on a given day, but also *when* that estimate was made, and all updates to that
7+
estimate.
8+
9+
For example, let's look at the `doctor visits
10+
signal <https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/doctor-visits.html>`_
11+
from the ``covidcast`` `endpoint <https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html>`_,
12+
which estimates the percentage of outpatient doctor visits that are
13+
COVID-related.
14+
15+
Consider a result row with ``time_value = 2020-05-01`` for
16+
``geo_values = "pa"``. This is an estimate for Pennsylvania on
17+
May 1, 2020. That estimate was *issued* on May 5, 2020, the delay being due to
18+
the aggregation of data by our source and the time taken by the Epidata API to
19+
ingest the data provided.
20+
21+
Later, the estimate for May 1st could be updated,
22+
perhaps because additional visit data from May 1st arrived at our source and was
23+
reported to us. This constitutes a new *issue* of the data.
24+
25+
26+
Data known "as of" a specific date
27+
----------------------------------
28+
29+
By default, endpoint functions fetch the most recent issue available. This
30+
is the best option for users who simply want to graph the latest data or
31+
construct dashboards. But if we are interested in knowing *when* data was
32+
reported, we can request specific data versions using the ``as_of``, ``issues``, or
33+
``lag`` arguments.
34+
35+
**Note** that these are mutually exclusive; only one can be specified
36+
at a time. Also, not all endpoints support all three parameters, so please
37+
check the documentation for that specific endpoint.
38+
39+
First, we can request the data that was available *as of* a specific date, using
40+
the ``as_of`` argument:
41+
42+
>>> from epidatpy import EpiDataContext, EpiRange
43+
>>> epidata = EpiDataContext(use_cache=True, cache_max_age_days=1)
44+
>>> apicall = epidata.pub_covidcast(
45+
... data_source = "doctor-visits",
46+
... signals = "smoothed_cli",
47+
... time_type = "day",
48+
... time_values = EpiRange("2020-05-01", "2020-05-01"),
49+
... geo_type = "state",
50+
... geo_values = "pa",
51+
... as_of = "2020-05-07"
52+
... )
53+
>>> apicall.df().head()
54+
source signal geo_type geo_value time_type time_value issue lag value stderr sample_size direction missing_value missing_stderr missing_sample_size
55+
0 doctor-visits smoothed_cli state pa day 2020-05-01 2020-05-07 6 2.32192 <NA> <NA> <NA> 0 5 5
56+
57+
This shows that an estimate of about 2.3% was issued on May 7. If we don't
58+
specify ``as_of``, we get the most recent estimate available:
59+
60+
>>> apicall = epidata.pub_covidcast(
61+
... data_source = "doctor-visits",
62+
... signals = "smoothed_cli",
63+
... time_type = "day",
64+
... time_values = EpiRange("2020-05-01", "2020-05-01"),
65+
... geo_type = "state",
66+
... geo_values = "pa"
67+
... )
68+
>>> apicall.df().head()
69+
source signal geo_type geo_value time_type time_value issue lag value stderr sample_size direction missing_value missing_stderr missing_sample_size
70+
0 doctor-visits smoothed_cli state pa day 2020-05-01 2020-07-04 64 5.075015 <NA> <NA> <NA> 0 5 5
71+
72+
Note the substantial change in the estimate, from less than 3% to over 5%,
73+
reflecting new data that became available after May 7 about visits *occurring on*
74+
May 1. This illustrates the importance of issue date tracking, particularly
75+
for forecasting tasks. To backtest a forecasting model on past data, it is
76+
important to use the data that would have been available *at the time* the model
77+
was or would have been fit, not data that arrived much later.
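The "as of" rule can be sketched without the API: among all issues of one observation, pick the latest one issued on or before the requested date. This is only an illustration of the semantics described above, not the server's implementation. The helper name ``value_as_of`` is hypothetical, and the list holds just the two issues shown in the outputs above; the real revision history may contain more.

```python
from datetime import date

# (issue date, value) pairs for doctor-visits smoothed_cli, Pennsylvania,
# time_value 2020-05-01, copied from the two query results above.
versions = [
    (date(2020, 5, 7), 2.32192),
    (date(2020, 7, 4), 5.075015),
]


def value_as_of(versions, as_of=None):
    """Return the (issue, value) pair with the latest issue on or before as_of.

    With as_of=None the most recent issue is returned. Returns None if no
    version had been issued by as_of.
    """
    known = [v for v in versions if as_of is None or v[0] <= as_of]
    return max(known, key=lambda v: v[0]) if known else None


print(value_as_of(versions, date(2020, 5, 7)))  # what a forecaster saw on May 7
print(value_as_of(versions))                    # the latest revision
```

A backtest would call the lookup with the date on which the model would have been fit, so that later revisions never leak into the training data.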
78+
79+
Multiple issues of observations
80+
-------------------------------
81+
82+
By using the ``issues`` argument, we can request all issues in a certain time
83+
period:
84+
85+
>>> apicall = epidata.pub_covidcast(
86+
... data_source = "doctor-visits",
87+
... signals = "smoothed_adj_cli",
88+
... time_type = "day",
89+
... time_values = EpiRange("2020-05-01", "2020-05-01"),
90+
... geo_type = "state",
91+
... geo_values = "pa",
92+
... issues = EpiRange("2020-05-01", "2020-05-15")
93+
... )
94+
>>> apicall.df().head(7)
95+
source signal geo_type geo_value time_type time_value issue lag value stderr sample_size direction missing_value missing_stderr missing_sample_size
96+
0 doctor-visits smoothed_adj_cli state pa day 2020-05-01 2020-05-07 6 2.581509 <NA> <NA> <NA> 0 5 5
97+
1 doctor-visits smoothed_adj_cli state pa day 2020-05-01 2020-05-08 7 3.278896 <NA> <NA> <NA> 0 5 5
98+
2 doctor-visits smoothed_adj_cli state pa day 2020-05-01 2020-05-09 8 3.321781 <NA> <NA> <NA> 0 5 5
99+
3 doctor-visits smoothed_adj_cli state pa day 2020-05-01 2020-05-12 11 3.588683 <NA> <NA> <NA> 0 5 5
100+
4 doctor-visits smoothed_adj_cli state pa day 2020-05-01 2020-05-13 12 3.631978 <NA> <NA> <NA> 0 5 5
101+
5 doctor-visits smoothed_adj_cli state pa day 2020-05-01 2020-05-14 13 3.658009 <NA> <NA> <NA> 0 5 5
102+
103+
This estimate was clearly updated many times as new data for May 1st arrived.
104+
105+
**Note** that these results include only data issued or updated between
106+
(inclusive) 2020-05-01 and 2020-05-15. If a value was first reported on
107+
2020-04-15, and never updated, a query for issues between 2020-05-01 and
108+
2020-05-15 will not include that value among its results.
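The inclusive issue window can be illustrated with a small filter over (time_value, issue, value) records. This is a sketch of the behaviour described in the note, not the API's own code. The helper name ``issued_between`` is hypothetical, and the first record (reported on 2020-04-15 and never updated) is invented to mirror the case in the note; the remaining records copy values from the output above.

```python
from datetime import date

# (time_value, issue, value) records.
records = [
    # Hypothetical value first reported on April 15 and never updated.
    (date(2020, 4, 10), date(2020, 4, 15), 1.0),
    # Issues of the May 1 estimate, copied from the output above.
    (date(2020, 5, 1), date(2020, 5, 7), 2.581509),
    (date(2020, 5, 1), date(2020, 5, 8), 3.278896),
    (date(2020, 5, 1), date(2020, 5, 14), 3.658009),
]


def issued_between(records, start, end):
    """Keep records whose issue date lies in [start, end], both ends inclusive."""
    return [r for r in records if start <= r[1] <= end]


selected = issued_between(records, date(2020, 5, 1), date(2020, 5, 15))
for time_value, issue, value in selected:
    print(time_value, issue, value)
```

The April record is dropped because the filter looks only at the issue date, not at whether the value is still the current one.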
109+
110+
Observations issued with a specific lag
111+
---------------------------------------
112+
113+
Finally, we can use the ``lag`` argument to request only data reported with a
114+
certain lag. For example, requesting a lag of 7 days fetches only data issued
115+
exactly 7 days after the corresponding ``time_value``:
116+
117+
>>> apicall = epidata.pub_covidcast(
118+
... data_source = "doctor-visits",
119+
... signals = "smoothed_adj_cli",
120+
... time_type = "day",
121+
... time_values = EpiRange("2020-05-01", "2020-05-07"),
122+
... geo_type = "state",
123+
... geo_values = "pa",
124+
... lag = 7
125+
... )
126+
>>> apicall.df().head()
127+
source signal geo_type geo_value time_type time_value issue lag value stderr sample_size direction missing_value missing_stderr missing_sample_size
128+
0 doctor-visits smoothed_adj_cli state pa day 2020-05-01 2020-05-08 7 3.278896 <NA> <NA> <NA> 0 5 5
129+
1 doctor-visits smoothed_adj_cli state pa day 2020-05-02 2020-05-09 7 3.225292 <NA> <NA> <NA> 0 5 5
130+
2 doctor-visits smoothed_adj_cli state pa day 2020-05-05 2020-05-12 7 2.779908 <NA> <NA> <NA> 0 5 5
131+
3 doctor-visits smoothed_adj_cli state pa day 2020-05-06 2020-05-13 7 2.557698 <NA> <NA> <NA> 0 5 5
132+
4 doctor-visits smoothed_adj_cli state pa day 2020-05-07 2020-05-14 7 2.191677 <NA> <NA> <NA> 0 5 5
133+
134+
**Note** that though this query requested all values between 2020-05-01 and
135+
2020-05-07, May 3rd and May 4th were *not* included in the result set. This is
136+
because the query only includes a result for May 3rd if a value was issued
137+
on May 10th (a 7-day lag), but in fact the value was not updated on that day:
138+
139+
>>> apicall = epidata.pub_covidcast(
140+
... data_source = "doctor-visits",
141+
... signals = "smoothed_adj_cli",
142+
... time_type = "day",
143+
... time_values = EpiRange("2020-05-03", "2020-05-03"),
144+
... geo_type = "state",
145+
... geo_values = "pa",
146+
... issues = EpiRange("2020-05-09", "2020-05-15")
147+
... )
148+
>>> apicall.df().head()
149+
source signal geo_type geo_value time_type time_value issue lag value stderr sample_size direction missing_value missing_stderr missing_sample_size
150+
0 doctor-visits smoothed_adj_cli state pa day 2020-05-03 2020-05-09 6 2.788618 <NA> <NA> <NA> 0 5 5
151+
1 doctor-visits smoothed_adj_cli state pa day 2020-05-03 2020-05-12 9 3.015368 <NA> <NA> <NA> 0 5 5
152+
2 doctor-visits smoothed_adj_cli state pa day 2020-05-03 2020-05-13 10 3.03931 <NA> <NA> <NA> 0 5 5
153+
3 doctor-visits smoothed_adj_cli state pa day 2020-05-03 2020-05-14 11 3.021245 <NA> <NA> <NA> 0 5 5
154+
4 doctor-visits smoothed_adj_cli state pa day 2020-05-03 2020-05-15 12 3.048725 <NA> <NA> <NA> 0 5 5
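The lag column is simply the number of days between ``time_value`` and ``issue``, which makes it easy to check why May 3rd is missing from the ``lag = 7`` query. The sketch below recomputes the lags from the issue dates in the output above, using only the standard library; the variable names are illustrative.

```python
from datetime import date

time_value = date(2020, 5, 3)

# Issue dates for the May 3 estimate, copied from the output above.
issues = [
    date(2020, 5, 9),
    date(2020, 5, 12),
    date(2020, 5, 13),
    date(2020, 5, 14),
    date(2020, 5, 15),
]

# Lag in days between the observation date and each issue.
lags = [(issue - time_value).days for issue in issues]
print(lags)  # [6, 9, 10, 11, 12]

# No issue falls exactly 7 days after May 3, so lag = 7 matches nothing.
matching = [issue for issue, lag in zip(issues, lags) if lag == 7]
print(matching)  # []
```

Because the filter is an exact match on the lag, a gap in the issue schedule (here, no issue on May 10th) removes that ``time_value`` from the results entirely.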

docs_smoke_test.py

Lines changed: 90 additions & 5 deletions
@@ -1,7 +1,16 @@
11
from epidatpy import CovidcastEpidata, EpiDataContext, EpiRange
22
import pandas as pd
33

4+
# Set common options and context
5+
6+
pd.set_option('display.max_columns', None)
7+
pd.set_option('display.max_rows', None)
8+
pd.set_option('display.width', 1000)
9+
410
epidata = EpiDataContext(use_cache=True, cache_max_age_days=1)
11+
12+
# Getting started with epidatpy
13+
514
apicall = epidata.pub_covidcast(
615
data_source = "fb-survey",
716
signals = "smoothed_cli",
@@ -11,10 +20,6 @@
1120
time_values = EpiRange(20210405, 20210410))
1221
print(apicall)
1322

14-
pd.set_option('display.max_columns', None)
15-
pd.set_option('display.max_rows', None)
16-
pd.set_option('display.width', 1000)
17-
1823
data = apicall.df()
1924
print(data.head())
2025

@@ -72,4 +77,84 @@
7277

7378
data.plot(x="time_value", y="value", title="Smoothed CLI from Facebook Survey", xlabel="Date", ylabel="CLI")
7479
plt.subplots_adjust(bottom=.2)
75-
plt.show()
80+
plt.show()
81+
82+
# Signal discovery
83+
84+
epidata2 = CovidcastEpidata()
85+
sources = epidata2.source_df
86+
print(sources.head())
87+
88+
signals = epidata2.signal_df
89+
print(signals.head())
90+
91+
# Versioned data
92+
93+
apicall6 = epidata.pub_covidcast(
94+
data_source = "doctor-visits",
95+
signals = "smoothed_cli",
96+
time_type = "day",
97+
time_values = EpiRange("2020-05-01", "2020-05-01"),
98+
geo_type = "state",
99+
geo_values = "pa",
100+
as_of = "2020-05-07"
101+
)
102+
print(apicall6)
103+
104+
data6 = apicall6.df()
105+
print(data6.head())
106+
107+
apicall7 = epidata.pub_covidcast(
108+
data_source = "doctor-visits",
109+
signals = "smoothed_cli",
110+
time_type = "day",
111+
time_values = EpiRange("2020-05-01", "2020-05-01"),
112+
geo_type = "state",
113+
geo_values = "pa"
114+
)
115+
print(apicall7)
116+
117+
data7 = apicall7.df()
118+
print(data7.head())
119+
120+
apicall8 = epidata.pub_covidcast(
121+
data_source = "doctor-visits",
122+
signals = "smoothed_adj_cli",
123+
time_type = "day",
124+
time_values = EpiRange("2020-05-01", "2020-05-01"),
125+
geo_type = "state",
126+
geo_values = "pa",
127+
issues = EpiRange("2020-05-01", "2020-05-15")
128+
)
129+
print(apicall8)
130+
131+
data8 = apicall8.df()
132+
print(data8.head(7))
133+
134+
apicall9 = epidata.pub_covidcast(
135+
data_source = "doctor-visits",
136+
signals = "smoothed_adj_cli",
137+
time_type = "day",
138+
time_values = EpiRange("2020-05-01", "2020-05-07"),
139+
geo_type = "state",
140+
geo_values = "pa",
141+
lag = 7
142+
)
143+
print(apicall9)
144+
145+
data9 = apicall9.df()
146+
print(data9.head())
147+
148+
apicall10 = epidata.pub_covidcast(
149+
data_source = "doctor-visits",
150+
signals = "smoothed_adj_cli",
151+
time_type = "day",
152+
time_values = EpiRange("2020-05-03", "2020-05-03"),
153+
geo_type = "state",
154+
geo_values = "pa",
155+
issues = EpiRange("2020-05-09", "2020-05-15")
156+
)
157+
print(apicall10)
158+
159+
data10 = apicall10.df()
160+
print(data10.head())
