| 1 | +Getting started with epidatpy |
| 2 | +============================= |
| 3 | + |
| 4 | +The epidatpy package provides access to all the endpoints of the `Delphi Epidata |
| 5 | +API <https://cmu-delphi.github.io/delphi-epidata/>`_, and can be used to make |
| 6 | +requests for specific signals on specific dates and in select geographic |
| 7 | +regions. |
| 8 | + |
| 9 | +Setup |
| 10 | +----- |
| 11 | + |
| 12 | +**Installation** |
| 13 | + |
| 14 | +You can install the stable version of this package from PyPI:: |
| 15 | + |
| 16 | +    pip install epidatpy |
| 17 | + |
| 18 | +Or, if you want the development version, install from GitHub:: |
| 19 | + |
| 20 | +    pip install -e "git+https://github.com/cmu-delphi/epidatpy.git#egg=epidatpy" |
| 21 | + |
| 22 | +**API Keys** |
| 23 | + |
| 24 | +The Delphi API requires a (free) API key for full functionality. While most |
| 25 | +endpoints are available without one, there are |
| 26 | +`limits on API usage for anonymous users <https://cmu-delphi.github.io/delphi-epidata/api/api_keys.html>`_, |
| 27 | +including a rate limit. |
| 28 | + |
| 29 | +To generate your key, |
| 30 | +`register for a pseudo-anonymous account <https://api.delphi.cmu.edu/epidata/admin/registration_form>`_. |
| 31 | + |
| 32 | +*Note* that private endpoints (i.e. those prefixed with ``pvt_``) require a |
| 33 | +separate key that needs to be passed as an argument. These endpoints require |
| 34 | +specific data use agreements to access. |
| 35 | + |
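As a rough sketch of key handling, the snippet below reads a key from an environment variable instead of hard-coding it in a script. The variable name ``DELPHI_EPIDATA_KEY`` and the helper ``load_api_key`` are assumptions made for illustration; check the epidatpy documentation for how the package actually discovers keys.

```python
import os
from typing import Mapping, Optional

# Assumed variable name, used here for illustration only.
KEY_VARIABLE = "DELPHI_EPIDATA_KEY"


def load_api_key(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the key stored in the environment, or None for anonymous use."""
    key = env.get(KEY_VARIABLE, "").strip()
    return key or None


if __name__ == "__main__":
    key = load_api_key()
    print("key found" if key else "no key set; anonymous limits apply")
```

Keeping the key out of source files makes it less likely to end up in version control.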
| 36 | +Basic Usage |
| 37 | +----------- |
| 38 | + |
| 39 | +Fetching data from the Delphi Epidata API is simple. Suppose we are |
| 40 | +interested in the ``covidcast`` |
| 41 | +`endpoint <https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html>`_, |
| 42 | +which provides access to a |
| 43 | +`wide range of data <https://cmu-delphi.github.io/delphi-epidata/api/covidcast_signals.html>`_ |
| 44 | +on COVID-19. Reviewing the endpoint documentation, we see that we |
| 45 | +`need to specify <https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html#constructing-api-queries>`_ |
| 46 | +a data source name, a signal name, a geographic level, a time resolution, and |
| 47 | +the location and times of interest. |
| 48 | + |
| 49 | +The ``pub_covidcast`` function lets us access the ``covidcast`` endpoint: |
| 50 | + |
| 51 | +>>> from epidatpy import EpiDataContext, EpiRange |
| 52 | +>>> epidata = EpiDataContext(use_cache=True, cache_max_age_days=1) |
| 53 | +>>> # Obtain the most up-to-date version of the smoothed COVID-like illness (CLI) |
| 54 | +>>> # signal from the COVID-19 Trends and Impact Survey for the US |
| 55 | +>>> apicall = epidata.pub_covidcast( |
| 56 | +... data_source = "fb-survey", |
| 57 | +... signals = "smoothed_cli", |
| 58 | +... geo_type = "nation", |
| 59 | +... time_type = "day", |
| 60 | +... geo_values = "us", |
| 61 | +... time_values = EpiRange(20210405, 20210410)) |
| 62 | +EpiDataCall(endpoint=covidcast/, params={'data_source': 'fb-survey', 'signals': 'smoothed_cli', 'geo_type': 'nation', 'time_type': 'day', 'geo_values': 'us', 'time_values': '20210405-20210410'}) |
| 63 | + |
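The printed ``EpiDataCall`` above lists an endpoint and a dictionary of parameters. As a minimal sketch of how such a pair could map onto an HTTP query string, the snippet below uses only the standard library. The base URL and the helper ``build_query_url`` are assumptions for illustration and are not taken from epidatpy's internals.

```python
from urllib.parse import urlencode

# Assumed base URL for the Epidata API; not read from epidatpy itself.
BASE_URL = "https://api.delphi.cmu.edu/epidata/"


def build_query_url(endpoint: str, params: dict) -> str:
    """Append the endpoint and URL-encoded parameters to the base URL."""
    return f"{BASE_URL}{endpoint}?{urlencode(params)}"


# Endpoint and parameters copied from the EpiDataCall shown above.
params = {
    "data_source": "fb-survey",
    "signals": "smoothed_cli",
    "geo_type": "nation",
    "time_type": "day",
    "geo_values": "us",
    "time_values": "20210405-20210410",
}
url = build_query_url("covidcast/", params)
print(url)
```

Note that the ``EpiRange`` appears in the parameters as a single ``start-end`` string.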
| 64 | +``pub_covidcast`` returns an ``EpiDataCall``, which can be converted into different output formats, such as a pandas DataFrame: |
| 65 | + |
| 66 | +>>> data = apicall.df() |
| 67 | +>>> data.head() |
| 68 | + source signal geo_type geo_value time_type time_value issue lag value stderr sample_size direction missing_value missing_stderr missing_sample_size |
| 69 | +0 fb-survey smoothed_cli nation us day 2021-04-05 2021-04-10 5 0.675832 0.014826 244046 <NA> 0 0 0 |
| 70 | +1 fb-survey smoothed_cli nation us day 2021-04-06 2021-04-11 5 0.690687 0.014998 242979 <NA> 0 0 0 |
| 71 | +2 fb-survey smoothed_cli nation us day 2021-04-07 2021-04-12 5 0.690664 0.015023 242153 <NA> 0 0 0 |
| 72 | +3 fb-survey smoothed_cli nation us day 2021-04-08 2021-04-13 5 0.706503 0.015236 241380 <NA> 0 0 0 |
| 73 | +4 fb-survey smoothed_cli nation us day 2021-04-09 2021-04-14 5 0.724306 0.015466 240256 <NA> 0 0 0 |
| 74 | + |
| 75 | +Each row represents one observation in the US on one |
| 76 | +day. The location code is given in the ``geo_value`` column and the date in |
| 77 | +the ``time_value`` column. Here ``value`` is the requested signal (in this |
| 78 | +case, the smoothed estimate of the percentage of people with COVID-like |
| 79 | +illness, based on the symptom surveys), and ``stderr`` is its standard error. |
| 80 | + |
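To make the role of ``stderr`` concrete, here is a small worked example that turns the first row's ``value`` and ``stderr`` into an approximate 95% interval. It assumes a normal approximation (``value`` plus or minus 1.96 standard errors), which is our simplification and not something the API documentation prescribes; ``approx_interval`` is a hypothetical helper.

```python
Z_95 = 1.96  # normal quantile for a two-sided 95% interval (assumption)


def approx_interval(value: float, stderr: float, z: float = Z_95):
    """Return (low, high) bounds of value +/- z * stderr."""
    half_width = z * stderr
    return value - half_width, value + half_width


# value and stderr copied from the 2021-04-05 row of the output above.
low, high = approx_interval(0.675832, 0.014826)
print(f"{low:.4f} to {high:.4f}")
```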
| 81 | +The Epidata API makes signals available at different geographic levels, |
| 82 | +depending on the endpoint. To request signals for all states instead of the |
| 83 | +entire US, we set ``geo_type`` to ``"state"`` and pass ``"*"`` as the |
| 84 | +``geo_values`` argument. (Only some endpoints allow the use of ``*`` to |
| 85 | +access data at all locations. Check the help for a given endpoint to see if |
| 86 | +it supports ``*``.) |
| 87 | + |
| 88 | +>>> apicall = epidata.pub_covidcast( |
| 89 | +... data_source = "fb-survey", |
| 90 | +... signals = "smoothed_cli", |
| 91 | +... geo_type = "state", |
| 92 | +... time_type = "day", |
| 93 | +... geo_values = "*", |
| 94 | +... time_values = EpiRange(20210405, 20210410)) |
| 95 | +EpiDataCall(endpoint=covidcast/, params={'data_source': 'fb-survey', 'signals': 'smoothed_cli', 'geo_type': 'state', 'time_type': 'day', 'geo_values': '*', 'time_values': '20210405-20210410'}) |
| 96 | +>>> apicall.df().head() |
| 97 | + source signal geo_type geo_value time_type time_value issue lag value stderr sample_size direction missing_value missing_stderr missing_sample_size |
| 98 | +0 fb-survey smoothed_cli state ak day 2021-04-05 2021-04-10 5 0.736883 0.275805 720.0 <NA> 0 0 0 |
| 99 | +1 fb-survey smoothed_cli state al day 2021-04-05 2021-04-10 5 0.796627 0.137734 3332.1117 <NA> 0 0 0 |
| 100 | +2 fb-survey smoothed_cli state ar day 2021-04-05 2021-04-10 5 0.561916 0.131108 2354.9911 <NA> 0 0 0 |
| 101 | +3 fb-survey smoothed_cli state az day 2021-04-05 2021-04-10 5 0.62283 0.105354 4742.2778 <NA> 0 0 0 |
| 102 | +4 fb-survey smoothed_cli state ca day 2021-04-05 2021-04-10 5 0.444169 0.040576 21382.3806 <NA> 0 0 0 |
| 103 | + |
| 104 | +We can fetch a subset of states by listing out the desired locations: |
| 105 | + |
| 106 | +>>> apicall = epidata.pub_covidcast( |
| 107 | +... data_source = "fb-survey", |
| 108 | +... signals = "smoothed_cli", |
| 109 | +... geo_type = "state", |
| 110 | +... time_type = "day", |
| 111 | +... geo_values = "pa,ca,fl", |
| 112 | +... time_values = EpiRange(20210405, 20210410)) |
| 113 | +EpiDataCall(endpoint=covidcast/, params={'data_source': 'fb-survey', 'signals': 'smoothed_cli', 'geo_type': 'state', 'time_type': 'day', 'geo_values': 'pa,ca,fl', 'time_values': '20210405-20210410'}) |
| 114 | +>>> apicall.df().head() |
| 115 | + source signal geo_type geo_value time_type time_value issue lag value stderr sample_size direction missing_value missing_stderr missing_sample_size |
| 116 | +0 fb-survey smoothed_cli state ca day 2021-04-05 2021-04-10 5 0.444169 0.040576 21382.3806 <NA> 0 0 0 |
| 117 | +1 fb-survey smoothed_cli state fl day 2021-04-05 2021-04-10 5 0.690415 0.058204 16099.0005 <NA> 0 0 0 |
| 118 | +2 fb-survey smoothed_cli state pa day 2021-04-05 2021-04-10 5 0.715758 0.072999 10894.0057 <NA> 0 0 0 |
| 119 | +3 fb-survey smoothed_cli state ca day 2021-04-06 2021-04-11 5 0.45604 0.04127 21176.3902 <NA> 0 0 0 |
| 120 | +4 fb-survey smoothed_cli state fl day 2021-04-06 2021-04-11 5 0.730692 0.059907 15975.0007 <NA> 0 0 0 |
| 121 | + |
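In the multi-state output above, rows for different locations are interleaved by date. One possible way to split such rows into one series per location is sketched below with plain tuples and the standard library. The helper ``split_by_location`` is hypothetical, and the values are copied from the five rows shown.

```python
from collections import defaultdict

# (geo_value, time_value, value) copied from the five rows shown above.
rows = [
    ("ca", "2021-04-05", 0.444169),
    ("fl", "2021-04-05", 0.690415),
    ("pa", "2021-04-05", 0.715758),
    ("ca", "2021-04-06", 0.45604),
    ("fl", "2021-04-06", 0.730692),
]


def split_by_location(rows):
    """Group (time_value, value) pairs under their geo_value, keeping row order."""
    series = defaultdict(list)
    for geo_value, time_value, value in rows:
        series[geo_value].append((time_value, value))
    return dict(series)


by_state = split_by_location(rows)
print(by_state["ca"])
```

With a DataFrame in hand, a ``groupby`` on ``geo_value`` would serve the same purpose.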
| 122 | +We can also request data for a single location at a time, via the ``geo_values`` argument: |
| 123 | + |
| 124 | +>>> apicall = epidata.pub_covidcast( |
| 125 | +... data_source = "fb-survey", |
| 126 | +... signals = "smoothed_cli", |
| 127 | +... geo_type = "state", |
| 128 | +... time_type = "day", |
| 129 | +... geo_values = "pa", |
| 130 | +... time_values = EpiRange(20210405, 20210410)) |
| 131 | +EpiDataCall(endpoint=covidcast/, params={'data_source': 'fb-survey', 'signals': 'smoothed_cli', 'geo_type': 'state', 'time_type': 'day', 'geo_values': 'pa', 'time_values': '20210405-20210410'}) |
| 132 | +>>> apicall.df().head() |
| 133 | + source signal geo_type geo_value time_type time_value issue lag value stderr sample_size direction missing_value missing_stderr missing_sample_size |
| 134 | +0 fb-survey smoothed_cli state pa day 2021-04-05 2021-04-10 5 0.715758 0.072999 10894.0057 <NA> 0 0 0 |
| 135 | +1 fb-survey smoothed_cli state pa day 2021-04-06 2021-04-11 5 0.69321 0.070869 10862.0055 <NA> 0 0 0 |
| 136 | +2 fb-survey smoothed_cli state pa day 2021-04-07 2021-04-12 5 0.685934 0.070654 10790.0054 <NA> 0 0 0 |
| 137 | +3 fb-survey smoothed_cli state pa day 2021-04-08 2021-04-13 5 0.681511 0.071394 10731.0044 <NA> 0 0 0 |
| 138 | +4 fb-survey smoothed_cli state pa day 2021-04-09 2021-04-14 5 0.709416 0.072162 10590.0049 <NA> 0 0 0 |
| 139 | + |
| 140 | +Getting versioned data |
| 141 | +---------------------- |
| 142 | + |
| 143 | +The Epidata API stores a historical record of all data, including corrections |
| 144 | +and updates, which is particularly useful for accurately backtesting |
| 145 | +forecasting models. To fetch the data as it was available on a given date, |
| 146 | +we can use the ``as_of`` argument: |
| 147 | + |
| 148 | +>>> apicall = epidata.pub_covidcast( |
| 149 | +... data_source = "fb-survey", |
| 150 | +... signals = "smoothed_cli", |
| 151 | +... geo_type = "state", |
| 152 | +... time_type = "day", |
| 153 | +... geo_values = "pa,ca,fl", |
| 154 | +... time_values = EpiRange(20210405, 20210410), |
| 155 | +... as_of = "2021-06-01") |
| 156 | + |
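To illustrate what an ``as_of`` query means, the sketch below picks, from a list of versions of one observation, the most recent one issued on or before a given date. It is a conceptual model only, not how the server is implemented; ``value_as_of`` is a hypothetical helper. The single version used is the Pennsylvania row for 2021-04-05 shown earlier (issued 2021-04-10).

```python
from datetime import date


def value_as_of(versions, as_of):
    """Return the (issue, value) pair with the latest issue <= as_of, or None."""
    known = [version for version in versions if version[0] <= as_of]
    return max(known, key=lambda version: version[0]) if known else None


# One version of the Pennsylvania observation for 2021-04-05, from the output above.
versions = [(date(2021, 4, 10), 0.715758)]

# Before the first issue date nothing had been published yet.
print(value_as_of(versions, date(2021, 4, 8)))

# By 2021-06-01 the version issued on 2021-04-10 is visible.
print(value_as_of(versions, date(2021, 6, 1)))
```

If later corrections existed, they would appear as extra entries in ``versions`` and the same rule would select among them.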
| 157 | +Plotting |
| 158 | +-------- |
| 159 | + |
| 160 | +Because the output data is a standard pandas DataFrame, we can easily plot |
| 161 | +it using any of the available Python plotting libraries: |
| 162 | + |
| 163 | +>>> data.plot(x="time_value", y="value", title="Smoothed CLI from Facebook Survey", xlabel="Date", ylabel="CLI") |
| 164 | + |
| 165 | +.. image:: images/Figure_1.png |
| 166 | + :width: 800 |
| 167 | + :alt: Smoothed CLI from Facebook Survey |
| 168 | + |
| 169 | +Finding locations of interest |
| 170 | +----------------------------- |
| 171 | + |
| 172 | +Most data is only available for the US. Select endpoints report other countries at the national and/or regional levels. Endpoint descriptions explicitly state when they cover non-US locations. |
| 173 | + |
| 174 | +For endpoints that report US data, see the |
| 175 | +`geographic coding documentation <https://cmu-delphi.github.io/delphi-epidata/api/covidcast_geography.html>`_ |
| 176 | +for available geographic levels. |
| 177 | + |
| 178 | +International data |
| 179 | +------------------ |
| 180 | + |
| 181 | +International data is available via the following endpoints: |
| 182 | + |
| 183 | +- ``pub_dengue_nowcast`` (North and South America) |
| 184 | +- ``pub_ecdc_ili`` (Europe) |
| 185 | +- ``pub_kcdc_ili`` (Korea) |
| 186 | +- ``pub_nidss_dengue`` (Taiwan) |
| 187 | +- ``pub_nidss_flu`` (Taiwan) |
| 188 | +- ``pub_paho_dengue`` (North and South America) |
| 189 | +- ``pvt_dengue_sensors`` (North and South America) |
| 190 | + |
| 191 | +Finding data sources and signals of interest |
| 192 | +-------------------------------------------- |
| 193 | + |
| 194 | +Above we used data from `Delphi’s symptom surveys <https://delphi.cmu.edu/covid19/ctis/>`_, |
| 195 | +but the Epidata API includes numerous data streams: medical claims data, cases |
| 196 | +and deaths, mobility, and many others. This can make it a challenge to find |
| 197 | +the data stream that you are most interested in. |
| 198 | + |
| 199 | +The Epidata documentation lists all the data sources and signals available |
| 200 | +through the API for `COVID-19 <https://cmu-delphi.github.io/delphi-epidata/api/covidcast_signals.html>`_ |
| 201 | +and for `other diseases <https://cmu-delphi.github.io/delphi-epidata/api/README.html#source-specific-parameters>`_. |