This project is part of the Delaware North Data and Analytics Project. The goal of this project is to analyze COVID-19 testing data in the United States and provide insights into metrics related to testing and patient outcomes.
The US Department of Health and Human Services collects and publishes COVID-19 testing and patient outcome data at the federal level. The project reports the following metrics:
- The total number of PCR tests performed as of yesterday in the United States.
- The 7-day rolling average number of new cases per day for the last 30 days.
- The 10 states with the highest test positivity rate (positive tests / tests performed) for tests performed in the last 30 days.
- As of yesterday, there have been 1,043,290,261 PCR tests reported in the United States. This number was found by taking the sum of all entries in the `new_results_reported` column of the dataset. Each new result represents a PCR test, regardless of the outcome (positive, negative, or uncertain).
- The 7-day rolling average number of new cases per day for the last 30 days shows promising results. The rolling average appears to be decreasing sharply, particularly since the first week of May. To find the rolling average of new cases per day, I first filtered the data to keep only positive results. I then grouped the data by date and summed the `new_results_reported` column to get the number of positive cases for each day. Next, I used the pandas `.rolling` method to calculate the average number of cases over a 7-day window and kept the last 30 days of the resulting series to get my results.
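The two calculations above can be sketched in pandas as follows. This is a minimal illustration on made-up rows, not the notebook's actual code: the `date` and `overall_outcome` column names and the `"Positive"` label are assumptions, and the helper names `total_tests` and `rolling_positive_cases` are hypothetical. Only `new_results_reported` comes from the description above.

```python
import pandas as pd


def total_tests(df: pd.DataFrame) -> int:
    """Sum every reported result, whatever its outcome."""
    return int(df["new_results_reported"].sum())


def rolling_positive_cases(df: pd.DataFrame, window: int = 7, last_n_days: int = 30) -> pd.Series:
    """Rolling mean of daily positive results, limited to the most recent days."""
    # "overall_outcome" and its "Positive" label are assumed names.
    positive = df[df["overall_outcome"] == "Positive"]
    daily = positive.groupby("date")["new_results_reported"].sum().sort_index()
    # An integer window counts rows, so this assumes one row per calendar day.
    return daily.rolling(window).mean().tail(last_n_days)


# Made-up data: 10 days, positives rising by 10 per day, 100 negatives per day.
dates = pd.date_range("2023-05-01", periods=10, freq="D")
rows = []
for i, day in enumerate(dates):
    rows.append({"date": day, "overall_outcome": "Positive", "new_results_reported": 10 * (i + 1)})
    rows.append({"date": day, "overall_outcome": "Negative", "new_results_reported": 100})
sample = pd.DataFrame(rows)

print(total_tests(sample))
print(rolling_positive_cases(sample, window=7, last_n_days=3))
```

If some calendar days were missing from the data, a row-count window would span more than seven days, so a time-based window on a datetime index (for example `daily.rolling("7D")`) may be a better fit in that case.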
3. The 10 states with the highest test positivity rate (positive tests / tests performed) for tests performed in the last 30 days.
- To find this metric, I first retrieved the date 30 days prior to the current date. The code is dynamic, so it subtracts 30 days from the day the code is run using a timedelta operation. I used that date to create a new DataFrame, `df3`, that keeps only rows dated on or after the cutoff, effectively returning reports from the last 30 days.
- Next, I grouped the data in `df3` by state and ran an aggregate operation to sum the values in the `new_results_reported` column to get the total number of tests performed for each state.
- In a separate DataFrame, `df3_positive`, I filtered the data to only include positive cases and then did the same grouping and aggregation as in the previous step. This resulted in a DataFrame with the total number of positive tests for each state.
- Finally, I combined the DataFrames so that the resulting DataFrame had the columns `state_name`, `total_positive_tests`, and `total_cases` (where `total_cases` holds the total number of tests performed). I then divided the total positive tests by the total cases for each state to calculate the positivity rate.
- Sorting by positivity rate in descending order results in the following output:
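For reference, the filtering, grouping, and division steps in this list can be sketched as shown here. It is a minimal sketch on made-up rows with fictional state names, not the notebook's code or its real output. The `date` and `overall_outcome` column names, the `"Positive"` label, and the `top_positivity` helper are assumptions. The run date is passed in explicitly as `today` so the example is repeatable, whereas the description above uses the actual run date.

```python
from datetime import timedelta

import pandas as pd


def top_positivity(df: pd.DataFrame, today: pd.Timestamp, days: int = 30, n: int = 10) -> pd.DataFrame:
    """Return the n states with the highest positivity rate in the last `days` days."""
    cutoff = today - timedelta(days=days)
    df3 = df[df["date"] >= cutoff]

    # Total tests per state (named total_cases, as in the description above).
    total = df3.groupby("state_name")["new_results_reported"].sum().rename("total_cases")

    # Positive tests per state; "overall_outcome" is an assumed column name.
    df3_positive = df3[df3["overall_outcome"] == "Positive"]
    positive = (
        df3_positive.groupby("state_name")["new_results_reported"]
        .sum()
        .rename("total_positive_tests")
    )

    # Left join keeps states with no positive rows; treat those as zero positives.
    combined = total.to_frame().join(positive).fillna({"total_positive_tests": 0})
    combined["positivity_rate"] = combined["total_positive_tests"] / combined["total_cases"]
    return combined.sort_values("positivity_rate", ascending=False).head(n).reset_index()


# Made-up rows; "Alpha", "Beta", and "Gamma" are fictional states.
sample = pd.DataFrame(
    {
        "state_name": ["Alpha", "Alpha", "Beta", "Beta", "Alpha", "Gamma"],
        "overall_outcome": ["Positive", "Negative", "Positive", "Negative", "Positive", "Negative"],
        "date": pd.to_datetime(
            ["2023-05-20", "2023-05-20", "2023-05-25", "2023-05-25", "2023-03-01", "2023-05-22"]
        ),
        "new_results_reported": [30, 70, 10, 90, 500, 40],
    }
)

ranking = top_positivity(sample, today=pd.Timestamp("2023-06-01"))
print(ranking)
```

The March row for "Alpha" falls before the cutoff and is ignored, and "Gamma" appears with a rate of zero because the join keeps states that reported no positive results.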
There are many caveats that should be addressed with this project. Please keep in mind the following notes when interpreting the results:
- Traditionally, the United States has 50 states. Although this data was provided by the United States government, there are 56 unique states listed in the dataset. The six additional entries are:
  - District of Columbia
  - Guam
  - Marshall Islands
  - Northern Mariana Islands
  - Puerto Rico
  - U.S. Virgin Islands
- Each of these entities has some level of political association with the United States, which is why they are included in this dataset.
- Many states within the United States are small in population size relative to other states. Similarly, the six extra "states" previously mentioned have very small population sizes. States with smaller populations are more prone to outliers or skewed data since there is less data to analyze. For example, the U.S. Virgin Islands has the highest positivity rate for tests performed in the last 30 days, but only 177 tests were performed in that time period, 50 of them positive. That gives a positivity rate of 28%, about 10 percentage points higher than the second highest state. Interestingly, most states in the top 10 for positivity rate have somewhat smaller populations, such as South Dakota, Guam, Wyoming, Hawaii, and New Mexico.
- COVID-19 was rampant in 2020 and 2021, so COVID testing sites were widespread, easily accessible, and strongly encouraged. As such, many Americans were getting tested proactively (especially because the virus could be present before symptoms even appear). This resulted in a lower positivity rate, since a large share of the country was testing even when they felt healthy. Now in 2023, the pandemic has waned and the majority of the population has been vaccinated. Testing has therefore been decreasing, and those who do test are more likely to test reactively rather than proactively, so recent positivity rates are likely not directly comparable to those from earlier in the pandemic.
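To make the small-sample caveat concrete, the arithmetic behind the U.S. Virgin Islands figure can be checked directly. The 50 and 177 counts are the ones quoted above. The "five more positives" scenario and the larger state with 1,000 times the volume are hypothetical, chosen only to show how much more a small denominator reacts to a handful of results.

```python
def positivity_rate(positive_tests: int, total_tests: int) -> float:
    """Positive tests divided by tests performed."""
    return positive_tests / total_tests


# Figures quoted above for the U.S. Virgin Islands.
usvi = positivity_rate(50, 177)

# Hypothetical: five additional positive results among the same 177 tests.
usvi_plus5 = positivity_rate(55, 177)

# Hypothetical larger state with the same underlying rate and 1,000x the volume.
large = positivity_rate(50_000, 177_000)
large_plus5 = positivity_rate(50_005, 177_000)

print(f"small sample: {usvi:.2%} -> {usvi_plus5:.2%}")
print(f"large sample: {large:.2%} -> {large_plus5:.2%}")
```

Five extra positives move the small-sample rate by almost three percentage points, while the same five results barely register in the larger sample.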
The repository has the following structure:
- `get_data.py`: This script pulls the data and puts it into `dataset.csv`.
- `dataset.csv`: This file is generated using the `get_data.py` script and contains the COVID-19 testing data.
- `analysis.ipynb`: This Jupyter Notebook contains the exploratory data analysis and answers to the project's questions.
- `requirements.txt`: This file lists the project dependencies.
To reproduce the project results, follow these steps:
- Run the `get_data.py` script to pull the data and generate the `dataset.csv` file.
- Install the required dependencies specified in the `requirements.txt` file by running the following command: `pip install -r requirements.txt`
- Run the `analysis.ipynb` notebook in Jupyter Notebook or JupyterLab. This notebook contains the data analysis and addresses the project's metrics.
The data used for this project is provided by the US Department of Health and Human Services and can be found here.


