Create Data-wrangling.ipynb

shsarv · shsarv · commit 846799533124 · 2020-11-23T00:50:35.000+05:30
diff --git a/COVID19/notebooks/Data-wrangling.ipynb b/COVID19/notebooks/Data-wrangling.ipynb
@@ -0,0 +1,181 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Goal\n",
+    "My goal is to visualize various aspects of the `COVID-19` pandemic. In this notebook I describe how the data is acquired and processed."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Data sources"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "| Link | Source |\n",
+    "|------|--------|\n",
+    "| https://github.com/CSSEGISandData/COVID-19 | JHU CSSE |\n",
+    "| [GDP per capita PPP](https://data.worldbank.org/indicator/NY.GDP.PCAP.PP.CD) | The World Bank |\n",
+    "| [Population](https://data.worldbank.org/indicator/SP.POP.TOTL) | The World Bank |\n",
+    "| [Urban population](https://data.worldbank.org/indicator/SP.URB.TOTL.IN.ZS) | The World Bank |\n",
+    "| [Population living in slums](https://data.worldbank.org/indicator/EN.POP.SLUM.UR.ZS) | The World Bank |\n",
+    "| [Rural population](https://data.worldbank.org/indicator/SP.RUR.TOTL.ZS) | The World Bank |\n",
+    "| [Life expectancy at birth](https://data.worldbank.org/indicator/SP.DYN.LE00.IN) | The World Bank |\n",
+    "| [Current healthcare expenditure](https://data.worldbank.org/indicator/SH.XPD.CHEX.GD.ZS) | The World Bank |\n",
+    "| https://datahub.io/JohnSnowLabs/country-and-continent-codes-list | Datahub |"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The process of obtaining the data has been automated. See the `src/data` directory."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Data wrangling"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## COVID-19"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Original data"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This dataset is downloaded from the JHU CSSE `COVID-19` repository on GitHub.\n",
+    "The `COVID-19` case data is stored in `.csv` files in which each region has a separate row. We group the data by country and store each country in a separate column. Cases that occurred on boats are removed from the data.\n",
+    "\n",
+    "See the script `src/features/make_cases.py` for details."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Derived data"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "From the original `COVID-19` case data we calculate the following:\n",
+    "\n",
+    "* `mortality rate = dead / confirmed`\n",
+    "* `active cases = confirmed - recovered - dead`\n",
+    "\n",
+    "We also extract a list of countries and apply first-order differencing to `confirmed` (each day's value minus the previous day's) to obtain the `daily change in cases` for each country."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## World Bank data"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The World Bank data is downloaded using the `wbdata` library. It includes indicators such as `Life expectancy` and `GDP per capita`. For each country, we extract the most recent available value of each indicator.\n",
+    "\n",
+    "See the script `src/features/make_world_bank.py` for details."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Continents"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To analyse the data by continent, we download a list of countries with their continents and a list of countries with their respective three-letter codes.\n",
+    "\n",
+    "See the script `src/features/make_continent.py` for details."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Summary"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "After preparing, cleaning and joining the downloaded datasets, we store the newly created `.csv` files in the `data/processed` directory for further use. The table below briefly describes the contents of each file."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "| name | description |\n",
+    "|------|-------------|\n",
+    "| active_cases.csv | Calculation: `confirmed` - `recovered` - `dead`. |\n",
+    "| confirmed_cases.csv | Time series of confirmed cases from JHU CSSE. |\n",
+    "| confirmed_cases_daily_change.csv | Daily change in confirmed cases, derived from JHU CSSE. |\n",
+    "| confirmed_cases_since_t0.csv | Reindexed time series of confirmed cases. |\n",
+    "| continents.csv | Countries mapped to continents. |\n",
+    "| coordinates.csv | Country coordinates. |\n",
+    "| country_stats.csv | Most recent available case data by country. |\n",
+    "| country_to_continent.csv | A mapping of countries to continents. |\n",
+    "| dead_cases.csv | Time series of fatalities from JHU CSSE. |\n",
+    "| mortality_rate.csv | Calculation: `dead` / `confirmed`, derived from JHU CSSE. |\n",
+    "| recovered_cases.csv | Time series of recovered cases from JHU CSSE. |\n",
+    "| world_bank.csv | Socioeconomic indicators from the World Bank merged with `COVID-19` data. |\n",
+    "| world_bank_codes.csv | Three-letter country codes from the World Bank. |"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.2"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}