Merge pull request #63 from 4DModeller/Data_preprocess_Tut

Add a new data pre-processing tutorial
4DModeller · Jul 21, 2023 · 5226607 · 5226607
2 parents d8ca4fa + b093bbe
commit 5226607
Showing 1 changed file with 75 additions and 0 deletions.
diff --git a/vignettes/data_preprocessing.Rmd b/vignettes/data_preprocessing.Rmd
@@ -0,0 +1,75 @@
+---
+title: "Data preprocessing"
+output: 
+  bookdown::html_document2:
+    base_format: rmarkdown::html_vignette
+    fig_caption: yes
+bibliography: "covid.bib"
+link-citations: yes
+vignette: >
+  %\VignetteIndexEntry{Data preprocessing}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+
+# Aim and data description
+
+In this tutorial, we will provide a short tutorial about data pre-processing, which involves transforming raw data into the desired format and object for running the Bayesian Hierarchical Model (BHM) in the fdmr package. To illustrate the process, we will use COVID-19 infection data as a practical example.
+
+In the [COVID-19 tutorial](https://4dmodeller.github.io/fdmr/articles/covid.html), we aim to fit a Bayesian spatio-temporal model to predict the COVID-19 infection rates across mainland England over space and time, and investigate the impacts of socioeconomic, demographic and environmental factors on COVID-19 infection. The study region is mainland England, which is partitioned into 6789 Middle Layer Super Output Areas (MSOAs). The raw shapefile of the study region is obtained from the [ONS Open Geography Portal](https://geoportal.statistics.gov.uk/datasets/ons::msoa-dec-2011-boundaries-super-generalised-clipped-bsc-ew-v3/explore?location=52.782096%2C-2.465779%2C7.81), which stores the location, shape and attributes of geographic features for the MSOAs.
+
+First we'll retrieve the tutorial dataset `preprocessing`.
+
+```{r}
+fdmr::retrieve_tutorial_data(dataset = "preprocessing")
+```
+
+Then we load in the shapefile into R, and store it in an object named `sp_data`.
+
+```{r shapefile, warning=FALSE,error=FALSE}
+shapefilepath <- fdmr::get_tutorial_datapath(dataset = "preprocessing", filename = "MSOA_(Dec_2011)_Boundaries_Super_Generalised_Clipped_(BSC)_EW_V3.shp")
+sp_data <- rgdal::readOGR(dsn = shapefilepath)
+```
+
+The type of the object `sp_data` is a SpatialPolygonsDataFrame. 
+
+```{r classsp}
+class(sp_data)
+```
+
+Then we retrieve the projection attributes of the shapefile, i.e., `sp_data`, and transform it from its original coordinate reference system (CRS) to a new CRS, which is the World Geodetic System 1984 (WGS84).
+
+```{r transCRS}
+sp::proj4string(sp_data)
+sp_data <- sp::spTransform(sp_data, sp::CRS("+proj=longlat +datum=WGS84 +no_defs"))
+```
+
+In the COVID-19 tutorial, the raw COVID-19 infections data and the related covariate data were obtained from the official UK Government COVID-19 dashboard and the Office for National Statistics (ONS). The data were initially downloaded, organised and saved in a CSV file format. The CSV file can be imported into R using the `utils::read.csv()` function. 
+
+
+```{r loadCOVIDat}
+covid19_data_filepath <- fdmr::get_tutorial_datapath(dataset = "preprocessing", filename = "covid19_data.csv")
+covid19_data <- utils::read.csv(file = covid19_data_filepath)
+```
+
+The type of the object `covid19_data` is a data.frame. 
+
+
+```{r classdat}
+class(covid19_data)
+```
+
+The first 6 rows of the data set can be viewed using the following code
+
+```{r checkdat}
+utils::head(covid19_data)
+```
+
+The data frame contains 23 columns. `MSOA11CD` represents the spatial identifier for each data observation. Variable `cases` is the response variable, which is the weekly reported number of COVID-19 cases in each of the 6789 MSOAs in main England over the period from 2022-01-01 to 2022-03-26. Variable `date` indicates the start date of each observation week when the COVID-19 infections data for each MSOA were reported. Variable `week` indicates the week index number that each data observation was collected from. Columns `LONG` and `LAT` indicate the longitude and latitude for each MSOA. Variable `Population` indicates the population size for each MSOA. The remaining columns store the data for each covariate in each MSOA and week. 
+
+
+Therefore, the expected observation and measurement data format for a spatio-temporal Bayesian hierarchical model as in the COVID-19 tutorial should be a data frame that includes one column for the response data (e.g., `cases`), two columns for the spatial location of each observation (e.g., `LONG` and `LAT`), and one column containing time point indices indicating when each observation was collected (e.g., `week` = 1, 2, ...). If the model incorporates covariates, then the covariate data should also be included in the same data frame, and each covariate is stored in one column. Users can use any variable names for the columns, as long as they ensure consistency with those used when defining the model formula and fitting the model.  
+
+
+With `sp_data` and `covid19_data` in the expected data object and format, we now possess all the essential information required for the fitting the BHM and visualising the results. More details regarding the model fitting process can be found at in the [COVID-19 tutorial](https://4dmodeller.github.io/fdmr/articles/covid.html).