This is my collection of R scripts and supplementary files making up my solution to the project in the "Getting and Cleaning Data" unit of the "Data Science" course run by Johns Hopkins University on Coursera.
The solution is a system of two script files, driven by `run_analysis.R`, that will download and extract raw data from a set of text files available on the internet. These files can be downloaded as a .zip file from https://d396qusza40orc.cloudfront.net/getdata_data_ss06hid.zip.
The `run_analysis.R` script will automatically download this file and extract the following text files into the `raw` subdirectory before running the main part of the analysis.
- `activity_labels.txt`
- `features.txt`
- `features_info.txt`
- `subject_test.txt`
- `subject_train.txt`
- `X_test.txt`
- `X_train.txt`
- `y_test.txt`
- `y_train.txt`
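For illustration, the download-and-extract step looks roughly like the sketch below; the actual implementation lives in `utilities.R`, and the variable names here are illustrative only.

```r
# Minimal sketch of the download-and-extract step; the real logic lives
# in utilities.R and these names are illustrative.
zip_url  <- "https://d396qusza40orc.cloudfront.net/getdata_data_ss06hid.zip"
zip_file <- "getdata_data_ss06hid.zip"

if (!dir.exists("raw")) dir.create("raw")
if (!file.exists(zip_file)) download.file(zip_url, zip_file, mode = "wb")

# junkpaths = TRUE drops the zip's internal folder structure so the text
# files land directly in the raw subdirectory.
unzip(zip_file, exdir = "raw", junkpaths = TRUE)
```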
The files `features.txt` and `features_info.txt` constitute the codebook for this data.
On successful completion of `run_analysis.R`, the following two comma-delimited text files will exist in the `data` directory:

- `cleaned_data.csv` (and `cleaned_data.RData`) - contains the full cleansed data set
- `summary_means.csv` (and `summary_means.RData`) - contains a summary of the means of the above data set by activity and subject
Please see the codebook included in the repository for details on these two files.
To run this system you will need the following:
- An installed version of the R statistical system (preferably v3 or later)
- The 'plyr' R package by Hadley Wickham installed in your version of R
- The two .R script files located in this repository: `run_analysis.R` and `utilities.R`
- A reliable Internet connection
- Access permission to create subfolders in your working directory (a quick prerequisite check is sketched below)
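You can verify these prerequisites from the R prompt before running the scripts; this is a convenience snippet, not part of the scripts themselves.

```r
# Convenience check of the prerequisites (not part of run_analysis.R).
if (!requireNamespace("plyr", quietly = TRUE)) {
  install.packages("plyr")
}
stopifnot(file.exists("run_analysis.R"),
          file.exists("utilities.R"))
```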
`run_analysis.R` consists of the functions that undertake the analysis itself.
`utilities.R` consists of functions that support the analysis process, such as downloading and extracting files.
The decision was made to separate the support functions into their own script file as it would be easier to reuse the code at a later stage.
Make sure that `run_analysis.R` and `utilities.R` are in the working directory, then type the following command at the prompt and press the Enter key:
`source('run_analysis.R'); run_analysis()`
The application script will then connect to the Internet, download and extract the files, finishing off by analysing the data and creating the data files mentioned above in the `data` directory.
You will see a series of information messages displayed on the screen while the scripts are undertaking their tasks. You will see a confirmation message at the end of processing.
On running the command `run_analysis()` at the prompt in your working directory, the scripts will:

1. Check to see if you have all the raw data files already extracted in the `raw` subdirectory
2. If the answer to step 1 is yes, skip to step 4
3. If files are missing:
   - 3.1 Create the `raw` and `data` subdirectories if they don't exist
   - 3.2 Archive any existing raw data files (on confirmation from the user)
   - 3.3 Download `getdata_data_ss06hid.zip` from the Internet website
   - 3.4 Extract the needed raw data files into the `raw` subdirectory
4. If all is okay to continue, read each of the raw data files into its own separate data frame
5. Create temporary variable names based on the details included in the `features.txt` file; this makes it easier to manipulate the data later in the analysis
6. Merge the test and training raw data files into one with the subject code, and give the combined data set the temporary variable names
7. Extract the mean and standard deviation columns and give them new, meaningful variable names
8. Write the cleansed data files out into the `data` subdirectory
9. Calculate the average (mean) of each variable and write the summarised data set out into the `data` directory (steps 7 and 9 are sketched after this list)
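As an illustration of steps 7 and 9, the column extraction and summarisation can be sketched as follows. The `features`, `merged`, and `cleaned_data` names are assumptions made for this sketch; the actual names in `run_analysis.R` may differ.

```r
# Illustrative sketch of steps 7 and 9; 'features' is assumed to hold the
# feature names read from features.txt and 'merged' the combined data set.
library(plyr)

# Step 7: keep only the mean() and std() measurement columns.
wanted <- grepl("mean\\(\\)|std\\(\\)", features$name)
cleaned_data <- merged[, c("subject", "activity", features$name[wanted])]

# Step 9: average every numeric measurement by activity and subject.
summary_means <- ddply(cleaned_data, .(activity, subject), numcolwise(mean))
write.csv(summary_means, file.path("data", "summary_means.csv"), row.names = FALSE)
```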
The decision was made to put the raw data and the cleansed data into their own subdirectories as this is good software engineering practice. It keeps the processed data separate from the raw data.
This application has a number of issues that should be resolved before it could be considered "production ready". Time constraints have meant that it has not been possible to resolve these issues.
Notably, these include:
- Greater error-checking support
- Performance could be improved.
- The data could be made tidier. I was looking to extract the method of analysis (mean / standard deviation) into its own factor variable, as well as the type of analysis done (i.e. time domain, fast Fourier, angle based) into its own factor variable. I was investigating the use of the `reshape2` package from Hadley Wickham to do this, and whilst the melt function got me to the intermediate stage successfully, I was not able to cast it back into a wide data frame. Despite following several examples online, dcast always wanted to aggregate the data before the stage I required in the processing. The code for this has been left in an unmerged git branch in an incomplete state (a sketch of the issue follows below).
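For reference, a hedged sketch of that melt/dcast attempt is below; the column names are illustrative, not the ones actually used. One common reason `dcast` aggregates is that the formula's id variables do not uniquely identify rows, which an explicit per-group row counter can work around.

```r
# Sketch of the attempted melt/dcast (the real attempt sits in an
# unmerged branch); column names here are illustrative.
library(reshape2)

long <- melt(cleaned_data, id.vars = c("activity", "subject"))

# dcast aggregates whenever the id variables do not uniquely identify
# rows; a per-group row counter is one common workaround.
long$obs <- ave(seq_len(nrow(long)), long$activity, long$subject,
                long$variable, FUN = seq_along)
wide <- dcast(long, activity + subject + obs ~ variable)
```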