The data can be found at the following URL: https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip
Before executing the program, the zipped data file must be downloaded and unzipped into a directory.
Run_analysis.R is then executed from the directory containing the unzipped data file.
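
The download and unzip step can also be scripted. The following is a minimal sketch, assuming R has network access; the helper name `fetch_har_data`, the variable `har_url`, and the `data_dir` default are hypothetical and are not part of Run_analysis.R.

```r
# Hypothetical helper (not part of Run_analysis.R): download the zipped
# dataset and unpack it into data_dir. Names and defaults are assumptions.
har_url <- paste0(
  "https://d396qusza40orc.cloudfront.net/",
  "getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip"
)

fetch_har_data <- function(url = har_url, data_dir = ".") {
  zip_path <- file.path(tempdir(), "har_dataset.zip")
  # mode = "wb" keeps the binary zip file intact on Windows
  download.file(url, destfile = zip_path, mode = "wb")
  # unzip() returns the paths of the extracted files
  unzip(zip_path, exdir = data_dir)
}

# Usage (requires a network connection):
# extracted <- fetch_har_data()
```
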
The unzipped input files are described as follows:
- subject_test.txt and subject_train.txt: lists of subject IDs used to match a subject to a set of observations.
- y_test.txt and y_train.txt: lists of activity IDs that match each observation to an activity.
- X_test.txt and X_train.txt: the recorded data for each subject and activity. Each row holds 561 recorded measurements.
- features.txt: a list of the 561 measurement names used to title the columns of the recorded data.
- activity_labels.txt: a list of meaningful names for the activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING).
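
As an illustration of how the input files fit together, the sketch below reads the three files of one split into data frames. It assumes the layout of the published archive, where each split directory holds `subject_<split>.txt`, `y_<split>.txt`, and `X_<split>.txt`; the helper `read_har_split` and the throwaway demo directory are hypothetical.

```r
# Sketch: read one split ("test" or "train") into a list of data frames.
# read_har_split is a hypothetical helper; the file layout is an assumption
# based on the published archive (<dir>/<split>/X_<split>.txt and so on).
read_har_split <- function(data_dir, split) {
  path <- function(prefix) {
    file.path(data_dir, split, paste0(prefix, "_", split, ".txt"))
  }
  list(
    subject = read.table(path("subject")),  # one subject ID per row
    y       = read.table(path("y")),        # one activity ID per row
    x       = read.table(path("X"))         # 561 measurements per row in the real data
  )
}

# Self-check on a throwaway directory holding two fake observations
demo_dir <- file.path(tempdir(), "har_demo")
dir.create(file.path(demo_dir, "test"), recursive = TRUE, showWarnings = FALSE)
writeLines(c("2", "4"), file.path(demo_dir, "test", "subject_test.txt"))
writeLines(c("5", "1"), file.path(demo_dir, "test", "y_test.txt"))
writeLines(c("0.1 0.2 0.3", "0.4 0.5 0.6"),
           file.path(demo_dir, "test", "X_test.txt"))

demo <- read_har_split(demo_dir, "test")
```

Each element of `demo` has one row per observation, so the three data frames can later be bound side by side.
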
The program writes its output to the file tidyData.txt. Further details of the contents of this file can be found in the codebook.md file on this site.
Program execution is described in the following steps:
- For both the test and train data, read the X, y, and subject text files into temporary data frames.
- From the features.txt file, read in the column names that will be applied to the merged test and train data.
- Clean up the column names as follows:
  - Change "-" and "," characters into underscores so that these characters cannot cause problems in calculations.
  - In later stages the mean and standard deviation columns will be isolated from the rest of the data. In preparation for that, any column with "mean" or "std" anywhere in its name has that occurrence removed and "MEAN" or "STD" placed at the start of the name.
    **Please note that, as a design decision, I select every occurrence of "mean" or "std" in the column names. This results in 88 columns of data.**
  - Remove all "()"s.
  - Remove surplus underscores. This cleans up the extra underscores created earlier and eliminates trailing underscores that the column name manipulations may have left behind.
- Create two new column names, SubjectID and ActivityID, which will be used in the merged data.
- Combine the X, subject, and y data frames for both the training and test data sets into new data frames.
- Replace the default "Vx" column names with the descriptive versions created above.
- Find all occurrences of "std" or "mean" in the column names and create a new data frame that contains only those columns.
- Replace the activity ID numbers with the descriptive names gathered from the activity_labels.txt file.
- Create a narrow data set keyed on the SubjectID and ActivityID columns. This adds two columns: "variable", which holds the names of all the remaining (measurement) columns, and "value", which holds the corresponding values.
- Create a wide data set in which each combination of SubjectID and ActivityID has one row containing all the measured variables. The mean of all the observations for each variable is taken at this step.
- Output the resulting data set to "tidyData.txt" in the current working directory.
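
The renaming, merging, selection, and labelling steps above can be sketched on toy data as follows. This is an approximation under stated assumptions, not the code in Run_analysis.R: the helper `clean_feature_names` is hypothetical, it matches "mean" and "std" case-insensitively, and it adds the `MEAN_`/`STD_` prefix last so that the result matches names such as `MEAN_tBodyAccX`. Only one split is shown; the full program also combines the test and train rows.

```r
# Hypothetical helper approximating the column-name cleanup described above.
clean_feature_names <- function(x) {
  x <- gsub("[-,]", "_", x)                      # "-" and "," become "_"
  x <- gsub("()", "", x, fixed = TRUE)           # drop "()"
  is_mean <- grepl("mean", x, ignore.case = TRUE)
  is_std  <- grepl("std", x, ignore.case = TRUE)
  x <- gsub("mean|std", "", x, ignore.case = TRUE)  # remove the occurrence
  x <- gsub("_", "", x, fixed = TRUE)            # remove leftover underscores
  prefix <- ifelse(is_mean, "MEAN_", ifelse(is_std, "STD_", ""))
  paste0(prefix, x)
}

raw_names <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-Y", "tBodyAcc-max()-Z")
col_names <- c("SubjectID", "ActivityID", clean_feature_names(raw_names))

# Toy stand-ins for the subject, y, and X data frames of one split
subject <- data.frame(V1 = c(1, 1, 2))
y       <- data.frame(V1 = c(6, 4, 6))
x       <- data.frame(V1 = c(0.2, 0.3, 0.1),
                      V2 = c(-0.9, -0.8, -0.7),
                      V3 = c(1, 2, 3))

# Bind the columns side by side, then apply the descriptive names
merged <- cbind(subject, y, x)
names(merged) <- col_names

# Keep the two ID columns plus every MEAN/STD column
keep <- grepl("^(SubjectID|ActivityID)$|MEAN|STD", names(merged))
stdMeanData <- merged[, keep]

# Replace activity numbers with labels (order as listed for activity_labels.txt)
activity_labels <- data.frame(
  id = 1:6,
  name = c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS",
           "SITTING", "STANDING", "LAYING"),
  stringsAsFactors = FALSE
)
stdMeanData$ActivityID <-
  activity_labels$name[match(stdMeanData$ActivityID, activity_labels$id)]
```
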
Hadley Wickham's definition of a tidy data set gives three important elements to consider in the design:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table. (1)
In the resulting data set described above, the third element is the easiest one to demonstrate, since the data is by definition the result of a single type of observational unit. It was all collected from the same set of experiments conducted on multiple test subjects, all performing the same set of activities.
The command

    meltedTable <- melt(stdMeanData, id.vars = c("SubjectID", "ActivityID"))

creates a new data frame that contains all the original data in a narrow format, with columns for subject, activity, variable, and value. The variable column holds the names of all the remaining columns, which represent the measurements. The value column holds the corresponding values of those measurements.
The command

    summaryTable <- dcast(meltedTable, SubjectID + ActivityID ~ variable, mean)

creates a wide version of meltedTable and averages the values of each variable for every combination of SubjectID and ActivityID.
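
To make the reshaping concrete, here is a small sketch on made-up numbers. It assumes the reshape2 package is installed; the toy `stdMeanData` and the `aggregate()` cross-check are illustrations only and do not come from Run_analysis.R.

```r
library(reshape2)  # provides melt() and dcast()

# Toy stand-in for stdMeanData: subject 1 has two LAYING rows and one SITTING row
stdMeanData <- data.frame(
  SubjectID      = c(1, 1, 1),
  ActivityID     = c("LAYING", "LAYING", "SITTING"),
  MEAN_tBodyAccX = c(0.20, 0.30, 0.26),
  STD_tBodyAccX  = c(-0.90, -0.70, -0.98),
  stringsAsFactors = FALSE
)

# Narrow form: 3 rows x 2 measurement columns give 6 rows
meltedTable <- melt(stdMeanData, id.vars = c("SubjectID", "ActivityID"))

# Wide form: one row per (SubjectID, ActivityID) pair, values averaged
summaryTable <- dcast(meltedTable, SubjectID + ActivityID ~ variable, mean)

# Base R cross-check of the same averages (not used by the original program)
check <- aggregate(. ~ SubjectID + ActivityID, data = stdMeanData, FUN = mean)
```

In this toy case the two LAYING rows collapse into one row whose values are the averages of the two inputs.
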
The following excerpt of the output illustrates the tidy aspect of the data. Each combination of SubjectID and ActivityID has one row holding all the corresponding measurement variables.
| SubjectID | ActivityID | MEAN_tBodyAccX | MEAN_tBodyAccY |
|---|---|---|---|
| 1 | LAYING | 0.2215982 | -0.040513953 |
| 1 | SITTING | 0.2612376 | -0.001308288 |
| 1 | STANDING | 0.2789176 | -0.016137590 |
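
One property of this layout can be checked mechanically: no combination of SubjectID and ActivityID should appear in more than one row. The sketch below runs such a check on the three rows shown above; the helper `has_unique_keys` is hypothetical and is not part of Run_analysis.R.

```r
# Hypothetical check: TRUE when no combination of the key columns
# appears more than once in the data frame.
has_unique_keys <- function(df, keys = c("SubjectID", "ActivityID")) {
  !any(duplicated(df[, keys, drop = FALSE]))
}

# The three rows from the table above
excerpt <- data.frame(
  SubjectID      = c(1, 1, 1),
  ActivityID     = c("LAYING", "SITTING", "STANDING"),
  MEAN_tBodyAccX = c(0.2215982, 0.2612376, 0.2789176),
  MEAN_tBodyAccY = c(-0.040513953, -0.001308288, -0.016137590),
  stringsAsFactors = FALSE
)

tidy_ok <- has_unique_keys(excerpt)                       # each pair occurs once
dup_ok  <- has_unique_keys(rbind(excerpt, excerpt[1, ]))  # LAYING row repeated
```
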
(1) Wickham, Hadley. Tidy Data, Journal of Statistical Software. Volume VV, Issue II