DataCleaning

Running the program

The data can be found at the following URL: https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip

Before executing the program, the zipped data file must be downloaded and unzipped into a directory.

Then execute run_analysis.R from the directory containing the unzipped data.
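The download and unzip steps above could be scripted roughly as follows. This is only a sketch: the helper `fetch_data` and the local file name `har_dataset.zip` are illustrative assumptions, not part of run_analysis.R.

```r
# Sketch only: fetch_data() and "har_dataset.zip" are hypothetical names.
data_url <- "https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip"
zip_file <- "har_dataset.zip"

fetch_data <- function(url = data_url, dest = zip_file, exdir = ".") {
  # Download the archive only if it is not already present.
  if (!file.exists(dest)) {
    download.file(url, destfile = dest, mode = "wb")
  }
  # Extract the archive into exdir and return the extracted file paths.
  unzip(dest, exdir = exdir)
}

# Typical use (not run here because it needs network access):
# fetch_data()
# source("run_analysis.R")
```

Binary mode (`mode = "wb"`) matters on Windows, where a text-mode download can corrupt a zip archive.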

Structure of Database Files

The unzipped input files are described as follows:

subject_test.txt and subject_train.txt: lists of subject IDs used to match a subject to a set of observations.
y_test.txt and y_train.txt: lists of activity IDs that match each observation to the activity being performed.
X_test.txt and X_train.txt: the recorded data for each subject and activity. Each record consists of 561 measurements.
features.txt: a list of the 561 measurement names used to title the columns of the recorded data.
activity_labels.txt: a list of meaningful activity names (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING).
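As an illustration of how these files fit together, the sketch below reads one split (test or train) and binds the subject, activity, and measurement columns side by side. The helper name `read_split` and the `base_dir` argument are assumptions made for this example. The file names follow the layout of the UCI HAR archive, where each split lives in its own folder (for example `test/X_test.txt`). The actual script may read the files differently.

```r
# Hypothetical helper, not taken from run_analysis.R.
# base_dir: path to the unzipped data set; split: "test" or "train".
read_split <- function(base_dir, split) {
  path <- function(prefix) {
    file.path(base_dir, split, paste0(prefix, "_", split, ".txt"))
  }
  x <- read.table(path("X"))              # 561 measurement columns (V1, V2, ...)
  y <- read.table(path("y"))              # one activity ID per row
  subject <- read.table(path("subject"))  # one subject ID per row
  names(subject) <- "SubjectID"
  names(y) <- "ActivityID"
  cbind(subject, y, x)
}

# Combining both splits into one data frame (assumes the data is present):
# merged <- rbind(read_split("UCI HAR Dataset", "test"),
#                 read_split("UCI HAR Dataset", "train"))
```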

The output is written to the file tidyData.txt. Further details of its contents can be found in the CodeBook.md file in this repository.

Program Details

Program execution is described in the following steps:

For both the test and train data, read the x, y, and subject text files into temporary data frames.
Read from features.txt the column names that will be applied to the merged test and train data.
Clean up the column names as follows:
Change "-" and "," characters into underscores so that these characters do not cause problems in later calculations.
In later stages the mean and standard deviation columns are isolated from the rest of the data. In preparation for that, for any column with "mean" or "std" anywhere in its name, remove that occurrence and place "MEAN" or "STD" at the start of the column name.

**Please note that, as a design decision, I select every column whose name contains "mean" or "std". This results in 88 columns of data.**

Remove all occurrences of "()".
Remove the extra underscores created earlier, including any trailing underscores left over from the column name manipulations.
Create two new column names, SubjectID and ActivityID, which will be used in the merged data.
Combine the X, subject, and Y data frames for both the training and test data sets into new data frames.
Replace the default "Vx" column names with the descriptive versions created above.
Find all occurrences of "std" or "mean" in the column names and create a new data frame that contains only those columns.
Replace the activity ID numbers with the descriptive names from the activity_labels.txt file.
Create a narrow data set keyed on the SubjectID and ActivityID columns. This adds two columns: "variable", which holds the names of the remaining (measurement) columns, and "value", which holds the corresponding values.
Create a wide data set so that each combination of SubjectID and ActivityID has one row containing all the measured variables. The mean of each variable is taken at this step.
Write the resulting data set to "tidyData.txt" in the current working directory.
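The column name cleanup, column selection, and activity labelling steps above can be sketched on a small toy data frame. Everything here is illustrative. `clean_feature_names` is a hypothetical helper, the toy values are made up, and the order of the `gsub` calls was chosen to reproduce names of the form `MEAN_tBodyAccX`, so it may differ from what run_analysis.R actually does. The case-insensitive match on "mean" and "std" is also an assumption.

```r
# Hypothetical helper: approximates the naming rules described above.
clean_feature_names <- function(x) {
  x <- gsub("[-,]", "_", x)               # "-" and "," become underscores
  x <- gsub("()", "", x, fixed = TRUE)    # drop literal "()"
  is_mean <- grepl("mean", x, ignore.case = TRUE)  # assumption: ignore case
  is_std <- grepl("std", x, ignore.case = TRUE)
  x <- gsub("mean|std", "", x, ignore.case = TRUE) # remove the occurrence
  x <- gsub("_", "", x, fixed = TRUE)     # drop leftover underscores
  prefix <- ifelse(is_mean, "MEAN_", ifelse(is_std, "STD_", ""))
  paste0(prefix, x)
}

raw_names <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-Y", "tBodyAcc-max()-Z")
col_names <- c("SubjectID", "ActivityID", clean_feature_names(raw_names))

# Toy stand-in for the merged test and train data (values are invented).
merged <- data.frame(c(1, 2), c(5, 1), c(0.1, 0.2), c(0.3, 0.4), c(0.5, 0.6))
names(merged) <- col_names

# Keep the ID columns plus every MEAN_ or STD_ column.
keep <- col_names %in% c("SubjectID", "ActivityID") |
  grepl("^(MEAN|STD)_", col_names)
stdMeanData <- merged[, keep]

# Activity IDs 1 to 6 index into the labels from activity_labels.txt.
activity_labels <- c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS",
                     "SITTING", "STANDING", "LAYING")
stdMeanData$ActivityID <- activity_labels[stdMeanData$ActivityID]
```

In this toy example the `max` column is dropped, and the activity IDs 5 and 1 become STANDING and WALKING.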

Is the resulting data set a tidy data set?

From the definition of a tidy data set given by Hadley Wickham, there are three important elements to consider in design:

1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table. (1)

In the resulting data set described above, item 3 is the easiest to demonstrate, since the data is by definition the result of a single type of observation: it was all collected from the same set of experiments conducted on multiple test subjects, all performing the same set of activities.

The command,

    meltedTable <- melt(stdMeanData, id.vars = c("SubjectID", "ActivityID"))

creates a new data frame that contains all the original data in a narrow format, with columns for subject, activity, variable, and value. The variable field holds the names of the remaining columns, which represent the observations. The value field holds the corresponding values of those observations.

The command,

    summaryTable <- dcast(meltedTable, SubjectID + ActivityID ~ variable, mean)

creates a wide version of meltedTable and averages each measurement variable for every combination of SubjectID and ActivityID.
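To see what the two commands do, here is a small self-contained run on invented data. It assumes the reshape2 package is installed, since that package provides `melt` and `dcast`. The toy values are not from the real data set.

```r
library(reshape2)  # assumed to be installed; provides melt() and dcast()

# Toy stand-in for stdMeanData: subject 1 has two LAYING observations.
stdMeanData <- data.frame(
  SubjectID = c(1, 1, 1, 2),
  ActivityID = c("LAYING", "LAYING", "SITTING", "LAYING"),
  MEAN_tBodyAccX = c(0.25, 0.75, 0.5, 1.0),
  STD_tBodyAccX = c(-1, -0.5, -0.25, -0.75)
)

# Narrow form: one row per (subject, activity, variable) measurement.
meltedTable <- melt(stdMeanData, id.vars = c("SubjectID", "ActivityID"))

# Wide form: one row per (subject, activity), each variable averaged.
summaryTable <- dcast(meltedTable, SubjectID + ActivityID ~ variable, mean)
```

The narrow table has 8 rows (4 observations times 2 variables). The wide table has 3 rows, and the two LAYING observations for subject 1 collapse into a single row holding their averages (0.5 and -0.75).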

The following table illustrates the tidy aspect of the data. Each combination of SubjectID and ActivityID has one row holding all the corresponding observation variables.

| SubjectID | ActivityID | MEAN_tBodyAccX | MEAN_tBodyAccY |
|---|---|---|---|
| 1 | LAYING | 0.2215982 | -0.040513953 |
| 1 | SITTING | 0.2612376 | -0.001308288 |
| 1 | STANDING | 0.2789176 | -0.016137590 |

References

(1) Wickham, Hadley. Tidy Data. Journal of Statistical Software, Volume 59, Issue 10, 2014.
