rgdk/GettingAndCleaningData

# An explanation of the run_analysis.R script

## Part 1

### Here the training and test sets are extracted into separate data frames:

  • The test and training activity data sets are read into the test_data_activities and train_data_activities data frames
  • The single column in each of the test_data_activities and train_data_activities data frames is renamed to 'activity_id'
  • The test and training subject data are then read into the test_data_subject and train_data_subject data frames
  • The single column in each of the test_data_subject and train_data_subject data frames is renamed to 'subject_id'
  • The activity column labels are extracted and stored within the activity_labels data frame
  • The features column labels are extracted and stored within the features data frame
  • The columns in the activity_labels data frame are renamed to something more meaningful (activity_id and activity)
  • The columns in the features data frame are renamed to something more meaningful (feature_id and feature)
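The Part 1 steps can be sketched roughly as follows. This is a minimal illustration, not the script itself: the README does not show the file paths, so the raw files are replaced here by small inline text snippets passed to `read.table(text = ...)`, and the activity and feature values are made-up toy data.

```r
# Toy stand-ins for the raw files (assumption: the real script reads from disk).
test_data_activities  <- read.table(text = "3\n1\n2", header = FALSE)
train_data_activities <- read.table(text = "2\n3", header = FALSE)
names(test_data_activities)  <- "activity_id"
names(train_data_activities) <- "activity_id"

test_data_subject  <- read.table(text = "2\n2\n4", header = FALSE)
train_data_subject <- read.table(text = "1\n3", header = FALSE)
names(test_data_subject)  <- "subject_id"
names(train_data_subject) <- "subject_id"

# Lookup tables: one row per activity and one row per feature (toy values).
activity_labels <- read.table(text = "1 WALKING\n2 SITTING\n3 LAYING",
                              header = FALSE, stringsAsFactors = FALSE)
features <- read.table(text = "1 tBodyAcc-mean()-X\n2 tBodyAcc-std()-X",
                       header = FALSE, stringsAsFactors = FALSE)

# Replace the default V1/V2 names with meaningful ones.
names(activity_labels) <- c("activity_id", "activity")
names(features)        <- c("feature_id", "feature")
```

In real use, the `text =` arguments would be replaced by the paths to the corresponding files.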

## Part 2

### Here, descriptive activity names are used to name the activities in the data set

  • A column that contains the test activity description based on the activity_id is added to the test activity data
  • Then the id column is removed
  • A column that contains the training activity description based on the activity_id is added to the training activity data
  • Then the id column is removed
  • The test_data_subject data frame is merged with the test_data frame
  • The train_data_subject data frame is merged with the train_data frame
  • The activities data frame is then merged with the test_data frame
  • The activities data frame is also merged with the train_data frame
  • The test and train data frames are then concatenated
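A rough sketch of the Part 2 steps on toy data is shown below. The README does not say which functions perform the lookup and the merges, so this sketch assumes `match()` for the label lookup (it preserves row order) and `cbind()`/`rbind()` for combining; the helper `add_activity_label` is a hypothetical name introduced only for illustration.

```r
activity_labels <- data.frame(activity_id = 1:3,
                              activity = c("WALKING", "SITTING", "LAYING"),
                              stringsAsFactors = FALSE)

# Toy test and training pieces (made-up values).
test_data_activities  <- data.frame(activity_id = c(3, 1, 2))
test_data_subject     <- data.frame(subject_id = c(2, 2, 4))
test_data             <- data.frame(V1 = c(0.1, 0.2, 0.3), V2 = c(-0.5, -0.4, -0.3))
train_data_activities <- data.frame(activity_id = c(2, 3))
train_data_subject    <- data.frame(subject_id = c(1, 3))
train_data            <- data.frame(V1 = c(0.4, 0.5), V2 = c(-0.2, -0.1))

# Hypothetical helper: add the description column, then drop the id column.
add_activity_label <- function(activities, labels) {
  idx <- match(activities$activity_id, labels$activity_id)
  activities$activity <- labels$activity[idx]
  activities$activity_id <- NULL
  activities
}
test_data_activities  <- add_activity_label(test_data_activities, activity_labels)
train_data_activities <- add_activity_label(train_data_activities, activity_labels)

# Attach subject and activity columns to the measurements, row by row.
test_data  <- cbind(test_data_subject, test_data_activities, test_data)
train_data <- cbind(train_data_subject, train_data_activities, train_data)

# Concatenate the test and training rows into one data frame.
all_data <- rbind(test_data, train_data)
```

Binding by row position only works if the subject, activity, and measurement files list observations in the same order, which is assumed here.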

## Part 3

### The data set is appropriately labelled with descriptive variable names.

  • The columns in the merged data set are renamed based on the feature data frame
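As a small illustration of this renaming, the sketch below assumes the merged data frame has the subject and activity columns first, followed by one measurement column per row of the features data frame. The column layout and the feature names are toy assumptions, not taken from the script.

```r
features <- data.frame(feature_id = 1:3,
                       feature = c("tBodyAcc-mean()-X",
                                   "tBodyAcc-mean()-Y",
                                   "tBodyAcc-std()-X"),
                       stringsAsFactors = FALSE)

all_data <- data.frame(subject_id = c(1, 2),
                       activity = c("WALKING", "LAYING"),
                       V1 = c(0.1, 0.2),
                       V2 = c(0.3, 0.4),
                       V3 = c(-0.9, -0.8),
                       stringsAsFactors = FALSE)

# Keep the first two names; replace the measurement names with the feature names.
names(all_data)[-(1:2)] <- features$feature
```

The resulting names contain characters such as `(` and `-`, so individual columns need backticks or `[[ ]]` to be referenced, for example `all_data[["tBodyAcc-mean()-X"]]`.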

## Part 4

### Extract only the measurements on the mean and standard deviation for each measurement.

  • Separate data frames are set up to hold the mean columns and the standard deviation columns
  • Only the mean measure names are taken from the existing features list and set as the rows of the mean_col_names data. A measure counts as mean-based if its name contains 'mean()'.
  • Only the standard deviation measure names are taken from the existing features list and set as the rows of the stdev_col_names data. A measure counts as standard deviation-based if its name contains 'sd()'.
  • The mean and standard deviation measures are stored within separate data frames
  • Blank data frames are set up for each of the mean and standard deviation measures with the correct number of rows
  • The mean data columns are bound together
  • The sd data columns are bound together
  • The row_num column, which was only an initial placeholder to establish the correct number of rows in each data frame, is then removed from the column lists in each variable
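The script builds the mean and standard deviation frames column by column around a placeholder. For comparison, the same selection can be sketched more compactly with logical indexing. This is an alternative technique, not the script's method. The toy column names and the pattern strings are assumptions: the toy names here spell the standard deviation features with 'std()', so the pattern should be adjusted to whatever the features list actually uses.

```r
all_data <- data.frame(subject_id = c(1, 2),
                       activity = c("WALKING", "LAYING"),
                       "tBodyAcc-mean()-X" = c(0.1, 0.2),
                       "tBodyAcc-std()-X" = c(-0.9, -0.8),
                       "tBodyAcc-meanFreq()-X" = c(0.5, 0.6),
                       check.names = FALSE,
                       stringsAsFactors = FALSE)

# Assumed patterns; fixed = TRUE matches the parentheses literally.
mean_pattern <- "mean()"
sd_pattern   <- "std()"

is_mean <- grepl(mean_pattern, names(all_data), fixed = TRUE)
is_sd   <- grepl(sd_pattern, names(all_data), fixed = TRUE)

# drop = FALSE keeps a data frame even when only one column matches.
mean_data <- all_data[, is_mean, drop = FALSE]
sd_data   <- all_data[, is_sd, drop = FALSE]

# Recombine the id columns with the selected measures.
mean_sd_data <- cbind(all_data[, c("subject_id", "activity")], mean_data, sd_data)
```

Because the match is literal, a name such as 'tBodyAcc-meanFreq()-X' is not selected by the 'mean()' pattern.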

## Part 5

### A second, independent tidy data set with the average of each variable for each activity and each subject is then created.

  • The data.table package is loaded with library()
  • The resultant data set from Part 3 is converted into a data.table so that grouping calculations can be performed on the data
  • A variable is set up to include only the names of the columns for which the means are required
  • The means are calculated across all numeric columns and grouped by activity and subject
  • The ordered data are then written to a file
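The grouping step might look roughly like the sketch below, which needs the data.table package installed. The toy data, the column names `mean_x` and `sd_x`, the sort order, and the temporary output file are all assumptions made for illustration; the README does not give the real output file name or the ordering.

```r
library(data.table)

# Toy input standing in for the labelled data set (made-up values).
tidy_input <- data.frame(subject_id = c(1, 1, 2, 2),
                         activity = c("WALKING", "WALKING", "LAYING", "LAYING"),
                         mean_x = c(0.1, 0.3, 0.2, 0.4),
                         sd_x = c(-0.9, -0.7, -0.8, -0.6),
                         stringsAsFactors = FALSE)
dt <- as.data.table(tidy_input)

# Only these columns are averaged; the grouping columns are excluded.
measure_cols <- setdiff(names(dt), c("subject_id", "activity"))

# Mean of every measure column within each activity/subject group.
tidy_means <- dt[, lapply(.SD, mean),
                 by = .(activity, subject_id),
                 .SDcols = measure_cols]

# Assumed ordering: by activity, then by subject.
setorder(tidy_means, activity, subject_id)

# Written to a temporary file here; the real script's file name is not shown.
out_file <- tempfile(fileext = ".txt")
write.table(tidy_means, out_file, row.names = FALSE)
```

Listing the measure columns in `.SDcols` keeps the grouping columns out of the mean calculation, which matches the step above that sets up a variable holding only the column names to average.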
