| 1 | +### Introduction |
| 2 | + |
| 3 | +This Markdown file describes the variables, the data, and any transformations or work performed |
| 4 | +to clean up the original data in order to transform it into the final dataset. |
| 5 | + |
| 6 | +Note that the file [run_analysis.R](https://github.com/ramperher/ProgrammingAssignment3/blob/master/run_analysis.R)
| 7 | +already explains every step followed in this project, but the steps are
| 8 | +repeated here for better understanding.
| 9 | + |
| 10 | +### Tasks |
| 11 | + |
| 12 | +**1. Merge the training and the test sets to create one data set** |
| 13 | + |
| 14 | +* Download the raw data and extract all files. The archive also documents the meaning of each file
| 15 | +and variable in the dataset (so we will not explain that here).
| 16 | +* Load the train and test data frames by reading each file involved with read.table and combining the results with cbind. The files are:
| 17 | + * Subject who performed the activity (from subject_train/test.txt). |
| 18 | + * Activity (from y_train/test.txt). |
| 19 | + * Measures (from X_train/test.txt). |
| 20 | +* Merge the train and test data frames using rbind. The result has the same number of
| 21 | +variables as each of the two data frames, and its number of observations is the sum of theirs.
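
The loading and merging steps above can be sketched as follows. This is a minimal illustration, not the code from run_analysis.R: the helper names `load_set` and `write_toy` and the tiny toy files are assumptions made so that the example runs without downloading the dataset.

```r
# Hypothetical helper: builds one data frame (subject, activity, measures)
# from three whitespace-separated text files.
load_set <- function(subject_file, activity_file, measures_file) {
  out <- cbind(
    read.table(subject_file),
    read.table(activity_file),
    read.table(measures_file)
  )
  # read.table names every first column "V1", so renumber the columns
  # to avoid duplicated names (descriptive names are assigned in task 4).
  names(out) <- paste0("V", seq_along(out))
  out
}

# Toy files standing in for the real subject_*/y_*/X_* files (assumption).
toy_dir <- tempfile("toy_har_")
dir.create(toy_dir)
write_toy <- function(name, lines) {
  path <- file.path(toy_dir, name)
  writeLines(lines, path)
  path
}

train <- load_set(
  write_toy("subject_train.txt", c("1", "1", "3")),
  write_toy("y_train.txt", c("5", "4", "1")),
  write_toy("X_train.txt", c("0.28 -0.02", "0.27 -0.01", "0.29 -0.03"))
)
test <- load_set(
  write_toy("subject_test.txt", c("2", "2")),
  write_toy("y_test.txt", c("5", "2")),
  write_toy("X_test.txt", c("0.25 -0.02", "0.26 -0.04"))
)

# Stack the observations: same columns, rows of train followed by rows of test.
df <- rbind(train, test)
dim(df)
```

With the real data, the same pattern would be applied to the files listed above instead of the toy ones.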
| 22 | + |
| 23 | +**2. Extract only the measurements on the mean and standard deviation for each measurement** |
| 24 | + |
| 25 | +* Read features.txt with read.table, which has the names of the measures in X_train/test.txt, and
| 26 | +transform them into a character vector.
| 27 | +* Find the positions of the names that contain "mean()" or "std()" and add 2 to them in order to select
| 28 | +the correct columns in our data frame (remember that the first and second columns in the data frame are the
| 29 | +subject and the activity). For this purpose, we can use the grep function with a regular expression.
| 30 | +* Update the data frame, keeping the columns found in the previous step in addition to the first and second columns
| 31 | +(subject and activity). Now, our data frame has 68 variables (subject, activity, 33
| 32 | +mean variables and another 33 std variables).
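
A minimal sketch of the column selection described above, using a four-row stand-in for features.txt. The regular expression shown is one possible choice and is an assumption; the exact pattern used in run_analysis.R may differ.

```r
# Toy stand-in for features.txt: an index column and a name column.
features <- read.table(
  text = "1 tBodyAcc-mean()-X
2 tBodyAcc-std()-X
3 tBodyAcc-meanFreq()-X
4 tBodyAcc-max()-X",
  stringsAsFactors = FALSE
)
feature_names <- as.character(features[, 2])

# Match the literal strings "mean()" and "std()". The escaped parentheses
# keep names such as "meanFreq()" out of the selection.
wanted <- grep("(mean|std)\\(\\)", feature_names)

# Shift by 2 because columns 1 and 2 hold the subject and the activity.
columns <- c(1, 2, wanted + 2)

# Toy data frame: subject, activity and four measure columns.
df <- data.frame(
  subject = c(1L, 2L),
  activity = c(5L, 4L),
  matrix(1:8 / 10, nrow = 2)
)
df <- df[, columns]
ncol(df)
```

Here only the first two toy features survive, so the reduced data frame keeps four columns: subject, activity, one mean and one std measure.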
| 33 | + |
| 34 | +**3. Use descriptive activity names to name the activities in the data set** |
| 35 | + |
| 36 | +* Read activity_labels.txt, which has the names of every activity, and transform them into a character
| 37 | +vector, as we did with features.txt.
| 38 | +* Transform the second column of df (activity) into a factor, using the previous character vector as its levels.
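
As an illustration of the conversion above, the sketch below maps numeric activity codes to their names with `factor`. It assumes the codes run from 1 to 6 in the order listed in activity_labels.txt; the vector `codes` is toy data.

```r
# Stand-in for activity_labels.txt: activity code and activity name.
activity_labels <- read.table(
  text = "1 WALKING
2 WALKING_UPSTAIRS
3 WALKING_DOWNSTAIRS
4 SITTING
5 STANDING
6 LAYING",
  stringsAsFactors = FALSE
)
activity_names <- as.character(activity_labels[, 2])

# Toy activity column as it appears in y_train/test.txt (numeric codes).
codes <- c(5L, 4L, 1L, 6L)

# levels are the numeric codes, labels are the descriptive names.
activity <- factor(codes, levels = seq_along(activity_names), labels = activity_names)
as.character(activity)
```

Passing the names as `labels` rather than `levels` matters here: with `levels = activity_names` the numeric codes would match no level and every value would become NA.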
| 39 | + |
| 40 | +**4. Appropriately label the data set with descriptive variable names** |
| 41 | + |
| 42 | +* Using the colnames function, we name all the variables in the dataset.
| 43 | + * The first and second columns are called "subject" and "activity", respectively.
| 44 | + * The remaining columns use the names obtained in task 2.
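
A small sketch of the naming step, assuming a toy data frame with two measure columns. `selected_names` is a hypothetical variable standing for the feature names kept in task 2.

```r
# Toy data frame with default, non-descriptive column names.
df <- data.frame(
  c(1L, 2L),
  factor(c("WALKING", "LAYING")),
  c(0.28, 0.25),
  c(-0.99, -0.97)
)

# Names of the measures selected in task 2 (toy values).
selected_names <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X")

# Assign all column names in one step; colnames<- keeps the names as given,
# including characters such as "-" and "()".
colnames(df) <- c("subject", "activity", selected_names)
colnames(df)
```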
| 45 | + |
| 46 | +**5. From the data set in step 4, create a second, independent tidy data set with the average of each variable for each activity and each subject** |
| 47 | + |
| 48 | +* Here we need the dplyr package and its group_by/summarise_each functions.
| 49 | + * First, group by subject and activity.
| 50 | + * Then, use summarise_each to compute the mean of the remaining variables.
| 51 | +* For better understanding, we rename the measure columns using the prefix "MEAN-".
| 52 | +* Save the tidy data frame into a file called "tidy_df.txt" with the write.table function.
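
The grouping and averaging steps above can be sketched with dplyr as shown below. Note one swap: `summarise_each` is deprecated in current dplyr releases, so this sketch uses `summarise(across(...))` instead, which needs dplyr 1.0.0 or later. The toy data, the use of `rename_with`, the temporary output path and the `row.names = FALSE` argument are all assumptions made for the example, not details taken from run_analysis.R.

```r
library(dplyr)

# Toy data frame: two subjects, two observations each.
df <- data.frame(
  subject = c(1L, 1L, 2L, 2L),
  activity = factor(c("WALKING", "WALKING", "LAYING", "LAYING"))
)
df[["tBodyAcc-mean()-X"]] <- c(0.2, 0.4, 0.1, 0.3)
df[["tBodyAcc-std()-X"]] <- c(-0.9, -0.7, -0.8, -0.6)

tidy_df <- df %>%
  group_by(subject, activity) %>%
  # everything() covers all non-grouping columns, i.e. the measures.
  summarise(across(everything(), mean), .groups = "drop") %>%
  # Prefix every measure column with "MEAN-", leaving the keys untouched.
  rename_with(~ paste0("MEAN-", .x), -c(subject, activity))

# Written to a temporary directory here; the project saves "tidy_df.txt".
out_file <- file.path(tempdir(), "tidy_df.txt")
write.table(tidy_df, out_file, row.names = FALSE)
```

The resulting data frame has one row per subject-activity pair and one `MEAN-` column per measure.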
| 53 | + |
| 54 | +### Result |
| 55 | + |
| 56 | +The result is the tidy data frame called [tidy_df.txt](https://github.com/ramperher/ProgrammingAssignment3/blob/master/tidy_df.txt), |
| 57 | +which presents the following structure: |
| 58 | + |
| 59 | +* `subject` - (integer) subject who performed the activity for each window sample. Its |
| 60 | +range is from 1 to 30. |
| 61 | +* `activity` - (factor) activity performed by the subject. The possible values are WALKING, |
| 62 | +WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING and LAYING. |
| 63 | +* `MEAN-<variable>` - (numeric) mean of `<variable>` for each subject-activity pair. The
| 64 | +variables involved are all the mean and std variables from the original dataset. Their meaning can
| 65 | +be found in the documentation of the original dataset.