Commit 730fc03

Create CodeBook.md
1 parent 054d6ee commit 730fc03

1 file changed

+65
-0
lines changed

CodeBook.md

### Introduction

This Markdown file describes the variables, the data, and the transformations performed
to clean up the original data and turn it into the final dataset.

Note that the file [run_analysis.R](https://github.com/ramperher/ProgrammingAssignment3/blob/master/run_analysis.R)
already explains every step followed in this project in full, but the steps are
summarized here for easier reference.

### Tasks

**1. Merge the training and the test sets to create one data set**

* Download the raw data and extract all files. The download also documents the meaning of each file
and variable in the dataset (so we will not explain that here).
* Load the train and test data frames by applying read.table and cbind to all the files involved, which are:
    * Subject who performed the activity (from subject_train/test.txt).
    * Activity (from y_train/test.txt).
    * Measures (from X_train/test.txt).
* Merge the train and test data frames using rbind. The result is a data frame with the same number of
variables as the train and test data frames, while the number of observations is the sum of the two.
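
The loading and merging steps above can be sketched as follows. This is a minimal illustration, not the code from run_analysis.R: it writes a tiny stand-in for the dataset into a temporary directory (the values and the flat directory layout are assumptions, and `load_split` is a hypothetical helper), then reads and combines the files with read.table, cbind and rbind.

```r
# Hypothetical miniature of the dataset, written to a temporary directory.
# The real files are much larger; the flat layout here is an assumption.
data_dir <- file.path(tempdir(), "har_sketch")
dir.create(data_dir, showWarnings = FALSE)

write_lines <- function(name, lines) writeLines(lines, file.path(data_dir, name))
write_lines("subject_train.txt", c("1", "1", "3"))
write_lines("y_train.txt",       c("5", "4", "1"))
write_lines("X_train.txt",       c("0.1 0.2", "0.3 0.4", "0.5 0.6"))
write_lines("subject_test.txt",  c("2", "2"))
write_lines("y_test.txt",        c("6", "2"))
write_lines("X_test.txt",        c("0.7 0.8", "0.9 1.0"))

# Read subject, activity and measures for one split and bind them column-wise,
# so that subject is column 1 and activity is column 2
load_split <- function(split) {
  read_part <- function(prefix) {
    read.table(file.path(data_dir, paste0(prefix, "_", split, ".txt")))
  }
  cbind(read_part("subject"), read_part("y"), read_part("X"))
}

train <- load_split("train")
test  <- load_split("test")

# Stack the two splits row-wise: same columns, rows are the sum of both
df <- rbind(train, test)
dim(df)  # 5 rows, 4 columns
```

Because both splits are built by the same helper, their columns line up by position, which is what rbind needs here.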

**2. Extract only the measurements on the mean and standard deviation for each measurement**

* Read features.txt with read.table, which has the names of the measures in X_train/test.txt, and
transform them into a character vector.
* Find the positions of the names that contain "mean()" or "std()" and add 2 to them in order to select
the correct columns in our data frame (remember that the first and second columns of the data frame are the
subject and the activity). For this purpose, we can use the grep function with a regular expression.
* Update the data frame by keeping the columns found above, in addition to the first and second columns
(subject and activity). Now our data frame has 68 variables (subject, activity, 33
mean variables and 33 std variables).
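
A small sketch of the selection step, assuming a handful of feature names in the style of features.txt (the subset and the toy data frame are illustrative, and the exact regular expression used in run_analysis.R may differ):

```r
# A few feature names in the style of features.txt (illustrative subset)
features <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X",
              "tBodyAcc-mad()-X", "fBodyAcc-meanFreq()-X",
              "tBodyAccMag-mean()")

# Match the literal strings "mean()" or "std()". The parentheses are escaped,
# so names such as "meanFreq()" are not selected.
feature_idx <- grep("mean\\(\\)|std\\(\\)", features)
feature_idx  # 1 2 5

# Shift by 2 because columns 1 and 2 of the data frame hold subject and activity
keep_cols <- c(1, 2, feature_idx + 2)
keep_cols  # 1 2 3 4 7

# Toy data frame: subject, activity, then one column per feature
df <- data.frame(subject = 1:2, activity = c(5, 4),
                 matrix(0, nrow = 2, ncol = length(features)))
df <- df[, keep_cols]
ncol(df)  # 4
```

Escaping the parentheses is what keeps the count at 33 mean and 33 std variables on the full dataset, since unescaped `mean` would also pick up the `meanFreq()` features.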

**3. Use descriptive activity names to name the activities in the data set**

* Read activity_labels.txt, which has the names of every activity, and transform them into a character
vector, as we did with features.txt.
* Transform the second column of df (activity) into a factor, using the previous character vector as levels.
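
One way to do this conversion is sketched below. It assumes activity_labels.txt holds one "code name" pair per line, with codes 1 to 6 in the order listed later in this document; the inline text stands in for the file, and the `factor(..., levels, labels)` call is one possible approach rather than necessarily the one used in run_analysis.R.

```r
# Inline stand-in for activity_labels.txt (assumed format: code, then name)
labels_txt <- "1 WALKING
2 WALKING_UPSTAIRS
3 WALKING_DOWNSTAIRS
4 SITTING
5 STANDING
6 LAYING"
activity_labels <- read.table(text = labels_txt, stringsAsFactors = FALSE)
activity_names  <- activity_labels[, 2]  # character vector of six names

# Toy data frame with numeric activity codes in the second column
df <- data.frame(subject = c(1, 1, 2), activity = c(5, 1, 6))

# Map codes 1..6 to their descriptive names
df$activity <- factor(df$activity, levels = activity_labels[, 1],
                      labels = activity_names)
as.character(df$activity)  # "STANDING" "WALKING" "LAYING"
```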

**4. Appropriately label the data set with descriptive variable names**

* With the colnames function, we assign names to all variables in the dataset.
* The first and second columns are called "subject" and "activity", respectively.
* The remaining columns use the names obtained in task 2.
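
As a brief illustration of the naming step, assuming a toy data frame and two feature names kept from task 2 (`selected_names` is a hypothetical variable name):

```r
# Toy data frame with default, uninformative column names
df <- data.frame(c(1, 2), c("WALKING", "LAYING"), c(0.1, 0.2), c(0.3, 0.4))

# Feature names kept in task 2 (illustrative subset)
selected_names <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X")

# Assign all names in one call: fixed names first, then the measure names
colnames(df) <- c("subject", "activity", selected_names)
colnames(df)
```

Assigning through colnames keeps the original feature names verbatim, including the parentheses and hyphens, which `data.frame()` would otherwise rewrite into syntactic names.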

**5. From the data set in step 4, create a second, independent tidy data set with the average of each variable for each activity and each subject**

* Here we need the dplyr package and its group_by/summarise_each functions.
* First, group by subject and activity.
* Then, use summarise_each to compute the mean of the remaining variables.
* For clarity, we rename the measure columns using the prefix "MEAN-".
* Save the tidy data frame to a file called "tidy_df.txt" with the write.table function.
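
The grouping and averaging steps can be sketched as below, on a toy data frame. Note that summarise_each is deprecated in current dplyr releases, so this sketch swaps in `summarise(across(...))`, which assumes dplyr 1.0 or later. The output path in a temporary directory and the `row.names = FALSE` argument are also assumptions, not taken from run_analysis.R.

```r
library(dplyr)

# Toy data frame in the shape produced by task 4
df <- data.frame(
  subject  = c(1, 1, 1, 2),
  activity = factor(c("WALKING", "WALKING", "SITTING", "WALKING")),
  `tBodyAcc-mean()-X` = c(0.2, 0.4, 0.5, 0.1),
  `tBodyAcc-std()-X`  = c(-0.9, -0.7, -0.8, -0.6),
  check.names = FALSE
)

# Average every measure column within each subject-activity pair
tidy_df <- df %>%
  group_by(subject, activity) %>%
  summarise(across(everything(), mean), .groups = "drop")

# Prefix the measure columns (everything except subject and activity)
colnames(tidy_df)[-(1:2)] <- paste0("MEAN-", colnames(tidy_df)[-(1:2)])

# Write the result as plain text (temporary path used for the sketch)
out_file <- file.path(tempdir(), "tidy_df.txt")
write.table(tidy_df, out_file, row.names = FALSE)
```

The toy input has three distinct subject-activity pairs, so `tidy_df` has three rows. On the full dataset the same pipeline gives one row per subject-activity pair.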

### Result

The result is the tidy data frame saved as [tidy_df.txt](https://github.com/ramperher/ProgrammingAssignment3/blob/master/tidy_df.txt),
which has the following structure:

* `subject` - (integer) subject who performed the activity for each window sample. Its
range is from 1 to 30.
* `activity` - (factor) activity performed by the subject. The possible values are WALKING,
WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING and LAYING.
* `MEAN-<variable>` - (numeric) mean of `<variable>` for every subject-activity pair. The
variables involved are all the mean and std variables from the original dataset. Their meaning can
be found in the documentation of the original dataset.
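
For readers who want to load the result back into R, a minimal sketch follows. It uses a hypothetical two-row stand-in file with made-up values rather than the real tidy_df.txt, and it assumes the file was written by write.table with a header row.

```r
# Hypothetical two-row stand-in for tidy_df.txt (values are made up)
sample_file <- file.path(tempdir(), "tidy_df_sample.txt")
sample_df <- data.frame(
  subject  = c(1L, 2L),
  activity = c("WALKING", "LAYING"),
  `MEAN-tBodyAcc-mean()-X` = c(0.28, 0.27),
  check.names = FALSE
)
write.table(sample_df, sample_file, row.names = FALSE)

# check.names = FALSE keeps names such as "MEAN-tBodyAcc-mean()-X" unchanged;
# stringsAsFactors = TRUE reads activity back as a factor
tidy <- read.table(sample_file, header = TRUE, check.names = FALSE,
                   stringsAsFactors = TRUE)
str(tidy)
```

Without `check.names = FALSE`, read.table would rewrite the hyphens and parentheses in the column names into dots.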
