---
title: "PracticalML_FinalProject"
author: "Pablo Rodriguez Chavez"
date: "March 25, 2018"
output: html_document
---
## Introduction
The objective of this project is to build a Human Activity Recognition classification model that infers the type of activity performed from accelerometer data.
The data were downloaded from the following URLs:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
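For reproducibility, the files can also be fetched directly from R; a minimal sketch (not run here, and assuming the working directory is where the files should be stored):
```{r, echo=TRUE, eval=FALSE}
# Download the raw CSV files into the working directory (requires an internet connection)
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
              destfile = "pml-training.csv")
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",
              destfile = "pml-testing.csv")
```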
We load the libraries caret and dplyr. dplyr provides utilities for data transformation, while caret provides tools for training machine learning models and acts as a wrapper around several other modelling packages.
```{r, echo=FALSE, cache=TRUE}
library("caret")
library("dplyr")
```
## Loading and Cleaning
As a first step, we load the data and split it into training and testing samples.
```{r, echo=TRUE, cache=TRUE}
setwd("C://Users/Pablo/Dropbox/DataScience")
datos<-read.csv("pml-training.csv")
set.seed(20180329)
itrain<-createDataPartition(y=datos$classe,p=0.7, list=FALSE)
training<-datos[itrain,]
testing<-datos[-itrain,]
```
```{r,echo=TRUE, cache=TRUE}
dim(training)
```
We have 160 variables, 159 excluding the target. A quick look at the data shows that many variables have a high proportion of missing values. There are also variables that, by common sense, have no causal relation to the outcome, so we remove them.
```{r, echo=TRUE,cache=TRUE}
tr1<-training %>% select(-c(X,user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp, new_window, num_window))
```
We write the following function to flag the variables whose percentage of NAs is above a given threshold.
```{r, echo=TRUE, cache=TRUE}
## Flag variables whose proportion of missing values exceeds a given threshold
removeNaVars <- function(tabla, thres){
  nr <- nrow(tabla)
  nc <- ncol(tabla)
  lista <- rep(FALSE, nc)
  for(i in 1:nc){
    # TRUE if the share of NAs in column i exceeds the threshold
    lista[i] <- sum(is.na(tabla[, i])) / nr > thres
  }
  return(lista)
}
```
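An equivalent, more vectorized way to obtain the same logical vector, shown only as a sketch and not used for the results below, is to compute the proportion of NAs per column with colMeans:
```{r, echo=TRUE, eval=FALSE}
# TRUE for columns whose proportion of NAs exceeds 90%
nas.alt <- colMeans(is.na(tr1)) > 0.9
```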
This function is used in much the same way as caret's nearZeroVar function.
```{r,echo=TRUE,cache=TRUE}
nas<-removeNaVars(tr1,0.9)
table(nas)
tr2<-tr1[,!nas]
ncol(tr2)
```
There were 67 variables with more than 90% missing values, which we removed above. Next, we look for near-zero-variance predictors using caret's nearZeroVar.
```{r,echo=TRUE, cache=TRUE}
nzv <- nearZeroVar(tr2, saveMetrics= TRUE)
table(nzv$nzv)
tr3<-tr2[,!nzv$nzv]
```
Even though tree-based methods can handle near-zero-variance variables, we remove them as well.
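For reference, we can confirm how many columns remain after both filters (tr3 still includes the classe outcome):
```{r, echo=TRUE, cache=TRUE}
# Columns left after removing high-NA and near-zero-variance variables
ncol(tr3)
```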
## Model Fitting
We chose to fit a boosting classifier, specifically a gradient boosted tree classifier, because of its good out-of-the-box performance and its robustness in the presence of missing values, sparse variables, and similar issues.
### Training Control and Metaparameter Tuning
Training is controlled using cross-validation. Initially we prepared a metaparameter grid for tuning, but it was computationally very expensive, so we decided not to tune with a custom grid.
```{r,echo=TRUE,cache=TRUE}
fitControl <- trainControl(## 10-fold CV
method = "cv",
number = 10)
#
#xgbtree.grid<-expand.grid(nrounds = c(1, 10, 20),
# max_depth = c(1, 4),
# eta = c(.1, .4),
# gamma = 0,
# colsample_bytree = .7,
# min_child_weight = 1,
# subsample = c(.8, 1))
```
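Note that the commented-out grid above uses the parameter names of caret's xgbTree method; the gbm method used below is tuned over n.trees, interaction.depth, shrinkage and n.minobsinnode instead. A hypothetical grid for gbm, had we tuned it manually, might look like the following sketch (values are illustrative only and were not used for the reported results):
```{r, echo=TRUE, eval=FALSE}
# Illustrative tuning grid for method = "gbm" (not evaluated)
gbm.grid <- expand.grid(n.trees = c(50, 100, 150),
                        interaction.depth = c(1, 2, 3),
                        shrinkage = 0.1,
                        n.minobsinnode = 10)
```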
### Model Training
Training uses stochastic gradient boosting with trees, as implemented in the gbm package.
```{r,echo=TRUE,cache=TRUE}
modelo <- train(classe ~ .,
data = tr3 ,
method = "gbm",
trControl = fitControl,
# tuneGrid=xgbtree.grid,
verbose = FALSE,
na.action=na.pass)
plot(modelo)
```
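The hyperparameter combination that caret selected from its default grid can be inspected through the train object's bestTune slot:
```{r, echo=TRUE, cache=TRUE}
# Tuning parameters chosen by cross-validation
modelo$bestTune
```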
As the plot shows, the model reaches very high accuracy. The following shows accuracy and kappa for each of the ten folds.
```{r, echo=TRUE, cache=TRUE}
modelo$resample
```
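These per-fold results can be summarized into a single cross-validated estimate (a small convenience summary, relying on the Accuracy column that caret reports for classification resamples):
```{r, echo=TRUE, cache=TRUE}
# Mean and standard deviation of accuracy across the 10 folds
mean(modelo$resample$Accuracy)
sd(modelo$resample$Accuracy)
```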
## Testing
Finally, even though the model has very high accuracy on the training data, we do not believe this is due to overfitting.
We validate it against our hold-out sample, the 30% of the observations from "pml-training.csv" set aside earlier.
As the confusion matrix below shows, the accuracy is as good as we expected.
```{r,echo=TRUE,cache=TRUE}
pred <- predict(modelo, testing)
cmtx<-confusionMatrix(pred,testing$classe)
print(cmtx)
```
This is a table of predicted vs. observed classes; about 96% of the observations fall on the diagonal.
```{r, echo=TRUE, cache=TRUE}
table(pred,testing$classe)
```
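The expected out-of-sample error can be read from the confusion matrix as one minus the hold-out accuracy, using the Accuracy entry stored in the object's overall slot:
```{r, echo=TRUE, cache=TRUE}
# Estimated out-of-sample error rate on the 30% hold-out set
1 - cmtx$overall["Accuracy"]
```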
Finally, we predict the 20 test cases in "pml-testing.csv" and save the predictions outside the GitHub repository.
```{r,echo=TRUE,cache=TRUE}
to.predict<-read.csv("pml-testing.csv")
pred.test <- predict(modelo, to.predict)
write.csv(pred.test,"../predicciones.csv")
```
## Final Quiz
The model correctly classified all 20 cases, scoring 20/20 on the final quiz.
## References
Ugulino, W., Cardador, D., Vega, K., Velloso, E., Milidiu, R., & Fuks, H. (2012). Wearable computing: Accelerometers' data classification of body postures and movements. In Advances in Artificial Intelligence-SBIA 2012 (pp. 52-61). Springer, Berlin, Heidelberg.
Ridgeway, G. (2004). The gbm package. R Foundation for Statistical Computing, Vienna, Austria, 5(3).
Kuhn, M. (2008). Caret package. Journal of Statistical Software, 28(5), 1-26.