---
title: "PracticalML_FinalProject"
author: "Pablo Rodriguez Chavez"
date: "March 25, 2018"
output: html_document
---
## Introduction
The objective of this project is to build a Human Activity Recognition classification model that infers the type of activity performed from accelerometer data.
The data were downloaded from the following URLs:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
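For reproducibility, the files can also be fetched directly from R; a minimal sketch (not run here, and assuming the working directory is where the files should be stored):
```{r, echo=TRUE, eval=FALSE}
# Download the raw CSV files into the working directory (requires an internet connection)
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
              destfile = "pml-training.csv")
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",
              destfile = "pml-testing.csv")
```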
We load the libraries caret and dplyr. dplyr provides utilities for data transformation, while caret provides tools for training machine learning models and acts as a wrapper around several other modelling packages.
```{r, echo=FALSE, cache=TRUE}
library("caret")
library("dplyr")
```
## Loading and Cleaning
As a first step, we load the data and split it into training and testing samples.
```{r, echo=TRUE, cache=TRUE}
setwd("C://Users/Pablo/Dropbox/DataScience")
datos<-read.csv("pml-training.csv")
set.seed(20180329)
itrain<-createDataPartition(y=datos$classe,p=0.7, list=FALSE)
training<-datos[itrain,]
testing<-datos[-itrain,]
```
```{r,echo=TRUE, cache=TRUE}
dim(training)
```
We have 160 variables, 159 excluding the target. A quick look at the data shows that many variables have a high proportion of missing values. There are also variables that, by common sense, have no causal relation to the outcome, so we remove them.
```{r, echo=TRUE,cache=TRUE}
tr1<-training %>% select(-c(X,user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp, new_window, num_window))
```
We write the following function to flag the variables whose percentage of NAs is above a given threshold.
```{r, echo=TRUE, cache=TRUE}
## Flag variables whose proportion of missing values exceeds a given threshold
removeNaVars <- function(tabla, thres){
  nr <- nrow(tabla)
  nc <- ncol(tabla)
  lista <- rep(FALSE, nc)
  for(i in 1:nc){
    # TRUE if the share of NAs in column i exceeds the threshold
    lista[i] <- sum(is.na(tabla[, i])) / nr > thres
  }
  return(lista)
}
```
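An equivalent, more vectorized way to obtain the same logical vector, shown only as a sketch and not used for the results below, is to compute the proportion of NAs per column with colMeans:
```{r, echo=TRUE, eval=FALSE}
# TRUE for columns whose proportion of NAs exceeds 90%
nas.alt <- colMeans(is.na(tr1)) > 0.9
```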
This function is used in much the same way as caret's nearZeroVar function.
```{r,echo=TRUE,cache=TRUE}
nas<-removeNaVars(tr1,0.9)
table(nas)
tr2<-tr1[,!nas]
ncol(tr2)
```
There were 67 variables with more than 90% missing values, which we removed above. Next, we look for near-zero-variance predictors using caret's nearZeroVar.
```{r,echo=TRUE, cache=TRUE}
nzv <- nearZeroVar(tr2, saveMetrics= TRUE)
table(nzv$nzv)
tr3<-tr2[,!nzv$nzv]
```
Even though tree-based methods can handle near-zero-variance variables, we remove them as well.
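For reference, we can confirm how many columns remain after both filters (tr3 still includes the classe outcome):
```{r, echo=TRUE, cache=TRUE}
# Columns left after removing high-NA and near-zero-variance variables
ncol(tr3)
```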
## Model Fitting
We chose to fit a boosting classifier, specifically a gradient boosted tree classifier, because of its good out-of-the-box performance and its robustness in the presence of missing values, sparse variables, and similar issues.
### Training Control and Metaparameter Tuning
Training is controlled using cross-validation. Initially we prepared a metaparameter grid for tuning, but it was computationally very expensive, so we decided not to tune with a custom grid.
```{r,echo=TRUE,cache=TRUE}
fitControl <- trainControl(## 10-fold CV
method = "cv",
number = 10)
#
#xgbtree.grid<-expand.grid(nrounds = c(1, 10, 20),
# max_depth = c(1, 4),
# eta = c(.1, .4),
# gamma = 0,
# colsample_bytree = .7,
# min_child_weight = 1,
# subsample = c(.8, 1))
```
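Note that the commented-out grid above uses the parameter names of caret's xgbTree method; the gbm method used below is tuned over n.trees, interaction.depth, shrinkage and n.minobsinnode instead. A hypothetical grid for gbm, had we tuned it manually, might look like the following sketch (values are illustrative only and were not used for the reported results):
```{r, echo=TRUE, eval=FALSE}
# Illustrative tuning grid for method = "gbm" (not evaluated)
gbm.grid <- expand.grid(n.trees = c(50, 100, 150),
                        interaction.depth = c(1, 2, 3),
                        shrinkage = 0.1,
                        n.minobsinnode = 10)
```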
### Model Training
Training uses stochastic gradient boosting with trees, as implemented in the gbm package.
```{r,echo=TRUE,cache=TRUE}
modelo <- train(classe ~ .,
data = tr3 ,
method = "gbm",
trControl = fitControl,
# tuneGrid=xgbtree.grid,
verbose = FALSE,
na.action=na.pass)
plot(modelo)
```
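The hyperparameter combination that caret selected from its default grid can be inspected through the train object's bestTune slot:
```{r, echo=TRUE, cache=TRUE}
# Tuning parameters chosen by cross-validation
modelo$bestTune
```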
As the plot shows, the model reaches very high accuracy. The following shows accuracy and kappa for each of the ten folds.
```{r, echo=TRUE, cache=TRUE}
modelo$resample
```
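These per-fold results can be summarized into a single cross-validated estimate (a small convenience summary, relying on the Accuracy column that caret reports for classification resamples):
```{r, echo=TRUE, cache=TRUE}
# Mean and standard deviation of accuracy across the 10 folds
mean(modelo$resample$Accuracy)
sd(modelo$resample$Accuracy)
```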
## Testing
Finally, even though the model has very high accuracy on the training data, we do not believe this is due to overfitting.
We validate it against our hold-out sample, the 30% of the observations from "pml-training.csv" set aside earlier.
As the confusion matrix below shows, the accuracy is as good as we expected.
```{r,echo=TRUE,cache=TRUE}
pred <- predict(modelo, testing)
cmtx<-confusionMatrix(pred,testing$classe)
print(cmtx)
```
This is a table of predicted vs. observed classes; about 96% of the observations fall on the diagonal.
```{r, echo=TRUE, cache=TRUE}
table(pred,testing$classe)
```
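The expected out-of-sample error can be read from the confusion matrix as one minus the hold-out accuracy, using the Accuracy entry stored in the object's overall slot:
```{r, echo=TRUE, cache=TRUE}
# Estimated out-of-sample error rate on the 30% hold-out set
1 - cmtx$overall["Accuracy"]
```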
Finally, we predict the 20 test cases in "pml-testing.csv" and save the predictions outside the GitHub repository.
```{r,echo=TRUE,cache=TRUE}
to.predict<-read.csv("pml-testing.csv")
pred.test <- predict(modelo, to.predict)
write.csv(pred.test,"../predicciones.csv")
```
## Final Quiz
The model correctly classified all 20 cases, scoring 20/20 on the final quiz.
## References
Ugulino, W., Cardador, D., Vega, K., Velloso, E., Milidiu, R., & Fuks, H. (2012). Wearable computing: Accelerometers' data classification of body postures and movements. In Advances in Artificial Intelligence-SBIA 2012 (pp. 52-61). Springer, Berlin, Heidelberg.
Ridgeway, G. (2004). The gbm package. R Foundation for Statistical Computing, Vienna, Austria, 5(3).
Kuhn, M. (2008). Caret package. Journal of Statistical Software, 28(5), 1-26.