***
## Problem description
The problem we are trying to handle consists of combining historical usage patterns with weather data in order to forecast bike rental demand in the [Capital Bikeshare](http://www.capitalbikeshare.com/system-data) program in Washington, D.C. Bike sharing systems are a means of renting bicycles in an automated way via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different place on an as-needed basis.
The problem was posted on [Kaggle](https://www.kaggle.com), and more information about the competition can be found [here](https://www.kaggle.com/c/bike-sharing-demand).
The data set comprises the first 19 days of each month from 2011 to 2012. The objective of the competition was to predict the total number of bike rentals for each hour.
**Data Fields**
* datetime - hourly date + timestamp;
* season - 1 = spring, 2 = summer, 3 = fall, 4 = winter;
* holiday - whether the day is considered a holiday;
* workingday - whether the day is neither a weekend nor holiday;
* weather:
+ 1: Clear, Few clouds, Partly cloudy;
+ 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist;
+ 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds;
+ 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog.
* temp - temperature in Celsius;
* atemp - "feels like" temperature in Celsius;
* humidity - relative humidity;
* windspeed - wind speed;
* **casual** - number of non-registered user rentals initiated;
* **registered** - number of registered user rentals initiated;
* **count** - number of total rentals.
***
## Setting Up: Load the packages and prepare input data
First of all, we install all the required packages, unless you have already done so.
```{r warning=FALSE, eval=FALSE}
packs = c("e1071", "PerformanceAnalytics", "Hmisc", "corrplot", "DMwR",
"maggrittr", "tidyr", "dplyr", "randomForest")
install.packages(packs, dependencies = TRUE)
```
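If some of these packages are already installed, a conditional variant avoids re-installing them; a minimal sketch:
```{r warning=FALSE, eval=FALSE}
# install only the packages that are not yet present
new_packs = packs[!packs %in% installed.packages()[, "Package"]]
if (length(new_packs) > 0) install.packages(new_packs, dependencies = TRUE)
```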
Once the required packages are installed, we need to load them into the current environment.
```{r warning=FALSE, eval=TRUE, message=FALSE}
library(e1071) # for skewness
library(PerformanceAnalytics) # for correlogram (chart.Correlation)
library(Hmisc) # for missing value treatment (impute)
library(corrplot) # for correlogram (corrplot)
library(DMwR) # for knnImputation
library(magrittr) # pipeline operation
library(tidyr) # data tidying
library(dplyr) # data processing functions
library(randomForest) # apply random forest algorithm
```
There are a lot of ways to import data into R. The most popular one is reading from a `csv` file. We can do it as follows:
```{r echo=FALSE}
# invisible code to set working directory
# set_path_to_your_working_directory = "/home/togrul/Dropbox/Data Challenge/CEDAWI Presentation/Bike problem"
set_path_to_your_working_directory = "C:\\Users\\t.jafarov\\Dropbox\\Data Challenge\\CEDAWI Presentation\\Bike problem"
```
```{r eval=TRUE, warning=FALSE}
# set your working directory
setwd(set_path_to_your_working_directory)
inputData = read.csv("data/train.csv", header = TRUE, stringsAsFactors = FALSE)
```
Now that we have imported the data into R, the first thing to do as a data analyst is to get some idea about the data. `R` has very useful functions for that. Let's first check the `summary` function, which gives us `descriptive statistics` of the data set.
```{r warning=FALSE}
summary(inputData)
```
Another important piece of information about the data set is the `data types`:
```{r warning=FALSE}
str(inputData)
```
***
## Creating derived variables
Before moving on to exploratory analysis, it is a good idea to derive as many new variables from your predictors as you think would be relevant for your dependent variable. In this way we can capture all the nuances that might be an explanatory cause for the dependent variable.
Throughout this analysis we are going to use the `%>%` pipe operator from the `magrittr` package, which makes code easy to read and easy to write. We are going to split the `datetime` variable in two different ways.
Column splitting with `default operations` :
```{r warning=FALSE, eval=FALSE}
inputData = separate(inputData, col = datetime, into = c("date", "time"), sep = " ", remove = FALSE)
inputData = separate(inputData, col = date, into = c("year", "month", "day"), sep = "-", remove = TRUE)
inputData = separate(inputData, col = time, into = c("hour", "minute", "second"), sep = ":", remove = TRUE)
```
Column splitting with `pipeline operation`:
```{r warning=FALSE}
inputData = inputData %>%
  separate(col = datetime, into = c("date", "time"), sep = " ", remove = FALSE) %>%
  separate(col = date, into = c("year", "month", "day"), sep = "-", remove = TRUE) %>%
  separate(col = time, into = c("hour", "minute", "second"), sep = ":", remove = TRUE)
```
Another useful function in `R` is `head`, which shows the first six rows of the data by default.
```{r}
head(inputData)
```
**Different types of variables.**
A **variable** is a property that can take on many values. For example, `temp` is a variable: it can take on many different values, such as 2.4, 3.0 and so on. `season` is a variable: it can take 4 different values, 1, 2, 3 or 4. There are two kinds of variables, illustrated below:
* **continuous** variables are measured numerically. With continuous variables we can do things like add, subtract, multiply and divide and get meaningful information (in the `temp` example, 4.0 is higher than 3.0 by 1.0);
* **categorical** variables allow for classification based on some characteristic. We cannot do the things we could do with continuous variables (in the `season` example, we cannot say that season 2 is twice season 1).
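A toy illustration of the difference (the values here are made up):
```{r eval=FALSE}
temps = c(2.4, 3.0, 4.0)         # continuous: arithmetic is meaningful
diff(temps)                      # differences of 0.6 and 1.0 degrees
seasons = factor(c(1, 2, 3, 4))  # categorical: labels, not magnitudes
levels(seasons)                  # "1" "2" "3" "4" -- classes, not quantities
```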
Data types are important in R.
```{r warning=FALSE}
# convert string to datetime data type
inputData$datetime = strptime(x = inputData$datetime, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
# get categorical variables
fcts = c("month", "hour", "season", "holiday", "workingday", "day", "weather")
# set string to categorical data type
for(i in 1:length(fcts)) inputData[,fcts[i]] = as.factor(inputData[,fcts[i]])
```
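A quick check (not evaluated here) that the conversions took effect:
```{r eval=FALSE}
# each listed column should now be of class "factor"
sapply(inputData[, fcts], class)
```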
Let's look at the `categorical variables` in detail.
```{r}
sapply(inputData[,c("month", "hour", "season", "holiday", "workingday", "day", "weather")], summary) # get summary statistics for all columns
```
We can note that `holiday=1` occurs less than `3%` of the time in the data set. Moreover, holidays are already reflected in `workingday`, so we can eliminate `holiday` from further analysis.
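A quick check of that claim (not evaluated here):
```{r eval=FALSE}
# share of observations per holiday level; level 1 is rare
prop.table(table(inputData$holiday))
```
Having confirmed this, we drop the variable: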
```{r eval=FALSE}
inputData$holiday = NULL
```
Another interesting point is the `weather` variable. `weather=1` accounts for the majority of the data set, and the other three categories are a minority, so we can regroup them into `weather_grouped=1` and `weather_grouped=2` as follows:
```{r}
inputData = within(inputData, {
  weather_grouped = ifelse(weather == 1, 1, 2)
})
# convert the grouped variable to a factor
inputData$weather_grouped = as.factor(inputData$weather_grouped)
```
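We can inspect how the original categories map onto the new grouping:
```{r eval=FALSE}
# cross-tabulation of original weather levels against the grouped variable
table(inputData$weather, inputData$weather_grouped)
```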
Segregate all continuous and categorical variables into different data frames.
```{r warning=FALSE, message=FALSE}
# Place all continuous vars in inputData_cont
inputData_cont <- inputData[, c("temp", "humidity", "windspeed")]
# Place all categorical variables in inputData_cat
inputData_cat <- inputData[, c("month", "hour", "season", "workingday", "weather_grouped")]
# Place the day variable separately
inputData_day = inputData$day
```
During this presentation we are going to forecast `registered` users only.
```{r warning=FALSE, message=FALSE}
# create the response data frame
inputData_response <- data.frame(registered=inputData[, "registered"])
response_name <- "registered" # name of response variable
response <- inputData[, response_name] # response variable as a vector
```
***
## Exploratory analysis
Now, let's generate different types of plots: density, scatter and box plots, in order to get more information about the continuous variables themselves and their interaction with the response variable, which is `registered` bike users. The following code automatically generates the three types of plots and saves them into separate `.png` files.
```{r warning=FALSE, message=FALSE}
for (k in names(inputData_cont)){
  png(file=paste(k, "_dens_scatter_box", ".png", sep=""), width=900, height=550)
  x <- as.numeric(inputData_cont[, k])
  Skewness <- round(skewness(x), 2) # calc skewness
  dens <- density(x, na.rm=T) # density func
  par(mfrow=c(1, 3)) # setup plot-area in 3 columns
  # Density plot
  plot(dens, type="l", col="red", ylab="Frequency", xlab = k, main = paste(k, ": Density Plot"), sub=paste("Skewness: ", Skewness))
  polygon(dens, col="red")
  # Scatterplot
  plot(x, response, col="blue", ylab="Response", xlab = k, main = paste(k, ": Scatterplot"), pch=20)
  abline(lm(response ~ x)) # add the fitted regression line
  # Boxplot
  boxplot(x, main=paste(k, ": Box Plot"), sub=paste("Outliers: ", paste(boxplot.stats(x)$out, collapse=" ")))
  dev.off()
}
```
The above code will create the density plot, scatter plot and box plot, all within one frame, for each continuous variable in the input data and store them in your working directory. It is preferable for the density plot to have a bell-shaped curve aligned to the center of the chart area. The scatterplot reveals whether there is an underlying linear relationship between the predictor and the response, while the box plot reveals any outliers in the variable (in which case, dots will be present beyond the whiskers, i.e. more than 1.5 times the IQR from the quartiles).
***
## Outlier analysis
Outlier analysis is one of the most important steps of data analysis, and there are a lot of methods to handle it. There are different reasons for outliers appearing in a data set.
One way of addressing the outlier problem is eliminating them from the data set. In R you can remove outliers as follows:
```{r eval=FALSE}
x <- x[!x %in% boxplot.stats(x)$out]
```
But we are not going to remove the outliers; instead, we will replace them with `NA` and treat them as missing values.
```{r warning=FALSE, message=FALSE}
# Function to replace outliers with NA
replace_outlier_with_missing <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...) # get %iles
  H <- 1.5 * IQR(x, na.rm = na.rm) # outlier limit threshold
  y <- x
  y[x < (qnt[1] - H)] <- NA # replace values below the lower bound
  y[x > (qnt[2] + H)] <- NA # replace values above the upper bound
  y # return the treated variable
}
```
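A quick check of the function on a toy vector (the values are made up):
```{r eval=FALSE}
# 100 lies far outside the 1.5 * IQR fences and becomes NA
replace_outlier_with_missing(c(1, 2, 3, 2, 100))
# [1]  1  2  3  2 NA
```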
After replacing the outliers in the continuous variables with `NA`s, we can combine everything back together:
```{r warning=FALSE, message=FALSE}
inputData = as.data.frame(sapply(inputData_cont, replace_outlier_with_missing)) %>%
  cbind(inputData_response, ., inputData_cat)
```
We can find the number of missing values per column as well:
```{r message=FALSE, warning=FALSE}
apply(inputData, 2, function(x) sum(is.na(x)))
```
***
## Missing value treatment
**Approaches for handling missing values**
**Eliminating the variable**: If a particular variable has more than 30% missing values, it is advisable to consider removing that variable altogether.
**Eliminating the observations**: If there are missing values scattered throughout your dataset and you have a good number of observations in your sample, you may choose to eliminate the observations with missing values. If, after doing this, you find you are losing more than 30% of the sample size, then you probably need to look for any particular variable that contributes the largest number of missing values.
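Neither elimination is applied in this analysis, but a minimal sketch of both steps could look like this:
```{r eval=FALSE}
# drop variables with more than 30% missing values ...
miss_frac = colMeans(is.na(inputData))
inputData = inputData[, miss_frac <= 0.30]
# ... then drop any remaining incomplete observations
inputData = na.omit(inputData)
```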
**Imputation**: Alternatively, you may impute the missing values with a suitable placeholder.
There are several methods to handle missing values:
* Method 1: Replace missing values with `mean` / `median` / `most frequent` value(s);
```{r eval=FALSE}
# impute missing values with the mean using 'impute' func from 'Hmisc' package
inputdata_cont_matrix <- sapply(inputData_cont, FUN=function(x){impute(x, mean)})
```
* Method 2: Predict the missing values with a regression-based approach, modelling the incomplete variable as a dependent variable (a sketch follows the k-NN code below);
* Method 3: k-Nearest neighbor imputation method.
```{r message=FALSE, warning=FALSE}
if( anyNA(inputData) ) {
  # knnImputation uses k = 10 neighbors by default
  inputData[, !names(inputData) %in% response_name] <- knnImputation(inputData[, !names(inputData) %in% response_name])
}
```
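Method 2 is not applied above; a minimal sketch for a single variable, using `windspeed` purely as a hypothetical example, could be:
```{r eval=FALSE}
# fit a regression on the complete cases and predict the missing entries
miss = is.na(inputData_cont$windspeed)
fit = lm(windspeed ~ temp + humidity, data = inputData_cont, subset = !miss)
inputData_cont$windspeed[miss] = predict(fit, newdata = inputData_cont[miss, ])
```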
***
## Correlation analysis
The correlogram is a graphical representation of correlation amongst all combinations of response and predictor variables. The investigator may choose to pick variables that have high correlation with the response based on a pre-determined cutoff. But we do not do that filtering here, since it is possible that a variable with relatively lower correlation can prove significant and valuable in regression models to capture unexplained variations.
Let's see the correlation between variables:
```{r message=FALSE, warning=FALSE}
inputData_response_cont_noNA <- na.omit(cbind(inputData_response, inputData_cont))
chart.Correlation(inputData_response_cont_noNA, histogram=TRUE, pch=19)
```
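The `corrplot` package, loaded earlier but not used so far, offers an alternative view of the same matrix; a sketch:
```{r eval=FALSE}
# correlation matrix rendered with corrplot
M = cor(inputData_response_cont_noNA)
corrplot(M, method = "number", type = "upper")
```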
***
## Data modelling
How to find the best model?
Split the data set into training and test sets; days 17, 18 and 19 of each month are held out as the test set.
```{r eval=TRUE}
inputData = cbind(day=inputData_day, inputData)
test = inputData %>% filter(day %in% c(17,18,19))
train = inputData %>% filter(! day %in% c(17,18,19))
```
Size of the data sets:
```{r eval=TRUE}
dim(test); dim(train)
```
**Cross-validation**
We are going to use the `Random Forest` algorithm. In order to find the best parameter for the algorithm, we will tune the `ntree` parameter using `5-fold cross-validation`.
```{r eval=TRUE}
k = 5
# tuning parameter: number of trees.
trees = c(30, 50, 70)
frml = "registered ~ .-day"
errors = matrix(data = NA, nrow = k, ncol = length(trees))
for(i in 1:length(trees)){
  for(j in seq(k)){
    # days are zero-padded strings ("01".."19") after separate(), so pad the sample
    index = sprintf("%02d", sample(1:16, 3, replace = FALSE))
    test.cv = filter(train, day %in% index)
    train.cv = filter(train, ! day %in% index)
    # fit the model
    fit = randomForest(as.formula(frml), data = train.cv, ntree = trees[i])
    pred = predict(fit, test.cv)
    # Root-mean-square error - RMSE
    errors[j,i] = sqrt(sum((test.cv$registered - pred)^2) / nrow(test.cv))
  }
}
```
Let's find out the best parameter for `random forest`.
```{r}
## plot the results
mean.cv.errors = apply(errors, 2, mean)
mean.cv.errors
mean.cv.errors[which.min(mean.cv.errors)]
plot(trees, mean.cv.errors,type='b', main="Parameter tuning",
xlab="Number of trees", ylab="Mean RMSE"); grid()
points(trees[which.min(mean.cv.errors)], mean.cv.errors[which.min(mean.cv.errors)], col="red", pch=20)
```
Once we have found the best parameter, we can calculate the test error of the model on the `test` data.
```{r warning=FALSE, message=FALSE}
# Train model on `training` data set
model = randomForest(as.formula(frml), data = train, ntree = trees[which.min(mean.cv.errors)])
# Get the test error
pred = predict(model, test)
```
```{r warning=FALSE, message=FALSE}
# Root-mean-square error
sqrt(sum((test$registered - pred)^2) / nrow(test))
```
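As an optional follow-up, `randomForest` can show which predictors the final model relied on most:
```{r eval=FALSE}
# variable importance (node-purity based by default)
varImpPlot(model, main = "Variable importance")
```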
## Follow me
Twitter [toghrul_jafarov](https://twitter.com/toghrul_jafarov)
GitHub [jtoghrul](https://github.com/jtoghrul)
## References
* [Advanced Linear Regression](http://rstatistics.net/linear-regression-advanced-modelling-algorithm-example-with-r/).
* [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/).
* [SciPy 2015](http://scipy2015.scipy.org/ehome/index.php?eventid=115969&).