forked from rdpeng/RepData_PeerAssessment1
---
title: "Week 2 Reproducible research"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
First, reading the data in.
```{r}
d <- read.csv("activity.csv")
```
### Total number of steps per day.
Calculating the total number of steps per day using `tapply`, then plotting a histogram and calculating the mean and median.
```{r}
spd <- tapply(d$steps,d$date,sum)
hist(spd, main="Number of steps taken per day")
mean(spd,na.rm=TRUE)
median(spd,na.rm=TRUE)
```
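One detail worth keeping in mind: `tapply` with `sum` and no `na.rm = TRUE` returns `NA` for any day that contains a missing value, and those days are then left out of the histogram, mean and median. A minimal sketch on made-up data (the `toy` data frame and its values are hypothetical, not taken from `activity.csv`):

```r
# Hypothetical toy data: three days, two intervals each; day 2 is entirely missing
toy <- data.frame(
  steps = c(10, 20, NA, NA, 5, 15),
  date  = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2)
)

# Without na.rm, a day containing NA gets an NA total
toy_totals <- tapply(toy$steps, toy$date, sum)
print(toy_totals)

# The NA day is dropped from the mean rather than counted as zero
toy_mean <- mean(toy_totals, na.rm = TRUE)
print(toy_mean)
```

Passing `na.rm = TRUE` to `sum` instead would turn the all-missing day into a total of 0, which would pull the mean down.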
### Average daily activity pattern.
Calculating the mean number of steps for each interval and then plotting. Also extracting the interval with the largest mean.
```{r}
ada <- tapply(d$steps,d$interval,mean,na.rm=TRUE)
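plot(as.numeric(names(ada)), ada, type = "l", xlab = "Interval", ylab = "Mean number of steps")
# which.max() returns the position of the largest mean; the element's name is the interval
ada[which.max(ada)]
```
Interval 835 contains, on average across all the days in the dataset, the maximum number of steps.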
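To pick out the interval with the largest mean, `which.max` returns the position of the maximum and keeps the element's name. A small sketch using a made-up named vector (`toy_means` and its values are hypothetical):

```r
# Hypothetical interval means, named by interval identifier
toy_means <- c("0" = 1, "5" = 4, "10" = 9, "15" = 2)

# which.max gives the position of the largest value; the name is the interval
peak <- toy_means[which.max(toy_means)]
print(peak)

# The interval identifier as a number
peak_interval <- as.integer(names(peak))
print(peak_interval)
```

Indexing with `toy_means == max(toy_means)` selects the same element. A single `=` inside the brackets does not perform a comparison, so the vector would be indexed by the maximum value itself rather than by its position.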
### Imputing missing values.
Counting the number of NAs
```{r}
sum(!complete.cases(d))
```
The total number of rows with NAs is 2304.
I have decided to impute missing values with the mean of that 5-minute interval across all days.
```{r}
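dc <- d
# Mean of each 5-minute interval, repeated for every row of that interval
interval_means <- ave(dc$steps, dc$interval, FUN = function(x) mean(x, na.rm = TRUE))
# Replace only the missing step counts with the corresponding interval mean
dc$steps[is.na(d$steps)] <- interval_means[is.na(d$steps)]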
```
The code above does that.
Histogram of the total number of steps taken each day with missing data imputed, using the same strategy as for the previous histogram. Getting the mean and median as well.
```{r}
spdc <- tapply(dc$steps,dc$date,sum)
hist(spdc, main="Number of steps taken per day")
mean(spdc)
median(spdc)
```
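The mean of the daily totals is the same in the imputed and non-imputed datasets, so imputing with interval means does not shift the mean. This is expected when the missing values cover whole days, since each such day then receives the sum of the interval means, which equals the mean daily total. The median, however, is slightly smaller in the non-imputed data.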
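The imputation step and its effect on the mean can be illustrated on toy data. The sketch below assumes, as a simplification, that missing values cover whole days; `toy`, `toy_imp` and all numbers are made up for illustration.

```r
# Hypothetical toy data: three days, two intervals each; day 2 is entirely missing
toy <- data.frame(
  steps    = c(10, 20, NA, NA, 30, 40),
  date     = rep(c("d1", "d2", "d3"), each = 2),
  interval = rep(c(0, 5), times = 3)
)

# ave() returns, for every row, the mean of that row's interval group
interval_mean <- ave(toy$steps, toy$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))

# Fill only the missing rows
toy_imp <- toy
toy_imp$steps[is.na(toy$steps)] <- interval_mean[is.na(toy$steps)]

# Daily totals before and after imputation
before <- tapply(toy$steps, toy$date, sum)
after  <- tapply(toy_imp$steps, toy_imp$date, sum)

mean_before <- mean(before, na.rm = TRUE)
mean_after  <- mean(after)
print(c(before = mean_before, after = mean_after))
```

In this toy case the imputed day receives the sum of the interval means, so the mean of the daily totals stays the same.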
### Are there differences in activity patterns between weekdays and weekends?
Creating a new factor variable in the dataset with two levels – “weekday” and “weekend” indicating whether a given date is a weekday or weekend day.
```{r}
dc$date <- as.Date(dc$date)
# Saturdays and Sundays are "weekend"; every other day is "weekday"
dc$day <- factor(ifelse(weekdays(dc$date) %in% c("Saturday", "Sunday"), "weekend", "weekday"))
```
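`weekdays()` returns day names in the current locale, so the comparison with "Saturday" and "Sunday" assumes an English locale. A locale-independent sketch, offered as an alternative rather than the method used above, relies on the `wday` field of `POSIXlt` (0 is Sunday, 6 is Saturday); `toy_dates` and `day_type` are illustrative names:

```r
# Hypothetical dates: 2012-10-05 is a Friday, followed by Saturday, Sunday, Monday
toy_dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# wday is 0 for Sunday and 6 for Saturday, regardless of locale
wday <- as.POSIXlt(toy_dates)$wday
day_type <- factor(ifelse(wday %in% c(0, 6), "weekend", "weekday"),
                   levels = c("weekday", "weekend"))
print(data.frame(date = toy_dates, day = day_type))
```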
Creating a plot of the average number of steps per interval, split by weekday/weekend.
```{r}
library(ggplot2)
dc$interval <- as.integer(dc$interval)
ggplot(dc, aes(x = interval, y = steps)) +
  stat_summary(fun = mean, geom = "point") +  # `fun` replaces the deprecated `fun.y`
  geom_smooth() + facet_grid(day ~ .)
```
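`stat_summary` computes the per-interval means inside the plot. If the averaged values are needed as a table, for instance to draw one line per panel, they can be computed beforehand with `aggregate`. A sketch on made-up data; `toy` and `avg` are hypothetical names:

```r
# Hypothetical toy data: two intervals observed on weekdays and weekend days
toy <- data.frame(
  steps    = c(10, 30, 20, 40, 100, 200),
  interval = c(0, 0, 5, 5, 0, 5),
  day      = factor(c("weekday", "weekday", "weekday", "weekday",
                      "weekend", "weekend"))
)

# Mean steps for every interval/day-type combination
avg <- aggregate(steps ~ interval + day, data = toy, FUN = mean)
print(avg)
```

The resulting data frame could then be passed to `ggplot(avg, aes(x = interval, y = steps)) + geom_line() + facet_grid(day ~ .)`.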