title | output |
---|---|
Week 2 Reproducible research |
html_document |
First reading the data in
d <- read.csv("activity.csv")
Total number of steps per day. Calculating the total number per day using tapply, then plotting a histogram and calculating the mean.
spd <- tapply(d$steps,d$date,sum)
hist(spd, main="Number of steps taken per day")
mean(spd,na.rm=TRUE)
## [1] 10766.19
median(spd,na.rm=TRUE)
## [1] 10765
Calculating the means for each inerval and then plotting. Also subsetting the interval with the largest mean.
ada <- tapply(d$steps,d$interval,mean,na.rm=TRUE)
plot(names(ada),ada,type = "l",xlab="Interval")
ada[ada=max(ada)]
## 1705
## 56.30189
The interval 1705 on average across all the days in the dataset, contains the maximum number of steps.
Counting the number of NAs
sum(!complete.cases(d))
## [1] 2304
The total number of repws with NAs = 2304.
I have decided to impute missing values with the mean of that 5 minute interval.
dc <- d
dc$steps[is.na(dc$steps)] = ave(dc$steps,
dc$interval,
FUN=function(x)mean(x,
na.rm = T))[is.na(d$steps)]
The code above code does that
Histogram of total number of steps each day with missing data imputed. Using the same strategy as was used for the previous histogram. Getting the mean and median as well.
spdc <- tapply(dc$steps,dc$date,sum)
hist(spdc, main="Number of steps taken per day")
mean(spdc)
## [1] 10766.19
median(spdc)
## [1] 10766.19
The means of the imputed and non imputed datasets are the same, showing that imputing does not have an effect, however the median is smaller in the non imputed data.
Creating a new factor variable in the dataset with two levels – “weekday” and “weekend” indicating whether a given date is a weekday or weekend day.
dc$date <- as.Date(dc$date)
dc$day <- ifelse(weekdays(dc$date)=="Saturday","weekend","")
dc$day <- ifelse(weekdays(dc$date)=="Sunday","weekend",dc$day)
dc$day <- ifelse(dc$day=="","weekday",dc$day)
dc$day <- as.factor(dc$day)
Creating a plot with average over weekday/weekend
library(ggplot2)
dc$interval <- as.integer(dc$interval)
ggplot(dc, aes(x=interval, y=steps)) + stat_summary(fun.y="mean", geom="point") + geom_smooth()+ facet_grid(day~.)
## `geom_smooth()` using method = 'gam'