forked from mGalarnyk/datasciencecoursera
Added quiz 1 template to Regression Models
Showing 5 changed files with 128 additions and 6,561 deletions.
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,110 +1,174 @@ | ||
# Getting and Cleaning Data Quiz 1 (JHU) Coursera | ||
# Regression Models Quiz 1 (JHU) Coursera | ||
|
||
Question 1 | ||
---------- | ||
The American Community Survey distributes downloadable data about United States communities. Download the 2006 microdata survey about housing for the state of Idaho using download.file() from here: | ||
Consider the data set given below | ||
|
||
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv | ||
```R | ||
x <- c(0.18, -1.54, 0.42, 0.95) | ||
``` | ||
|
||
And weights given by | ||
|
||
```R | ||
w <- c(2, 1, 3, 1) | ||
``` | ||
|
||
and load the data into R. The code book, describing the variable names is here: | ||
Give the value of μ that minimizes the least squares equation | ||
|
||
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FPUMSDataDict06.pdf | ||
$\sum_{i=1}^{n} w_i (x_i - \mu)^2$ | ||
|
||
How many housing units in this survey were worth more than $1,000,000? | ||
* 0.1471 | ||
|
||
```R | ||
# fread url requires curl package on mac | ||
# install.packages("curl") | ||
* 0.0025 | ||
|
||
library(data.table) | ||
housing <- data.table::fread("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv") | ||
* 0.300 | ||
|
||
# VAL attribute says how much property is worth, .N is the number of rows | ||
# VAL == 24 means more than $1,000,000 | ||
housing[VAL == 24, .N] | ||
* 1.077 | ||
|
||
# Answer: | ||
# 53 | ||
```R | ||
minu <- sum(x*w) / sum(w) | ||
|
||
# Answer | ||
# 0.1471429 | ||
``` | ||
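
The following derivation is a supporting sketch, not part of the original quiz. It shows why the weighted mean is the minimizer: differentiate the weighted least squares criterion with respect to μ and set the derivative to zero.

$$
\frac{d}{d\mu}\sum_{i=1}^{n} w_i (x_i - \mu)^2
= -2\sum_{i=1}^{n} w_i (x_i - \mu) = 0
\quad\Longrightarrow\quad
\hat{\mu} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}
$$

With the values above this is 1.03 / 7 ≈ 0.1471. Because the weights are positive, the criterion is convex in μ, so the stationary point is the minimum.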
|
||
Question 2 | ||
---------- | ||
Use the data you loaded from Question 1. Consider the variable FES in the code book. Which of the "tidy data" principles does this variable violate? | ||
Consider the following data set | ||
|
||
### Answer | ||
Tidy data one variable per column | ||
```R | ||
x <- c(0.8, 0.47, 0.51, 0.73, 0.36, 0.58, 0.57, 0.85, 0.44, 0.42) | ||
y <- c(1.39, 0.72, 1.55, 0.48, 1.19, -1.59, 1.23, -0.65, 1.49, 0.05) | ||
``` | ||
Fit the regression through the origin and get the slope treating y | ||
|
||
Question 3 | ||
---------- | ||
Download the Excel spreadsheet on Natural Gas Aquisition Program here: | ||
as the outcome and x as the regressor. (Hint, do not center the data since we want regression through the origin, not through the means of the data.) | ||
|
||
* -0.04462 | ||
|
||
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FDATA.gov_NGAP.xlsx | ||
* -1.713 | ||
|
||
Read rows 18-23 and columns 7-15 into R and assign the result to a variable called: | ||
* 0.8263 | ||
|
||
dat | ||
* 0.59915 | ||
|
||
What is the value of: | ||
```R | ||
sum(dat$Zip*dat$Ext,na.rm=T) | ||
summary(lm(y~x-1)) | ||
|
||
# Answer | ||
# 0.8263 | ||
``` | ||
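
As a cross-check on the `lm()` call above, the slope of a regression through the origin also has the closed form `sum(x * y) / sum(x^2)`. The sketch below compares the two; the names `slope_closed_form` and `slope_lm` are illustrative and do not appear in the original.

```R
x <- c(0.8, 0.47, 0.51, 0.73, 0.36, 0.58, 0.57, 0.85, 0.44, 0.42)
y <- c(1.39, 0.72, 1.55, 0.48, 1.19, -1.59, 1.23, -0.65, 1.49, 0.05)

# Closed-form slope for regression through the origin (no intercept term)
slope_closed_form <- sum(x * y) / sum(x^2)

# Same slope from lm(); the "- 1" in the formula drops the intercept
slope_lm <- unname(coef(lm(y ~ x - 1)))

print(c(closed_form = slope_closed_form, lm = slope_lm))
```

Both values should agree to floating-point precision and round to 0.8263.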
(original data source: http://catalog.data.gov/dataset/natural-gas-acquisition-program) | ||
|
||
```R | ||
fileUrl <- "http://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FDATA.gov_NGAP.xlsx" | ||
download.file(fileUrl, destfile = paste0(getwd(), '/getdata%2Fdata%2FDATA.gov_NGAP.xlsx'), method = "curl") | ||
Question 3 | ||
---------- | ||
Do data(mtcars) from the datasets package and fit the regression | ||
|
||
dat <- xlsx::read.xlsx(file = "getdata%2Fdata%2FDATA.gov_NGAP.xlsx", sheetIndex = 1, rowIndex = 18:23, colIndex = 7:15) | ||
sum(dat$Zip*dat$Ext,na.rm=T) | ||
model with mpg as the outcome and weight as the predictor. Give | ||
|
||
# Answer: | ||
# 36534720 | ||
``` | ||
the slope coefficient. | ||
|
||
* 0.5591 | ||
|
||
* 30.2851 | ||
|
||
* -5.344 | ||
|
||
* -9.559 | ||
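
A minimal sketch of one way to fit this model, assuming the `wt` column of `mtcars` is the intended weight variable (the quiz text says only "weight"). The names `fit` and `slope` are illustrative.

```R
data(mtcars)  # ships with base R in the datasets package

# mpg as the outcome, wt (weight in 1000 lbs) as the predictor
fit <- lm(mpg ~ wt, data = mtcars)

# Extract the slope coefficient for wt
slope <- unname(coef(fit)["wt"])
print(round(slope, 3))
```

Under that assumption the slope comes out at about -5.344.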
|
||
Question 4 | ||
---------- | ||
Read the XML data on Baltimore restaurants from here: | ||
Consider data with an outcome (Y) and a predictor (X). The standard deviation of the predictor is one half that of the outcome. The correlation between the two variables is .5. What value would the slope coefficient be for the regression model with Y as the outcome and X as the predictor? | ||
|
||
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml | ||
* 1 | ||
|
||
How many restaurants have zipcode 21231? | ||
* 0.25 | ||
|
||
Use http instead of https, which caused the message Error: XML content does not seem to be XML: 'https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml'. | ||
* 3 | ||
|
||
```R | ||
# install.packages("XML") | ||
library("XML") | ||
fileURL<-"https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml" | ||
doc <- XML::xmlTreeParse(sub("s", "", fileURL), useInternal = TRUE) | ||
rootNode <- XML::xmlRoot(doc) | ||
|
||
zipcodes <- XML::xpathSApply(rootNode, "//zipcode", XML::xmlValue) | ||
xmlZipcodeDT <- data.table::data.table(zipcode = zipcodes) | ||
xmlZipcodeDT[zipcode == "21231", .N] | ||
|
||
# Answer: | ||
# 127 | ||
``` | ||
* 4 | ||
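
This question can be worked by hand from the standard identity for the least squares slope. The lines below are a sketch of that arithmetic, using SD(X) = SD(Y)/2 from the question:

$$
\hat{\beta}_1 = \mathrm{Cor}(Y, X)\,\frac{\mathrm{SD}(Y)}{\mathrm{SD}(X)}
= 0.5 \times \frac{\mathrm{SD}(Y)}{\tfrac{1}{2}\,\mathrm{SD}(Y)}
= 0.5 \times 2 = 1
$$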
|
||
Question 5 | ||
---------- | ||
The American Community Survey distributes downloadable data about United States communities. Download the 2006 microdata survey about housing for the state of Idaho using download.file() from here: | ||
Students were given two hard tests and scores were normalized to have empirical mean 0 and variance 1. The correlation between the scores on the two tests was 0.4. What would be the expected score on Quiz 2 for a student who had a normalized score of 1.5 on Quiz 1? | ||
|
||
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv | ||
* 1.0 | ||
|
||
using the fread() command load the data into an R object | ||
* 0.6 | ||
|
||
DT | ||
* 0.4 | ||
|
||
Which of the following is the fastest way to calculate the average value of the variable | ||
* 0.16 | ||
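
When both scores are normalized (mean 0, variance 1), the regression slope equals the correlation and the intercept is 0, so the prediction is just the correlation times the observed score. A small sketch of that arithmetic follows; the variable names are hypothetical.

```R
rho <- 0.4           # correlation between the two normalized scores
score_first <- 1.5   # normalized score on the first test

# With mean 0 and SD 1 on both sides: slope = rho * 1 / 1, intercept = 0 - slope * 0
slope <- rho * 1 / 1
intercept <- 0 - slope * 0

predicted_second <- intercept + slope * score_first
print(predicted_second)
```

The prediction is 0.6, which is closer to the mean than 1.5: an instance of regression to the mean.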
|
||
pwgtp15 | ||
|
||
broken down by sex using the data.table package? | ||
Question 6 | ||
---------- | ||
Consider the data given by the following | ||
|
||
```R | ||
DT <- data.table::fread("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv") | ||
x <- c(8.58, 10.46, 9.01, 9.64, 8.86) | ||
``` | ||
|
||
What is the value of the first measurement if x were normalized (to have mean 0 and variance 1)? | ||
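
A minimal sketch of the normalization, assuming the sample standard deviation (the n - 1 denominator that R's `sd()` uses) is the intended one. The name `x_normalized` is illustrative.

```R
x <- c(8.58, 10.46, 9.01, 9.64, 8.86)

# Normalize: subtract the sample mean, then divide by the sample SD
x_normalized <- (x - mean(x)) / sd(x)

# First normalized measurement
print(round(x_normalized[1], 4))
```

Under that assumption the first value is about -0.9719.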
|
||
Question 7 | ||
---------- | ||
Consider the following data set (used above as well). What is the intercept for fitting the model with x as the predictor and y as the outcome? | ||
|
||
# Answer (fastest): | ||
system.time(DT[,mean(pwgtp15),by=SEX]) | ||
```R | ||
x <- c(0.8, 0.47, 0.51, 0.73, 0.36, 0.58, 0.57, 0.85, 0.44, 0.42) | ||
y <- c(1.39, 0.72, 1.55, 0.48, 1.19, -1.59, 1.23, -0.65, 1.49, 0.05) | ||
``` | ||
|
||
* -1.713 | ||
|
||
* 1.252 | ||
|
||
* 2.105 | ||
|
||
* 1.567 | ||
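
One possible way to get the intercept is shown below, together with the closed form `mean(y) - slope * mean(x)` as a cross-check. The names `fit`, `intercept`, and `intercept_closed_form` are illustrative.

```R
x <- c(0.8, 0.47, 0.51, 0.73, 0.36, 0.58, 0.57, 0.85, 0.44, 0.42)
y <- c(1.39, 0.72, 1.55, 0.48, 1.19, -1.59, 1.23, -0.65, 1.49, 0.05)

# Ordinary least squares with an intercept
fit <- lm(y ~ x)
intercept <- unname(coef(fit)["(Intercept)"])

# Closed form: slope = cor * sd(y) / sd(x), intercept = mean(y) - slope * mean(x)
slope <- cor(x, y) * sd(y) / sd(x)
intercept_closed_form <- mean(y) - slope * mean(x)

print(round(c(lm = intercept, closed_form = intercept_closed_form), 3))
```

Both values should round to 1.567.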
|
||
Question 8 | ||
---------- | ||
You know that both the predictor and response have mean 0. What | ||
|
||
can be said about the intercept when you fit a linear regression? | ||
|
||
|
||
* It must be exactly one. | ||
|
||
* Nothing about the intercept can be said from the information given. | ||
|
||
* It must be identically 0. | ||
|
||
* It is undefined as you have to divide by zero. | ||
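
A short sketch of the reasoning, reading "mean 0" as the sample means of the data being zero: the fitted line always passes through the point of means, so the intercept estimate is

$$
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} = 0 - \hat{\beta}_1 \cdot 0 = 0
$$

whatever the value of the slope.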
|
||
Question 9 | ||
---------- | ||
Consider the data given by | ||
|
||
```R | ||
x <- c(0.8, 0.47, 0.51, 0.73, 0.36, 0.58, 0.57, 0.85, 0.44, 0.42) | ||
``` | ||
|
||
What value minimizes the sum of the squared distances between these points and itself? | ||
|
||
* 0.573 | ||
|
||
* 0.36 | ||
|
||
* 0.44 | ||
|
||
* 0.8 | ||
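
The minimizer of the sum of squared distances is the sample mean. The sketch below computes it and cross-checks numerically with `optimize()`; the names `mu_hat` and `mu_numeric` are illustrative.

```R
x <- c(0.8, 0.47, 0.51, 0.73, 0.36, 0.58, 0.57, 0.85, 0.44, 0.42)

# The least squares minimizer of sum((x - mu)^2) is the sample mean
mu_hat <- mean(x)

# Numerical cross-check: minimize the criterion over the range of the data
mu_numeric <- optimize(function(mu) sum((x - mu)^2), interval = range(x))$minimum

print(c(mean = mu_hat, numeric = mu_numeric))
```

The mean is 0.573, and the numerical minimizer should match it up to the optimizer's tolerance.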
|
||
Question 10 | ||
---------- | ||
Let the slope having fit Y as the outcome and X as the predictor be denoted as β1. Let the slope from fitting X as the outcome and Y as the predictor be denoted as γ1. Suppose that you divide β1 by γ1; in other words consider β1/γ1. What is this ratio always equal to? | ||
|
||
* 2SD(Y)/SD(X) | ||
|
||
* 1 | ||
|
||
* Var(Y)/Var(X) | ||
|
||
* Cor(Y,X) |
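
A sketch of the algebra, assuming a nonzero correlation so that the ratio is defined. Writing each slope with the standard identity:

$$
\beta_1 = \mathrm{Cor}(Y, X)\,\frac{\mathrm{SD}(Y)}{\mathrm{SD}(X)},
\qquad
\gamma_1 = \mathrm{Cor}(Y, X)\,\frac{\mathrm{SD}(X)}{\mathrm{SD}(Y)}
$$

$$
\frac{\beta_1}{\gamma_1}
= \frac{\mathrm{SD}(Y)/\mathrm{SD}(X)}{\mathrm{SD}(X)/\mathrm{SD}(Y)}
= \frac{\mathrm{SD}(Y)^2}{\mathrm{SD}(X)^2}
= \frac{\mathrm{Var}(Y)}{\mathrm{Var}(X)}
$$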