Bug/tutorial06 lessons02,03,05 (#239)

mamcisaac · web-flow · commit 2fad949d1960 · 2023-11-26T00:28:16.000-05:00
* Turning empty exercises into the multiple choice questions that were intended.

* Putting data needed for exercises into the setup chunk.

* Various bug fixes in tutorial 6 lesson 05. Most address data that was not accessible to exercises or multiple choice questions that were not implemented properly.
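
The lesson 05 question on `log(sqft)` marks the "1% more square footage, 1.44% higher price" option as correct. A minimal base R sketch of the arithmetic behind that answer key follows. It assumes the slope of 1.44 quoted in the answer options; the variable names are illustrative and do not appear in the tutorial.

```r
# Slope on log(sqft), taken from the answer options (assumed, not refit here).
slope <- 1.44

# In a log-log model, multiplying x by a factor k multiplies the predicted y
# by k^slope, so a 1% increase in x corresponds to k = 1.01.
price_ratio <- 1.01^slope

# Express the change in predicted price as a percentage.
pct_change_in_price <- (price_ratio - 1) * 100

pct_change_in_price
```

The result is close to the slope itself, which is why the coefficient can be read as an approximate percent change in price per 1% change in square footage.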
diff --git a/06-model-infer/02-lesson/06-02-lesson.Rmd b/06-model-infer/02-lesson/06-02-lesson.Rmd
@@ -44,6 +44,13 @@ manytwins_BS <- twins |>
 
 twin_lm <- lm(Foster ~ Biological, data = twins)
 
+set.seed(4747)
+boot_slope <- twins |>
+  specify(Foster ~ Biological) |>
+  generate(reps = 1000, type = "bootstrap") |>
+  calculate(stat = "slope")
+
+
 # Hash generation helpers
 # Should ideally be loaded from the imstutorials package when it exists
 is_server_context <- function(.envir) {
@@ -653,6 +660,8 @@ abs_obs_slope <- lm(Foster ~ Biological, data = twins) |>
   pull(estimate) |>
   # Take the absolute value
   abs()
+
+abs_obs_slope
 ```
 
 
@@ -706,6 +715,7 @@ perm_slope |>
     # Get prop. cases where abs. permuted slope is greater than or equal to abs. observed slope
     p_value = mean(abs_perm_slope >= abs_obs_slope)
   )
+
 ```
 
 
@@ -714,23 +724,31 @@ perm_slope |>
 ### Inference on slope
 
 
-#### What can we conclude based on the p-value associated with the twins data?
 
 
-```{r ex27, exercise=TRUE}
 
+
+```{r option-ex27, echo=FALSE}
+question("What can we conclude based on the p-value associated with the twins data?",
+  answer("If there were no association between foster and biological twin IQ (no nature) in the population, we would be extremely unlikely to have collected a sample of data like we did.", correct=TRUE, message="Right! Remember that here the p-value measures the probability of the estimated slope being at least as large (in absolute value) as the one we observed if the population slope really is 0."),
+  answer("A biological twin's IQ being higher causes a foster twin's IQ to be higher.", message="No. Remember that correlation does not prove causation."),
+  answer("Biological twins' IQs are higher than foster twins' IQs, on average.", message="No. Remember that we are looking at whether the IQs tend to vary in the same direction."),
+  answer("Given the data, the probability of biological and foster twins' IQs being unrelated is close to zero.", message="No. Remember that we are looking at the probability of these data given the null hypothesis being true, not the probability of the hypothesis given the data!"),
+  allow_retry = TRUE
+)
 ```
 
+
 ```{r ex27-hint}
 Rememeber that a p-value is the probability of seeing data this (or more) extreme, given the null hypothesis is true.
 ```
 
 
 ```{r ex27-solution}
-- If there were no association between foster and biological twin IQ (no nature) in the population, we would be extremely unlikely to have collected a sample of data like we did. 
-- A biological twin's IQ being higher causes a foster twin's IQ to be higher.
-- Biological twins' IQs are higher than foster twins' IQs, on average.
-- Given the data, the probability of biological and foster twins' IQs being unrelated is close to zero.
+- 
+- 
+-
+- 
 ```
 
 
@@ -1025,26 +1043,20 @@ boot_slope |>
 
 ### Inference from randomization and bootstrapped distributions
 
-Throughout this lesson we have investigated the slope associated with the regression of `Foster` twins on `Biological` twins.  The inference question was based on a randomization test assuming no relationship between the two types of twins (i.e., a slope of zero).  The confidence intervals investigated a research question associated with a 100% nature relationship (i.e., a slope of one).  What are the appropriate conclusions of this study?
-
-
-```{r ex211, exercise=TRUE}
-
-```
+Throughout this lesson we have investigated the slope associated with the regression of `Foster` twins on `Biological` twins.  The inference question was based on a randomization test assuming no relationship between the two types of twins (i.e., a slope of zero).  The confidence intervals investigated a research question associated with a 100% nature relationship (i.e., a slope of one).
 
 
-```{r ex211-hint}
-Remember the difference between the population slope and the estimated slope. What do the inferential procedures you've performed say about each one?
+```{r option-ex211, echo=FALSE}
+question("What are the appropriate conclusions of this study?",
+  answer("Zero is not a plausible value for the population slope, one is a plausible value for the population slope.", correct=TRUE, message="Right! Remember that any value in the confidence interval is a 'plausible value' for the population slope. Any value outside of the confidence interval is 'not plausible'."),
+  answer("Zero is not a plausible value for the estimated slope, one is a plausible value for the estimated slope.", message="No. Remember that we know the exact value of the 'estimated slope'. The challenge lies in using this estimated slope to discover 'plausible values' for the population slope!"),
+  answer("Zero is a plausible value for the population slope, one is not a plausible value for the population slope.", message="No. Remember that any value in the confidence interval is a 'plausible value' for the population slope. Any value outside of the confidence interval is 'not plausible'."),
+  answer("Zero is a plausible value for the estimated slope, one is not a plausible value for the estimated slope.", message="No. Remember that we know the exact value of the 'estimated slope'. The challenge lies in using this estimated slope to discover 'plausible values' for the population slope!"),
+  allow_retry = TRUE
+)
 ```
 
 
-```{r ex211-solution}
-- Zero is not a plausible value for the population slope, one is a plausible value for the population slope.
-- Zero is not a plausible value for the estimated slope, one is a plausible value for the estimated slope.
-- Zero is a plausible value for the population slope, one is not a plausible value for the population slope.
-- Zero is a plausible value for the estimated slope, one not is a plausible value for the estimated slope.
-```
-
 
 
 ## Congratulations!
diff --git a/06-model-infer/03-lesson/06-03-lesson.Rmd b/06-model-infer/03-lesson/06-03-lesson.Rmd
@@ -82,6 +82,64 @@ starbucks3 <- starbucks |>
 starbucks_many <- starbucks |>
   rep_sample_n(size=20, replace=FALSE, reps=50)
 
+
+set.seed(4747)
+twins_perm <- twins |> rep_sample_n(size=nrow(twins), reps=500) |>
+  group_by(replicate) |>
+  mutate(Biological_perm = sample(Biological)) |>
+  do(tidy(lm(Foster ~ Biological_perm, data=.)))
+
+biological_perm <- twins_perm |>
+  filter(term == "Biological_perm")
+
+model <- lm(Foster ~ Biological, data = twins)
+
+biological_term <- model |> 
+  tidy() |> 
+  filter(term == "Biological")
+
+# Make sure all columns print
+options(tibble.width = Inf)
+
+
+set.seed(4747)
+perm_slope <- twins |>
+  specify(Foster ~ Biological) |>
+  hypothesize(null = "independence") |>
+  generate(reps = 1000, type = "permute") |>
+  calculate(stat = "slope")
+
+obs_slope <- lm(Foster ~ Biological, data = twins) |>
+  tidy() |>
+  filter(term == "Biological") |>
+  select(estimate) |>
+  pull()
+
+
+set.seed(4747)
+boot_slope <- twins |>
+  specify(Foster ~ Biological) |>
+  generate(reps = 1000, type = "bootstrap") |>
+  calculate(stat = "slope")
+
+alpha <- 0.05
+degrees_of_freedom <- nrow(twins) - 2
+confidence_level <- 1 - alpha
+p_upper <- 1 - alpha / 2
+critical_value <- qt(p_upper, df = degrees_of_freedom)
+model <- lm(Foster ~ Biological, data = twins)
+new_twins <- data.frame(Biological = seq(70, 130, 15))
+augmented_model <- model |> augment(newdata = new_twins, se_fit = TRUE)
+
+predictions <- augmented_model |>
+  # Calculate a confidence interval on the predicted values
+  mutate(
+    lower_mean_prediction = .fitted - critical_value * .se.fit,
+    upper_mean_prediction = .fitted + critical_value * .se.fit
+  )
+
+
+
 # Hash generation helpers
 # Should ideally be loaded from the imstutorials package when it exists
 is_server_context <- function(.envir) {
@@ -319,17 +377,6 @@ question("Why does it matter if the sampling distribution is accurate?",
 Using the permuted datasets (recall, the randomization forces the null hypothesis to be true), investigate the distribution of the standardized slope statistics (the slope, which has been divided by the standard error). Note that the distribution of the standardized slope is well described by a t-distribution.
 
 
-```{r echo=TRUE, message=FALSE, eval=FALSE}
-set.seed(4747)
-twins_perm <- twins |> rep_sample_n(size=nrow(twins), reps=500) |>
-  group_by(replicate) |>
-  mutate(Biological_perm = sample(Biological)) |>
-  do(tidy(lm(Foster ~ Biological_perm, data=.)))
-```
-
-
-
-
 - Filter `twins_perm` for rows where `term` equals `"Biological_perm"`.
 - Calculated the degrees of freedom of `twins`, as the number of its rows minus two.
 
@@ -497,16 +544,6 @@ The biological term from the model is available as `biological_term`.
 
 
 
-```{r echo=TRUE, eval=FALSE}
-model <- lm(Foster ~ Biological, data = twins)
-biological_term <- model |> 
-  tidy() |> 
-  filter(term == "Biological")
-
-# Make sure all columns print
-options(tibble.width = Inf)
-```
-
 
 ```{r ex35, exercise=TRUE}
 # Calculate the degrees of freedom of twins
@@ -802,6 +839,8 @@ p_upper <- 1 - alpha / 2
 
 # Find the critical value from the t-distribution
 critical_value <- qt(p_upper, df = degrees_of_freedom)
+
+critical_value
 ```
 
 
@@ -917,6 +956,9 @@ p_lower <- alpha / 2
 
 # ... and the upper one
 p_upper <- 1 - alpha / 2
+
+p_lower
+p_upper
 ```
 
 
diff --git a/06-model-infer/05-lesson/06-05-lesson.Rmd b/06-model-infer/05-lesson/06-05-lesson.Rmd
@@ -29,6 +29,12 @@ change <- change |>
   mutate(Amount = Dollars) |>
   dplyr::select(-Dollars)
 
+
+LAhomes <- LAhomes |>
+   filter(bed > 0)
+
+
+
 # Hash generation helpers
 # Should ideally be loaded from the imstutorials package when it exists
 is_server_context <- function(.envir) {
@@ -260,26 +266,22 @@ You will need to run the linear model before answering the question:
 
 ```
 
-Consider data collected by Andrew Bray at Reed College on characteristics of LA Homes in 2010. The model is given below, and your task is to provide the appropriate interpretation of the coefficient on `log(sqft)`?  
+Consider data collected by Andrew Bray at Reed College on characteristics of LA Homes in 2010. The model is given below, and your task is to provide the appropriate interpretation of the coefficient on `log(sqft)`.
 
 Note: you must be careful to avoid causative interpretations. Additional square footage does not necessarily cause the price of a specific house to go up. The interpretation of the coefficient describes the estimate of the average price of homes at a given square footage.
 
 
-```{r ex, exercise=TRUE}
-
-```
-
-```{r ex-hint}
-Remember that a log-log model has a special kind of interpretation.
+```{r option-ex, echo=FALSE}
+question("What is the appropriate interpretation of the coefficient on `log(sqft)`?",
+  answer("Each additional square foot of house size produces an estimate of the average price which is $1.44 more.", message="No, the interpretation of a log-log model is different from that of a linear model."),
+  answer("Each additional square foot of house size produces an estimate of the average price which is $1,442 more.", message="No. Try again."),
+  answer("Each additional square foot of house size produces an estimate of the average price which is 1.44% higher.", message="No, remember that both Y and X have been log transformed."),
+  answer("Each additional 1% of square footage produces an estimate of the average price which is $1.44 more.", message="No, remember that both Y and X have been log transformed."),
+  answer("Each additional 1% of square footage produces an estimate of the average price which is 1.44% higher.", correct = TRUE, message="Right! After the log-log transformation, our slope estimates the percent change in Y for each 1% change in X."),
+  allow_retry = TRUE
+)
 ```
 
-```{r ex-solution}
-- Each additional square foot of house size produces an estimate of the average price which is $1.44 more.
-- Each additional square foot of house size produces an estimate of the average price which is $1,442 more.
-- Each additional square foot of house size produces an estimate of the average price which is 1.44% higher.
-- Each additional 1% of square footage produces an estimate of the average price which is $1.44 more.
-- Each additional 1% of square footage produces an estimate of the average price which is 1.44% higher.
-```
 
 
 
@@ -630,27 +632,25 @@ lm(Price ~ Service + Food + Decor, data = restNYC) |> tidy()
 ### Interpreting coefficients
 
 
-What is the correct interpretation of the coefficient on `Service` in the linear model which regresses `Price` on `Service`, `Food`, and `Decor`?
-
 
 You will need to run the linear model before answering the question:
 `lm(Price ~ Service + Food + Decor, data=restNYC) |> tidy()`
 
 
-```{r ex56, exercise=TRUE}
-
-```
 
-```{r ex56-hint}
-You have to consider the interpretation of a single variable while holding the other variables constant.
-```
-
-
-```{r ex56-solution}
-- For every one unit increase in `Service`, the predicted average `Price` is expected to increase by 0.135.
-- For every one unit increase in `Service`, the predicted average `Price` is expected to increase by 0.135, given fixed values of `Food` and `Decor`.
-- For every one unit increase in `Service`, the predicted average `Price` is expected to increase by 0.135, for any possible value of `Food` and `Decor`.
-- Given that `Food` and `Decor` are in the model, `Service` is not significant, and we cannot know whether it has effect on modeling `Price`.
+```{r ex56, exercise=TRUE}
+lm(Price ~ Service + Food + Decor, data=restNYC) |> tidy()
+```
+
+```{r option-ex56, echo=FALSE}
+question(
+  "What is the correct interpretation of the coefficient on `Service` in the linear model which regresses `Price` on `Service`, `Food`, and `Decor`?",
+  answer("For every one unit increase in `Service`, the predicted average `Price` is expected to increase by 0.135.", message="No. You have to consider the interpretation of a single variable while holding the other variables constant."),
+  answer("For every one unit increase in `Service`, the predicted average `Price` is expected to increase by 0.135, given fixed values of `Food` and `Decor`.", message="This interpretation of the point estimate is correct, but consider the large amount of uncertainty in this estimate. There is a better answer."),
+  answer("For every one unit increase in `Service`, the predicted average `Price` is expected to increase by 0.135, for any possible value of `Food` and `Decor`.", message="No. You have to consider the interpretation of a single variable while holding the other variables constant. Remember that `Food`, `Decor`, and `Service` are all positively correlated: if you find a restaurant with better `Service`, it also likely has better `Food` and better `Decor`, so a fair comparison requires holding those other variables constant."),
+  answer("Given that `Food` and `Decor` are in the model, `Service` is not significant, and we cannot know whether it has an effect on modeling `Price`.", correct=TRUE, message="Right!"),
+  allow_retry = TRUE
+)
 ```