
Commit 2fad949

Bug/tutorial06 lessons02,03,05 (#239)
* Turning empty exercises into the multiple choice questions that were intended.
* Putting data needed for exercises into the setup chunk.
* Various bug fixes in tutorial 6 lesson 05. Most related to data not being accessible to exercises or multiple choice questions not being implemented properly.
1 parent 5226c73 commit 2fad949

3 files changed

Lines changed: 125 additions & 71 deletions


06-model-infer/02-lesson/06-02-lesson.Rmd

Lines changed: 33 additions & 21 deletions
Original file line number | Diff line number | Diff line change
@@ -44,6 +44,13 @@ manytwins_BS <- twins |>
4444
4545
twin_lm <- lm(Foster ~ Biological, data = twins)
4646
47+
set.seed(4747)
48+
boot_slope <- twins |>
49+
specify(Foster ~ Biological) |>
50+
generate(reps = 1000, type = "bootstrap") |>
51+
calculate(stat = "slope")
52+
53+
4754
# Hash generation helpers
4855
# Should ideally be loaded from the imstutorials package when it exists
4956
is_server_context <- function(.envir) {
@@ -653,6 +660,8 @@ abs_obs_slope <- lm(Foster ~ Biological, data = twins) |>
653660
pull(estimate) |>
654661
# Take the absolute value
655662
abs()
663+
664+
abs_obs_slope
656665
```
657666

658667

@@ -706,6 +715,7 @@ perm_slope |>
706715
# Get prop. cases where abs. permuted slope is greater than or equal to abs. observed slope
707716
p_value = mean(abs_perm_slope >= abs_obs_slope)
708717
)
718+
709719
```
710720

711721

@@ -714,23 +724,31 @@ perm_slope |>
714724
### Inference on slope
715725

716726

717-
#### What can we conclude based on the p-value associated with the twins data?
718727

719728

720-
```{r ex27, exercise=TRUE}
721729

730+
731+
```{r option-ex27, echo=FALSE}
732+
question("What can we conclude based on the p-value associated with the twins data?",
733+
answer("If there were no association between foster and biological twin IQ (no nature) in the population, we would be extremely unlikely to have collected a sample of data like we did. ", correct=TRUE, message="Right! Remember that here the p-value is measuring the probability of estimated slope being as large as what we observed if the population slope really is 0."),
734+
answer("A biological twin's IQ being higher causes a foster twin's IQ to be higher.", message="No. Remember that correlation does not prove causation"),
735+
answer("Biological twins' IQs are higher than foster twins' IQs, on average.", message="No. Remember that we are looking at whether the IQs tend to vary in the same direction."),
736+
answer("Given the data, the probability of biological and foster twins' IQs being unrelated is close to zero.", message="No. Remember that we are looking at the probability of these data given the null hypothesis being true, not the probability of the hypothesis given the data!"),
737+
allow_retry = TRUE
738+
)
722739
```
723740

741+
724742
```{r ex27-hint}
725743
Remember that a p-value is the probability of seeing data this (or more) extreme, given the null hypothesis is true.
726744
```
727745

728746

729747
```{r ex27-solution}
730-
- If there were no association between foster and biological twin IQ (no nature) in the population, we would be extremely unlikely to have collected a sample of data like we did.
731-
- A biological twin's IQ being higher causes a foster twin's IQ to be higher.
732-
- Biological twins' IQs are higher than foster twins' IQs, on average.
733-
- Given the data, the probability of biological and foster twins' IQs being unrelated is close to zero.
748+
-
749+
-
750+
-
751+
-
734752
```
735753

736754

@@ -1025,26 +1043,20 @@ boot_slope |>
10251043

10261044
### Inference from randomization and bootstrapped distributions
10271045

1028-
Throughout this lesson we have investigated the slope associated with the regression of `Foster` twins on `Biological` twins. The inference question was based on a randomization test assuming no relationship between the two types of twins (i.e., a slope of zero). The confidence intervals investigated a research question associated with a 100% nature relationship (i.e., a slope of one). What are the appropriate conclusions of this study?
1029-
1030-
1031-
```{r ex211, exercise=TRUE}
1032-
1033-
```
1046+
Throughout this lesson we have investigated the slope associated with the regression of `Foster` twins on `Biological` twins. The inference question was based on a randomization test assuming no relationship between the two types of twins (i.e., a slope of zero). The confidence intervals investigated a research question associated with a 100% nature relationship (i.e., a slope of one).
10341047

10351048

1036-
```{r ex211-hint}
1037-
Remember the difference between the population slope and the estimated slope. What do the inferential procedures you've performed say about each one?
1049+
```{r option-ex211, echo=FALSE}
1050+
question("What are the appropriate conclusions of this study?",
1051+
answer("Zero is not a plausible value for the population slope, one is a plausible value for the population slope.", correct=TRUE, message="Right! Remember that any value in the confidence interval is a 'plausible value' for the population slope. Any value outside of the confidence interval is 'not plausible'."),
1052+
answer("Zero is not a plausible value for the estimated slope, one is a plausible value for the estimated slope.", message="No. Remember that we know the exact value of the 'estimated slope'. The challenge lies in using this estimated slope to discover 'plausible values' for the population slope!"),
1053+
answer("Zero is a plausible value for the population slope, one is not a plausible value for the population slope.", message="No. Remember that any value in the confidence interval is a 'plausible value' for the population slope. Any value outside of the confidence interval is 'not plausible'."),
1054+
answer("Zero is a plausible value for the estimated slope, one is not a plausible value for the estimated slope.", message="No. Remember that we know the exact value of the 'estimated slope'. The challenge lies in using this estimated slope to discover 'plausible values' for the population slope!"),
1055+
allow_retry = TRUE
1056+
)
10381057
```
10391058

10401059

1041-
```{r ex211-solution}
1042-
- Zero is not a plausible value for the population slope, one is a plausible value for the population slope.
1043-
- Zero is not a plausible value for the estimated slope, one is a plausible value for the estimated slope.
1044-
- Zero is a plausible value for the population slope, one is not a plausible value for the population slope.
1045-
- Zero is a plausible value for the estimated slope, one is not a plausible value for the estimated slope.
1046-
```
1047-
10481060

10491061

10501062
## Congratulations!
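
The two multiple choice questions added to this lesson rest on a randomization test (null slope of zero) and a bootstrap interval (is a slope of one plausible?). The following is a minimal base R sketch of both ideas, not the tutorial's own code: the lesson uses the `infer` pipeline shown in the setup chunk, while this version uses `replicate()` and a simulated stand-in for `twins`. The data frame `sim_twins`, the helper `slope_of()`, and all simulated values are assumptions made for illustration.

```r
# Simulated stand-in for the twins data (hypothetical values, not the real dataset)
set.seed(4747)
n <- 27
biological <- rnorm(n, mean = 95, sd = 15)
foster <- 9 + 0.9 * biological + rnorm(n, sd = 8)
sim_twins <- data.frame(Biological = biological, Foster = foster)

# Hypothetical helper: fitted slope of Foster on Biological
slope_of <- function(d) unname(coef(lm(Foster ~ Biological, data = d))[2])

obs_slope <- slope_of(sim_twins)

# Randomization: shuffle the explanatory variable so the null (slope = 0) holds
perm_slopes <- replicate(1000, {
  shuffled <- transform(sim_twins, Biological = sample(Biological))
  slope_of(shuffled)
})

# Two-sided p-value: share of permuted slopes at least as extreme as the observed one
p_value <- mean(abs(perm_slopes) >= abs(obs_slope))

# Bootstrap: resample rows with replacement and take a 95% percentile interval
boot_slopes <- replicate(1000, {
  rows <- sample(nrow(sim_twins), replace = TRUE)
  slope_of(sim_twins[rows, ])
})
ci <- quantile(boot_slopes, c(0.025, 0.975))

p_value
ci
```

A value is treated as plausible for the population slope when it falls inside `ci`. The estimated slope `obs_slope` is known exactly, which is why the answer options about the "estimated slope" are marked incorrect.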

06-model-infer/03-lesson/06-03-lesson.Rmd

Lines changed: 63 additions & 21 deletions
Original file line number | Diff line number | Diff line change
@@ -82,6 +82,64 @@ starbucks3 <- starbucks |>
8282
starbucks_many <- starbucks |>
8383
rep_sample_n(size=20, replace=FALSE, reps=50)
8484
85+
86+
set.seed(4747)
87+
twins_perm <- twins |> rep_sample_n(size=nrow(twins), reps=500) |>
88+
group_by(replicate) |>
89+
mutate(Biological_perm = sample(Biological)) |>
90+
do(tidy(lm(Foster ~ Biological_perm, data=.)))
91+
92+
biological_perm <- twins_perm |>
93+
filter(term == "Biological_perm")
94+
95+
model <- lm(Foster ~ Biological, data = twins)
96+
97+
biological_term <- model |>
98+
tidy() |>
99+
filter(term == "Biological")
100+
101+
# Make sure all columns print
102+
options(tibble.width = Inf)
103+
104+
105+
set.seed(4747)
106+
perm_slope <- twins |>
107+
specify(Foster ~ Biological) |>
108+
hypothesize(null = "independence") |>
109+
generate(reps = 1000, type = "permute") |>
110+
calculate(stat = "slope")
111+
112+
obs_slope <- lm(Foster ~ Biological, data = twins) |>
113+
tidy() |>
114+
filter(term == "Biological") |>
115+
select(estimate) |>
116+
pull()
117+
118+
119+
set.seed(4747)
120+
boot_slope <- twins |>
121+
specify(Foster ~ Biological) |>
122+
generate(reps = 1000, type = "bootstrap") |>
123+
calculate(stat = "slope")
124+
125+
alpha <- 0.05
126+
degrees_of_freedom <- nrow(twins) - 2
127+
confidence_level <- 1 - alpha
128+
p_upper <- 1 - alpha / 2
129+
critical_value <- qt(p_upper, df = degrees_of_freedom)
130+
model <- lm(Foster ~ Biological, data = twins)
131+
new_twins <- data.frame(Biological = seq(70, 130, 15))
132+
augmented_model <- model |> augment(newdata = new_twins, se_fit = TRUE)
133+
134+
predictions <- augmented_model |>
135+
# Calculate a confidence interval on the predicted values
136+
mutate(
137+
lower_mean_prediction = .fitted - critical_value * .se.fit,
138+
upper_mean_prediction = .fitted + critical_value * .se.fit
139+
)
140+
141+
142+
85143
# Hash generation helpers
86144
# Should ideally be loaded from the imstutorials package when it exists
87145
is_server_context <- function(.envir) {
@@ -319,17 +377,6 @@ question("Why does it matter if the sampling distribution is accurate?",
319377
Using the permuted datasets (recall, the randomization forces the null hypothesis to be true), investigate the distribution of the standardized slope statistics (the slope, which has been divided by the standard error). Note that the distribution of the standardized slope is well described by a t-distribution.
320378

321379

322-
```{r echo=TRUE, message=FALSE, eval=FALSE}
323-
set.seed(4747)
324-
twins_perm <- twins |> rep_sample_n(size=nrow(twins), reps=500) |>
325-
group_by(replicate) |>
326-
mutate(Biological_perm = sample(Biological)) |>
327-
do(tidy(lm(Foster ~ Biological_perm, data=.)))
328-
```
329-
330-
331-
332-
333380
- Filter `twins_perm` for rows where `term` equals `"Biological_perm"`.
334381
- Calculate the degrees of freedom of `twins`, as the number of its rows minus two.
335382

@@ -497,16 +544,6 @@ The biological term from the model is available as `biological_term`.
497544

498545

499546

500-
```{r echo=TRUE, eval=FALSE}
501-
model <- lm(Foster ~ Biological, data = twins)
502-
biological_term <- model |>
503-
tidy() |>
504-
filter(term == "Biological")
505-
506-
# Make sure all columns print
507-
options(tibble.width = Inf)
508-
```
509-
510547

511548
```{r ex35, exercise=TRUE}
512549
# Calculate the degrees of freedom of twins
@@ -802,6 +839,8 @@ p_upper <- 1 - alpha / 2
802839
803840
# Find the critical value from the t-distribution
804841
critical_value <- qt(p_upper, df = degrees_of_freedom)
842+
843+
critical_value
805844
```
806845

807846
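
The critical value printed in this hunk is the same quantity the new setup chunk uses to build intervals around mean predictions. As a rough check of that arithmetic, the sketch below computes the interval by hand and compares it with `predict(..., interval = "confidence")` from base R. It uses `predict()` rather than the `augment()` call in the lesson, and `sim_twins` is simulated, hypothetical data standing in for `twins`.

```r
# Simulated stand-in for the twins data (hypothetical values)
set.seed(4747)
sim_twins <- data.frame(Biological = rnorm(27, mean = 95, sd = 15))
sim_twins$Foster <- 9 + 0.9 * sim_twins$Biological + rnorm(27, sd = 8)

model <- lm(Foster ~ Biological, data = sim_twins)

# Same quantities as in the lesson's setup chunk
alpha <- 0.05
degrees_of_freedom <- nrow(sim_twins) - 2
critical_value <- qt(1 - alpha / 2, df = degrees_of_freedom)

# Fitted values and their standard errors at new x values
new_twins <- data.frame(Biological = seq(70, 130, 15))
pred <- predict(model, newdata = new_twins, se.fit = TRUE)

# Interval built by hand: fit +/- critical value * standard error
manual <- data.frame(
  fit   = pred$fit,
  lower = pred$fit - critical_value * pred$se.fit,
  upper = pred$fit + critical_value * pred$se.fit
)

# Interval computed by predict() itself, for comparison
built_in <- predict(model, newdata = new_twins,
                    interval = "confidence", level = 1 - alpha)

same <- all.equal(unname(as.matrix(manual)),
                  unname(built_in[, c("fit", "lwr", "upr")]))
same
```

The two agree because `predict.lm()` uses the residual degrees of freedom, which for a single predictor is the number of rows minus two.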

@@ -917,6 +956,9 @@ p_lower <- alpha / 2
917956
918957
# ... and the upper one
919958
p_upper <- 1 - alpha / 2
959+
960+
p_lower
961+
p_upper
920962
```
921963

922964

06-model-infer/05-lesson/06-05-lesson.Rmd

Lines changed: 29 additions & 29 deletions
Original file line number | Diff line number | Diff line change
@@ -29,6 +29,12 @@ change <- change |>
2929
mutate(Amount = Dollars) |>
3030
dplyr::select(-Dollars)
3131
32+
33+
LAhomes <- LAhomes |>
34+
filter(bed > 0)
35+
36+
37+
3238
# Hash generation helpers
3339
# Should ideally be loaded from the imstutorials package when it exists
3440
is_server_context <- function(.envir) {
@@ -260,26 +266,22 @@ You will need to run the linear model before answering the question:
260266
261267
```
262268

263-
Consider data collected by Andrew Bray at Reed College on characteristics of LA Homes in 2010. The model is given below, and your task is to provide the appropriate interpretation of the coefficient on `log(sqft)`?
269+
Consider data collected by Andrew Bray at Reed College on characteristics of LA Homes in 2010. The model is given below, and your task is to provide the appropriate interpretation of the coefficient on `log(sqft)`.
264270

265271
Note: you must be careful to avoid causative interpretations. Additional square footage does not necessarily cause the price of a specific house to go up. The interpretation of the coefficient describes the estimate of the average price of homes at a given square footage.
266272

267273

268-
```{r ex, exercise=TRUE}
269-
270-
```
271-
272-
```{r ex-hint}
273-
Remember that a log-log model has a special kind of interpretation.
274+
```{r option-ex, echo=FALSE}
275+
question("What is the appropriate interpretation of the coefficient on `log(sqft)`?",
276+
answer("Each additional square foot of house size produces an estimate of the average price which is $1.44 more.", message="No, the interpretation of a log-log model is different than a linear model."),
277+
answer("Each additional square foot of house size produces an estimate of the average price which is $1,442 more.", message="No. Try again."),
278+
answer("Each additional square foot of house size produces an estimate of the average price which is 1.44% higher.", message="No, remember that both Y and X have been log transformed."),
279+
answer("Each additional 1% of square footage produces an estimate of the average price which is $1.44 more.", message="No, remember that both Y and X have been log transformed."),
280+
answer("Each additional 1% of square footage produces an estimate of the average price which is 1.44% higher.", correct = TRUE, message="Right! After the log-log transformation, our slope estimates the percent change in Y for each 1% change in X"),
281+
allow_retry = TRUE
282+
)
274283
```
275284

276-
```{r ex-solution}
277-
- Each additional square foot of house size produces an estimate of the average price which is $1.44 more.
278-
- Each additional square foot of house size produces an estimate of the average price which is $1,442 more.
279-
- Each additional square foot of house size produces an estimate of the average price which is 1.44% higher.
280-
- Each additional 1% of square footage produces an estimate of the average price which is $1.44 more.
281-
- Each additional 1% of square footage produces an estimate of the average price which is 1.44% higher.
282-
```
283285

284286

285287
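
The correct answer in the question above relies on the log-log interpretation: the slope approximates the percent change in Y for each 1% change in X. The sketch below illustrates this on simulated data, not on `LAhomes`. The data frame `homes`, the intercept, and the noise level are hypothetical; only the slope of 1.44 is taken from the answer options.

```r
# Simulated homes with a true log-log slope of 1.44 (hypothetical data)
set.seed(4747)
sqft <- exp(runif(200, log(500), log(5000)))
price <- exp(2 + 1.44 * log(sqft) + rnorm(200, sd = 0.3))
homes <- data.frame(sqft = sqft, price = price)

fit <- lm(log(price) ~ log(sqft), data = homes)
b <- unname(coef(fit)["log(sqft)"])

# Exact effect of a 1% increase in sqft on the predicted price:
# the price is multiplied by 1.01^b, which is a (1.01^b - 1) * 100 percent change
pct_change <- (1.01^b - 1) * 100

c(slope = b, pct_change = pct_change)
```

For slopes of this size the exact percent change and the slope itself differ only in the third decimal place, which is why the slope is read directly as a percentage.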

@@ -630,27 +632,25 @@ lm(Price ~ Service + Food + Decor, data = restNYC) |> tidy()
630632
### Interpreting coefficients
631633

632634

633-
What is the correct interpretation of the coefficient on `Service` in the linear model which regresses `Price` on `Service`, `Food`, and `Decor`?
634-
635635

636636
You will need to run the linear model before answering the question:
637637
`lm(Price ~ Service + Food + Decor, data=restNYC) |> tidy()`
638638

639639

640-
```{r ex56, exercise=TRUE}
641-
642-
```
643640

644-
```{r ex56-hint}
645-
You have to consider the interpretation of a single variable while holding the other variables constant.
646-
```
647-
648-
649-
```{r ex56-solution}
650-
- For every one unit increase in `Service`, the predicted average `Price` is expected to increase by 0.135.
651-
- For every one unit increase in `Service`, the predicted average `Price` is expected to increase by 0.135, given fixed values of `Food` and `Decor`.
652-
- For every one unit increase in `Service`, the predicted average `Price` is expected to increase by 0.135, for any possible value of `Food` and `Decor`.
653-
- Given that `Food` and `Decor` are in the model, `Service` is not significant, and we cannot know whether it has effect on modeling `Price`.
641+
```{r ex56, exercise=TRUE}
642+
lm(Price ~ Service + Food + Decor, data=restNYC) |> tidy()
643+
```
644+
645+
```{r option-ex56, echo=FALSE}
646+
question("What is the correct interpretation of the coefficient on `Service` in the linear model which regresses `Price` on `Service`, `Food`, and `Decor`?",
648+
answer("For every one unit increase in `Service`, the predicted average `Price` is expected to increase by 0.135.", message="No. You have to consider the interpretation of a single variable while holding the other variables constant."),
649+
answer("For every one unit increase in `Service`, the predicted average `Price` is expected to increase by 0.135, given fixed values of `Food` and `Decor`.", message="This interpretation of the point estimate is correct, but consider the large amount of uncertainty in this estimate. There is a better answer."),
650+
answer("For every one unit increase in `Service`, the predicted average `Price` is expected to increase by 0.135, for any possible value of `Food` and `Decor`.", message="No. You have to consider the interpretation of a single variable while holding the other variables constant. Remember that `Food`, `Decor`, and `Service` are all positively correlated: if you find a restaurant with better `Service`, it also likely has better `Food` and better `Decor`, so a fair comparison requires holding those other variables constant."),
651+
answer("Given that `Food` and `Decor` are in the model, `Service` is not significant, and we cannot know whether it has effect on modeling `Price`", correct=TRUE, message="Right!"),
652+
allow_retry = TRUE
653+
)
654654
```
655655

656656
