Commit 171574a

typos, formatting, rewording for clarification (tutorial 5) (#266)
* fixing typos, editing for clarity and minor formatting - tutorial 5 lessons 1-8
* formatting math symbols - tutorial 5 lessons 1-2
1 parent 5865ff6 commit 171574a

8 files changed

Lines changed: 147 additions & 131 deletions

File tree

05-infer/01-lesson/05-01-lesson.Rmd

Lines changed: 26 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -235,7 +235,7 @@ p_hat_happy
235235

236236
We learn that around 77% of our sample is "happy". If this were a simple random sample from the American population, this would be a good estimate of the percent of all Americans that are very happy, but it's not a sure thing since we only asked a small proportion of them.
237237

238-
## How is does the GSS select respondents?
238+
## How does the GSS select respondents?
239239

240240
Remember that not all randomly sampled data is the same -- a sample can be a simple random sample, a stratified sample, a cluster sample, or something more complex. The respondents to the GSS come from a complex survey design and appropriate inference for these data would account for this survey design.
241241

@@ -342,7 +342,7 @@ knitr::include_graphics("images/boot-11.png")
342342

343343
To implement this, we start with our gss2016 data and then *specify* that we will focus on the happy column. Next we *generate* 500 replicate data sets through bootstrapping and for each one *calculate* the proportion that are "happy".
344344

345-
When we print this new object, we see we now have a data frame that contains 500 p-hats.
345+
When we print this new object, we see we now have a data frame that contains 500 $\hat{p}\text{s}$ (p-hats).
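
As a rough illustration of what the specify / generate / calculate pipeline does conceptually, here is a minimal base-R sketch. It is not how infer implements bootstrapping; the `happy` vector, its labels, its counts, and the name `boot_dist_sketch` are all hypothetical stand-ins for the real `gss2016` data.

```r
# Base-R sketch of bootstrapping a proportion (an illustration, not infer's code).
set.seed(2016)  # arbitrary seed so the sketch is reproducible

# Hypothetical stand-in for the `happy` column: 116 "happy", 35 "not happy".
happy <- c(rep("happy", 116), rep("not happy", 35))

# Resample with replacement 500 times and record the proportion "happy" each time.
boot_stats <- replicate(500, mean(sample(happy, replace = TRUE) == "happy"))

# One row per replicate, mirroring the "replicate" and "stat" columns described here.
boot_dist_sketch <- data.frame(
  replicate = seq_along(boot_stats),
  stat = boot_stats
)

head(boot_dist_sketch)
```

Each row holds one bootstrap $\hat{p}$, so the data frame has 500 rows, one per replicate.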
346346

347347
```{r echo = TRUE}
348348
boot_dist_happy <- gss2016 |>
@@ -512,7 +512,7 @@ boot_1_consci |>
512512

513513
### Constructing the CI
514514

515-
You've seen one example of how p-hat can vary upon resampling, but we need to do this many many times to get a good estimate of its variability. Here you will compute a full bootstrap distribution to estimate the standard error (SE) that will be used to form a confidence interval. You'll use an additional verb from infer, `calculate()`, to streamline this process of calculating many statistics from many data sets.
515+
You've seen one example of how $\hat{p}$ can vary upon resampling, but we need to do this many many times to get a good estimate of its variability. Here you will compute a full bootstrap distribution to estimate the standard error (SE) that will be used to form a confidence interval. You'll use an additional verb from infer, `calculate()`, to streamline this process of calculating many statistics from many data sets.
516516

517517
Take a moment to inspect the output of calculate. This function reduces your data frame to just two columns: one for the "stat"s and another for the "replicate" they correspond to.
518518

@@ -665,7 +665,7 @@ Let's look deeper into this by starting with the confidence interval that we've
665665

666666
### Happiness in 2016
667667

668-
The data from which this interval was constructed is from 2016, and we can plot both p-hat and the resulting interval on a number line here. To understand what is meant by confident, we need to consider how this interval fits into the big picture.
668+
The data from which this interval was constructed is from 2016, and we can plot both $\hat{p}$ and the resulting interval on a number line here. To understand what is meant by confident, we need to consider how this interval fits into the big picture.
669669

670670
```{r echo=FALSE, out.width = "60%"}
671671
knitr::include_graphics("images/confidence-interval.png")
@@ -713,7 +713,7 @@ knitr::include_graphics("images/p-05.png")
713713

714714
###
715715

716-
of the same size from that population and come up with a new p-hat and a new interval. It wouldn't be the same as our first, but it'd likely be similar.
716+
of the same size from that population and come up with a new $\hat{p}$ and a new interval. It wouldn't be the same as our first, but it'd likely be similar.
717717

718718
```{r echo=FALSE, out.width = "60%"}
719719
knitr::include_graphics("images/p-06.png")
@@ -729,7 +729,7 @@ knitr::include_graphics("images/p-07.png")
729729

730730
###
731731

732-
a new p-hat and a new interval.
732+
a new $\hat{p}$ and a new interval.
733733

734734
```{r echo=FALSE, out.width = "60%"}
735735
knitr::include_graphics("images/p-08.png")
@@ -853,7 +853,7 @@ In the following exercises you'll get the chance to explore these factors and ho
853853

854854
We learned that for a 95% confidence interval (a confidence level of .95), if we were to take many samples of the same size and compute many intervals, we would expect 95% of the resulting intervals to contain the parameter. Based on the set of confidence intervals plotted here, what is your best guess at the confidence level used in these intervals?
855855

856-
The population proportion is represented by the p in the cloud and the dotted line and each confidence interval is represented by a segment that extends out from it's p-hat. Intervals that capture the true value are in green; those that miss it are in red.
856+
The population proportion is represented by the p in the cloud and the dotted line and each confidence interval is represented by a segment that extends out from its $\hat{p}$. Intervals that capture the true value are in green; those that miss it are in red.
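
To make the "many samples, many intervals" idea concrete, the following sketch simulates it. The true proportion `p`, the sample size `n`, and the number of samples are hypothetical values chosen only for illustration, and the interval used is the simple $\hat{p} \pm 2 \times SE$ form, which is an assumption about how the plotted intervals were built.

```r
# Simulate repeated sampling and count how often the interval captures p.
set.seed(95)      # arbitrary seed for reproducibility
p <- 0.77         # hypothetical true population proportion
n <- 150          # hypothetical sample size
n_samples <- 1000 # number of repeated samples

# One p-hat per simulated sample.
p_hats <- rbinom(n_samples, size = n, prob = p) / n

# Approximate standard error for each sample, then a +/- 2 SE interval.
se <- sqrt(p_hats * (1 - p_hats) / n)
captured <- (p_hats - 2 * se <= p) & (p <= p_hats + 2 * se)

# Share of intervals that contain the true p (the "green" intervals).
coverage <- mean(captured)
coverage
```

If the intervals were built at roughly the 95% level, `coverage` should land near 0.95; a plot with many more red intervals suggests a lower confidence level.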
857857

858858
```{r echo=FALSE, out.width = "60%"}
859859
knitr::include_graphics("images/gssMany-happy-nocapture_65.png")
@@ -1036,18 +1036,24 @@ boot_dist_smaller_n <- gss2016_smaller |>
10361036
SE_smaller_n <- ___ |>
10371037
___ |>
10381038
___
1039+
1040+
SE_smaller_n
10391041
```
10401042

1041-
```{r ex12-hint, exercise=TRUE}
1043+
```{r ex12-hint}
10421044
SE_smaller_n <- boot_dist_smaller_n |>
10431045
summarize(se = sd(stat)) |>
10441046
___
1047+
1048+
SE_smaller_n
10451049
```
10461050

1047-
```{r ex12-solution, exercise=TRUE}
1051+
```{r ex12-solution}
10481052
SE_smaller_n <- boot_dist_smaller_n |>
10491053
summarize(se = sd(stat)) |>
10501054
pull()
1055+
1056+
SE_smaller_n
10511057
```
10521058

10531059
###
@@ -1083,7 +1089,7 @@ SE_small_n > SE_smaller_n
10831089

10841090
### SE with different p
10851091

1086-
You just saw the effect that _sample size_ can have on inference, but that's not the only variable in play here. Let's return now to our full data set and see what happens to the SE when we consider a category that has a different _population proportion_, p.
1092+
You just saw the effect that _sample size_ can have on inference, but that's not the only variable at play here. Let's return now to our full data set and see what happens to the SE when we consider a category that has a different _population proportion_, p.
10871093

10881094
Remember that the proportion of "High" confidence in science in 2016 was pretty close to 0.50.
10891095

@@ -1114,7 +1120,7 @@ boot_dist_low_p <- gss2016 |>
11141120
generate(reps = 500, type = "bootstrap") |>
11151121
calculate(stat = "prop")
11161122
1117-
SE_low_p <- boot_dist |>
1123+
SE_low_p <- boot_dist_low_p |>
11181124
___(se = ___) |>
11191125
___
11201126
@@ -1156,7 +1162,7 @@ c(SE_low_p, SE_consci)
11561162

11571163
You calculated two new standard errors.
11581164

1159-
One when there was less data, and the other where p-hat was low.
1165+
One when there was less data, and the other where $\hat{p}$ was low.
11601166

11611167
The different values that you observed demonstrate some important properties of standard errors: they will increase when n is small and also when p is close to 0.5.
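
These two properties can also be read off the approximation formula $\sqrt{\frac{p (1 - p)}{n}}$. The sketch below evaluates it for a few hypothetical combinations of `p` and `n`; the helper name `se_prop` and the values in the grid are illustrative only.

```r
# Approximate standard error of a sample proportion.
se_prop <- function(p, n) sqrt(p * (1 - p) / n)

# Hypothetical scenarios: a baseline, a smaller sample, and a lower proportion.
se_grid <- data.frame(
  scenario = c("baseline", "smaller n", "lower p"),
  p = c(0.5, 0.5, 0.1),
  n = c(150, 50, 150)
)
se_grid$se <- se_prop(se_grid$p, se_grid$n)

se_grid
```

The "smaller n" row has the largest SE and the "lower p" row the smallest, matching the pattern seen with the bootstrap.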
11621168

@@ -1199,7 +1205,7 @@ A.K.A the "bell curve".
11991205
knitr::include_graphics("images/normal-curve.png")
12001206
```
12011207

1202-
That approximation is the normal distribution, also known as the bell curve. A useful result in mathematics says that if you have independent observations and a sufficiently large sample size, then p-hat will follow a normal distribution with a known standard deviation. This distribution is called the sampling distribution of p-hat and it's very similar to the bootstrap distribution in that it captures the variability of our estimate across many possible data sets.
1208+
That approximation is the normal distribution, also known as the bell curve. A useful result in mathematics says that if you have independent observations and a sufficiently large sample size, then $\hat{p}$ will follow a normal distribution with a known standard deviation. This distribution is called the sampling distribution of $\hat{p}$ and it's very similar to the bootstrap distribution in that it captures the variability of our estimate across many possible data sets.
12031209

12041210
### Standard deviation
12051211

@@ -1218,13 +1224,13 @@ What does "n is large" mean?
12181224
- $n \times \hat{p} \gt 10$
12191225
- $n \times(1 - \hat{p}) \gt 10$
12201226

1221-
When applying this result in practice, it's important to be sure that the assumptions of independence and a large sample aren't wildly off base. To assess independence, you need to consider the method by which the data was collected. A handy rule of thumb to determine if your sample size is large enough is to check that n times p-hat and n times 1 - p-hat are both greater than or equal to 10.
1227+
When applying this result in practice, it's important to be sure that the assumptions of independence and a large sample aren't wildly off base. To assess independence, you need to consider the method by which the data was collected. A handy rule of thumb to determine if your sample size is large enough is to check that n times $\hat{p}$ and n times 1 - $\hat{p}$ are both greater than or equal to 10.
12221228

12231229
### Calculating standard error: approximation
12241230

1225-
OK, let's try our hand at using this shortcut to find the standard error for the proportion of people that were happy. Let's recompute p-hat, then ask the number of rows in the `gss2016`. That's the sample size, `n`.
1231+
OK, let's try our hand at using this shortcut to find the standard error for the proportion of people that were happy. Let's recompute $\hat{p}$, then ask the number of rows in the `gss2016`. That's the sample size, `n`.
12261232

1227-
Let's check the rule-of-thumb to see if our sample size is large enough by multiplying n times p-hat and n times 1 minus p-hat. This gives 116 and 35, so our sample size should be sufficiently large. We also know that the gss uses random sampling to draw these observations, so is safe to assume that one person's answer is independent of the next.
1233+
Let's check the rule-of-thumb to see if our sample size is large enough by multiplying n times $\hat{p}$ and n times 1 minus $\hat{p}$. This gives 116 and 35, so our sample size should be sufficiently large. We also know that the GSS uses random sampling to draw these observations, so it is safe to assume that one person's answer is independent of the next.
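
The arithmetic behind this check can be sketched directly from the two counts quoted above. The sample size of 151 is inferred here by adding 116 and 35, and the variable names are hypothetical; the tutorial's own code works from `gss2016` instead.

```r
# Counts taken from the text: n * p_hat = 116 and n * (1 - p_hat) = 35.
n_happy <- 116
n_not_happy <- 35

# Inferred sample size and sample proportion.
n <- n_happy + n_not_happy
p_hat <- n_happy / n

# Rule of thumb: both expected counts should be at least 10.
large_enough <- (n * p_hat >= 10) && (n * (1 - p_hat) >= 10)

# Approximate standard error from the normal-approximation formula.
se_approx <- sqrt(p_hat * (1 - p_hat) / n)

large_enough
se_approx
```

With these counts the rule of thumb is satisfied and the approximate SE comes out at roughly 0.034.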
12281234

12291235
```{r echo=TRUE}
12301236
p_hat_happy <- gss2016 |>
@@ -1245,7 +1251,7 @@ SE_happy_approx
12451251

12461252
### Calculating standard error: computation
12471253

1248-
How does it compare to our original computational approach using the bootstrap? Well, if we construct the bootstrap distribution for p-hat, then summarize it by finding it's standard deviation, we estimate a standard error of about 0.032. Those are remarkably similar values! Let's go a step farther.
1254+
How does it compare to our original computational approach using the bootstrap? Well, if we construct the bootstrap distribution for $\hat{p}$, then summarize it by finding its standard deviation, we estimate a standard error of about 0.035. Those are remarkably similar values! Let's go a step farther.
12491255

12501256
```{r echo=TRUE}
12511257
boot_dist_happy <- gss2016 |>
@@ -1260,18 +1266,16 @@ SE_happy_boot
12601266

12611267
### Shape of sampling distributions
12621268

1263-
Let's also take a look at the shape of this bootstrap distribution. A density plot suggests that it's unimodal and symmetric. Let's add a layer to this plot that contains the normal curve that's centered at p-hat has uses the equation to find the standard deviation. And yes, let's make that curve purple.
1269+
Let's also take a look at the shape of this bootstrap distribution. A density plot suggests that it's unimodal and symmetric.
12641270

12651271
```{r echo=TRUE}
12661272
ggplot(boot_dist_happy, aes(x = stat)) +
12671273
geom_density()
12681274
```
12691275

1270-
We see that the normal approximation looks fairly similar to the density curve of our bootstrap distribution. This will be a recurring theme: that when an approximation method exists, it will tend to give very similar results to the computational method when the assumptions of that approximation are reasonable.
1271-
12721276
###
12731277

1274-
Let's add a layer to this plot that contains the normal curve that's centered at p-hat has uses the equation to find the standard deviation. And yes, let's make that curve purple.
1278+
Let's add a layer to this plot that contains the normal curve that's centered at $\hat{p}$ and uses the equation to find the standard deviation. And yes, let's make that curve purple.
12751279

12761280
```{r echo=TRUE}
12771281
ggplot(boot_dist_happy, aes(x = stat)) +
@@ -1347,7 +1351,7 @@ p_hat_meta
13471351
Now we'll construct the confidence interval around this proportion.
13481352

13491353
- Check the rules-of-thumb for the normal distribution being a decent approximation.
1350-
- Calculate the standard error using the approximation formula $\sqrt{\frac{\hat{p} \times (1 - \hat{p})}{n}}$.
1354+
- Calculate the standard error using the approximation formula $\sqrt{\frac{\hat{p} (1 - \hat{p})}{n}}$.
13511355
- Use `SE_meta_approx` to form a confidence interval for `p_hat`. The limits should be two standard errors either side of `p_hat`.
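
The three steps above can be sketched as follows. The values of `p_hat` and `n` are hypothetical placeholders, not the exercise's data, and the variable names are chosen only for illustration.

```r
# Hypothetical inputs standing in for the exercise's proportion and sample size.
p_hat <- 0.4
n <- 150

# Step 1: rule-of-thumb check for the normal approximation.
large_enough <- (n * p_hat >= 10) && (n * (1 - p_hat) >= 10)

# Step 2: approximate standard error.
se <- sqrt(p_hat * (1 - p_hat) / n)

# Step 3: interval extending two standard errors either side of p_hat.
ci <- c(lower = p_hat - 2 * se, upper = p_hat + 2 * se)

large_enough
ci
```

The interval is symmetric around `p_hat` with a total width of four standard errors.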
13521356

13531357
```{r ex16-setup, include=FALSE}
