# Statistical inference {#statistical-inference}
```{r setup9, include=FALSE}
knitr::opts_chunk$set(echo = FALSE,
prompt = FALSE,
tidy = TRUE,
collapse = TRUE)
library("tidyverse")
```
In Chapter \@ref(statistics), we learned about estimation: the use of data and
statistics to construct the best possible guess at the value of some parameter
$\theta$.
In this chapter, we will pursue a different goal. Instead of estimating
the value of $\theta$, we will ask what values of $\theta$ can be confidently
ruled out given the evidence available in our data:
- A hypothesis test determines whether a *particular* value *can* be
ruled out.
- A confidence interval determines a *range* of values that *cannot*
be ruled out.
The set of procedures for constructing confidence intervals and
hypothesis tests is called statistical ***inference***.
::: goals
**Chapter goals**
In this chapter we will learn how to:
- Construct and perform a simple hypothesis test
- Construct a confidence interval
- Correctly interpret both hypothesis tests and confidence intervals
:::
## Principles of inference
### Evidence
The purpose of statistical inference is to systematically account for the
uncertainty associated with limited evidence. That is, there are important
aspects of the data generating process we do not know. The data provide
some evidence about those unknown aspects, but the evidence they provide
may not be strong. Statistical inference asks us what statements about
the data generating process can be made with confidence based on the data.
::: example
**Roulette**
Suppose you work as a casino regulator for the BCLC (British Columbia
Lottery Corporation, the crown corporation that regulates all commercial
gambling in B.C.). You have been given data with recent roulette
results from a particular casino and are tasked with determining whether
the casino is running a fair game.
Before getting caught up in math, let's think about how we might assess
evidence:
- If we have data from many games, and the win rate for gamblers is about
the same as the win probability in a fair game, then we would conclude there
is strong evidence the game is *fair or close to fair*.
- If we have data from many games, and the win rate for gamblers is much
higher or lower than the win probability in a fair game, then we would
conclude there is strong evidence that the game is *not fair*.
- If we have no data, or data from only a few games, we probably *can't tell*
whether the game is fair or unfair.
In this chapter we will formalize these basic ideas about evidence.
:::
### A basic framework
For the remainder of this chapter, suppose we have a data set
$D_n = (x_1,x_2,\ldots,x_n)$ of size $n$. The data comes from an
unknown data generating process that includes an unknown
parameter of interest $\theta$.
::: example
**DGP and parameter of interest for roulette**
Let $D_n = (x_1,\ldots,x_n)$ be a data set of results from
$n = 100$ games of roulette at a local casino. More specifically,
let
$$x_i = I(\textrm{Red wins})$$
Our parameter of interest is the probability that
red wins:
$$p_{red} = \Pr(x_i = 1) = E(x_i)$$
We know that red wins in a fair game with probability
$p_{red} = 18/37 \approx 0.486$.
:::
## Hypothesis tests
We will start with hypothesis tests. The idea of a hypothesis test
is to determine whether the data rule out or ***reject*** a specific
value of the unknown parameter $\theta$.
Intuitively, if we have no (useful) data we cannot rule anything
out, but as we obtain more data, we can rule out more values.
### The null and alternative hypotheses
The first step in a hypothesis test is to define the
***null hypothesis***. The null hypothesis is a statement
about our parameter $\theta$ that takes the form:
$$H_0: \theta = \theta_0$$
where $\theta_0$ is a specific number. This is the value of
$\theta$ we are interested in ruling out.
The next step is to define the ***alternative hypothesis***. The
alternative hypothesis defines every other value of $\theta$
we are allowing, and is usually written as:
$$H_1: \theta \neq \theta_0$$
where $\theta_0$ is the same number as used in the null.
::: example
**Null and alternative for $p_{red}$**
In our roulette example, our null hypothesis is
that the game is fair:
$$H_0: p_{red} = 18/37 \approx 0.486$$
and the alternative hypothesis is that it is not fair:
$$H_1: p_{red} \neq 18/37$$
:::
Notice that there is something of an asymmetry between the null and
alternative hypotheses: the null is typically (though not necessarily)
a single value and the alternative is every other possible value.
:::fyi
**What null hypothesis to choose?**
Our framework here assumes that you already know what null hypothesis you
wish to test, but we might briefly consider how we might choose a null
hypothesis to test.
In some applications there are null hypotheses that are of clear interest
for that specific case:
- In our roulette example, the natural null to test is whether the
win probability matches that of a fair game ($p = p_{fair}$).
- When measuring the effect $\beta$ of one variable on another, the natural
null to test is "no effect at all" ($\beta = 0$).
- When comparing the mean of some characteristic or outcome across two
groups (for example, average wages of men and women), the natural null
to test is that they are the same ($\mu_M = \mu_W$).
- In epidemiology, a contagious disease will tend to spread if its
reproduction rate $R$ is greater than one, and decline if it
is less than one, so $R = 1$ is a natural null to test.
If there is no obvious null hypothesis, it may make sense to test many
null hypotheses and report all of the results. There is nothing
wrong with doing that.
:::
### The test statistic
Our next step is to construct a ***test statistic*** that can be
calculated from our data. A valid test statistic for a given null
hypothesis is a statistic $t_n$ that has the following two
properties:
1. The probability distribution of $t_n$ ***under the null***
(i.e., when $H_0$ is true) is *known*.
2. The probability distribution of $t_n$ ***under the alternative***
(i.e., when $H_1$ is true) is *different* from its
probability distribution under the null.
The test statistic is usually based on an estimator of the parameter,
and is usually constructed so that it is typically close to zero
when the null is true, and far from zero when the null is false. But
it does not need to be.
::: example
***A test statistic for roulette***
A natural test statistic for the win probability of a bet on red
would be the corresponding win frequency in our data. We could
use the relative win frequency (which also happens to
be the sample average):
$$\hat{f}_{red} = \bar{x}_n = \frac{1}{n} \sum_{i=1}^n x_i$$
but it will be more convenient to use the absolute win frequency:
$$t_n = n\hat{f}_{red} = n\bar{x}_n =\sum_{i=1}^n x_i$$
Next we need to find the probability distribution of $t_n$
under the null, and under the alternative.
In general, since $x_i \sim Bernoulli(p_{red})$ we have:
$$t_n \sim Binomial(n,p_{red})$$
Remember that the $Binomial(n,p)$ distribution is the distribution corresponding
to the number of times an event with probability $p$ happens in $n$ independent
trials.
Under the null (when $H_0$ is true), $p_{red} = 18/37$ and so:
$$t_n \sim Binomial(100,18/37)$$
Since this distribution does not involve any unknown parameters,
our test statistic satisfies the requirement of having a known
distribution under the null.
Under the alternative (when $H_1$ is true), $p_{red}$ can take on any value
*other* than $18/37$. The sample size is still $n=100$, so the
distribution of the test statistic is:
$$t_n \sim Binomial(100,p_{red}) \textrm{ where $p_{red} \neq 18/37$ }$$
Notice that the distribution of our test statistic under the alternative
is not known, since $p_{red}$ is not known. But the distribution is
*different* under the alternative, and that is what we require from
our test statistic.
:::
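The null distribution of $t_n$ can also be checked by simulation. The
sketch below (the chunk name and simulation size are my own choices)
draws many samples of $n = 100$ games with the fair win probability and
tabulates the resulting win counts:
```{r SimulateNullDistribution, echo=TRUE}
# Simulate the null distribution of t_n: play n = 100 games
# with the fair win probability 18/37, many times over
set.seed(1)
sims <- rbinom(10000, size = 100, prob = 18/37)
# The simulated mean should be close to n*p = 100*(18/37) = 48.65
cat("simulated mean of t_n =", mean(sims), "\n")
```
A histogram of `sims` would trace out the $Binomial(100,18/37)$
probabilities.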
### Significance and critical values
After choosing a test statistic $t_n$, the next step is to choose
***critical values***. The critical values are two numbers
$c_L$ and $c_H$ (where $c_L < c_H$) such that
- $t_n$ has a *high* probability of being between $c_L$ and $c_H$ when
the null is true.
- $t_n$ has a *lower* probability of being between $c_L$ and $c_H$ when
the alternative is true.
The range of values from $c_L$ to $c_H$ is called the ***critical range***
of our test.
Given the test statistic and critical values:
- We ***reject the null*** if $t_n$ is outside of the critical range.
- This means we have strong evidence that $H_0$ is false.
- The reason we reject here is that we know we would be unlikely to
observe such a value of $t_n$ if $H_0$ were true.
- We ***fail to reject the null*** if $t_n$ is inside of the critical
range.
- This means we do not have strong evidence that $H_0$ is false.
- This does not mean we have strong evidence that $H_0$ is true. We
may just not have enough evidence to reach a conclusion.
Notice that there is an asymmetry here: in the absence of evidence,
we will not reject *any* null hypotheses.
![Critical values and rejecting the null hypothesis](bin/reject.png)
How do we choose critical values? You can think of critical values as setting
a standard of evidence, so we need to balance two considerations:
- The probability of rejecting a false null is called the ***power***
of the test.
- We want our test to reject the null when it is false, so power is good.
- The probability of rejecting a true null is called the ***size***
or ***significance*** of a test.
- We do not want our test to reject the null when it is true, so size
is bad.
- There is always a trade-off between power and size:
- A *narrower* critical range (higher $c_L$ or lower $c_H$) will increase
the rejection rate, increasing both power (good) and size (bad).
- A *wider* critical range (lower $c_L$ or higher $c_H$) will reduce
the rejection rate, reducing both power (bad) and size (good).
Given this trade-off between power and size, we might construct some
criterion that includes both (just like MSE includes both variance and
bias) and choose critical values to maximize that criterion. In practice,
we do not typically do that.
Instead, we follow a simple convention:
1. Set the size to a fixed value $\alpha$.
- In general, the conventional size varies by field, typically
reflecting how much data is available in that field.
- In economics and most other social sciences, the usual convention
is to use a size of 5\% ($\alpha = 0.05$).
- We sometimes see 1\% ($\alpha = 0.01$) when working with larger data sets
or 10\% ($\alpha = 0.10$) when working with small data sets.
- In physics or genetics, where data sets are much larger, the
conventional size is much lower.
2. Calculate critical values that imply the desired size.
- With a size of 5\% $(\alpha = 0.05)$, we would:
- set $c_L$ to the 2.5 percentile (0.025 quantile) of the null distribution
- set $c_H$ to the 97.5 percentile (0.975 quantile) of the null distribution
- With a size of 10\% $(\alpha = 0.10)$, we would:
- set $c_L$ to the 5 percentile (0.05 quantile) of the null distribution
- set $c_H$ to the 95 percentile (0.95 quantile) of the null distribution
- With a size of $\alpha$, we would:
- set $c_L$ to the $\alpha/2$ quantile of the null distribution
- set $c_H$ to the $1-\alpha/2$ quantile of the null distribution
In other words, we set size equal to a conventional value, and let the
power be whatever is implied by that.
::: example
**Critical values for roulette**
We earlier showed that the distribution of $t_n$ under the null is:
$$t_n \sim Binomial(100,18/37)$$
We can get a size of 5\% by choosing:
$$c_L = 2.5 \textrm{ percentile of } Binomial(100,18/37)$$
$$c_H = 97.5 \textrm{ percentile of } Binomial(100,18/37)$$
We can then use Excel or R to calculate these critical values. In Excel,
the function you would use is `BINOM.INV()`
- The formula to calculate $c_L$ is `=BINOM.INV(100,18/37,0.025)`
- The formula to calculate $c_H$ is `=BINOM.INV(100,18/37,0.975)`
The calculations below were done in R:
```{r BinomialCriticalValues}
cat("2.5 percentile of binomial(100,18/37) =",
qbinom(0.025,100,18/37),
"\n")
cat("97.5 percentile of binomial(100,18/37) =",
qbinom(0.975,100,18/37),
"\n")
```
In other words we reject the null (at 5\% significance) that
the roulette wheel is fair if red wins fewer than 39 games
or more than 58 games.
:::
::: fyi
**A general test for a single probability**
We can generalize the test we have constructed so far to the case
of the probability of any event:
| Test component | Roulette example |General case |
|------------------------|--------------------------------------|---------------------------|
| Parameter | $p_{red} = \Pr(\textrm{Red wins})$ | $p = \Pr(\textrm{event})$ |
| Null hypothesis | $H_0:p_{red} = 18/37$ | $H_0:p = p_0$ |
| Alternative hypothesis | $H_1: p_{red} \neq 18/37$ |$H_1: p \neq p_0$ |
| Test statistic | $t = n\hat{f}_{RED}$ | $t = n\hat{f}_{\textrm{event}}$ |
| Null distribution | $Binomial(100,18/37)$ |$Binomial(n,p_0)$ |
| Critical value $c_L$ | 39 | 2.5 percentile of $Binomial(n,p_0)$ |
| Critical value $c_H$ | 58 | 97.5 percentile of $Binomial(n,p_0)$ |
| Decision | Reject if $t \notin [39,58]$ | Reject if $t \notin [c_L,c_H]$ |
:::
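As a sketch of how the decision rule in this table might be coded (the
function `binom_reject` and the chunk name are my own, not from any
package):
```{r GeneralBinomialTest, echo=TRUE}
# Test H0: p = p0 against H1: p != p0 at significance level alpha,
# where t is the number of times the event occurred in n trials
binom_reject <- function(t, n, p0, alpha = 0.05) {
  cL <- qbinom(alpha / 2, n, p0)       # lower critical value
  cH <- qbinom(1 - alpha / 2, n, p0)   # upper critical value
  (t < cL) | (t > cH)                  # TRUE means reject the null
}
# Roulette example: red wins 45 of 100 games (inside [39,58])
binom_reject(45, 100, 18/37)
# Red wins only 38 of 100 games (outside [39,58])
binom_reject(38, 100, 18/37)
```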
### The power of a test
As mentioned above, the power of a test is defined as the probability
of rejecting the null when it is false, and the alternative is true.
The size of a test is a number, since the distribution of the test statistic
is known under the null. Since the alternative typically allows
more than one value of the parameter $\theta$, the power
of a test is not a number but a *function* of the unknown true value
of $\theta$ (and sometimes other unknown features of the DGP):
$$power(\theta) = \Pr(\textrm{reject $H_0$})$$
In some cases we can actually calculate this function.
::: example
**The power curve for roulette**
Power curves can be tricky to calculate, and I will not ask you to calculate
them for this course. But they can be calculated, and it is useful to
see what they look like.
Figure \@ref(fig:PowerCurves) below depicts the power curve for the roulette test we have
just constructed; that is, we are testing the null that $p_{red} = 18/37$
at a 5\% size. The blue line depicts the power curve for $n=100$
as in our example, while the green line depicts the power curve for
$n=20$.
```{r PowerCurves, fig.cap = "*Power curves for the roulette example*"}
PowerCurveData <- tibble(theta = seq(0,1,length.out=380),
power20 = pbinom(qbinom(0.025,
20,
18/37),
20,
theta) +
(1 - pbinom(qbinom(0.975,
20,
18/37),
20,
theta)),
power100 = pbinom(qbinom(0.025,
100,
18/37),
100,
theta) +
(1 - pbinom(qbinom(0.975,
100,
18/37),
100,
theta)))
ggplot(data = PowerCurveData,
mapping = aes(x = theta,y = power20)) +
geom_line(col="green") +
geom_line(aes(y = power100), col="blue") +
xlab("true probability of winning") +
ylab("power") +
geom_text(label="n=100",x=0.7,y=0.8,col="blue") +
geom_text(label="n=20",x=0.9,y=0.8,col="green") +
geom_hline(yintercept = 0.05,col="gray") +
geom_vline(xintercept = 18/37,col="gray") +
labs(title = "Power curve for fair roulette wheel",
subtitle = "",
caption = "H0: Pr(red wins) = 18/37, significance = 0.05",
tag = "")
```
There are a few features I would like you to notice, all of which
are common to most regularly used tests:
- The power curve reaches its lowest value at the point where the gray
lines cross: $(18/37,0.05)$. Note that $18/37$ is the parameter value
under the null, and $0.05$ is the size of the test. In other words:
- The power is always at least as big as the size, and is usually bigger.
- We are more likely to reject the null when it is false than when it
is true. That's good!
- When a test has this desirable property, we call
it an ***unbiased*** test.
- The power increases as $\theta$ gets further from the null.
- That is, we are more likely to detect unfairness in a game
that is *very* unfair than in one that is *a little* unfair.
- Power also increases with the sample size; the blue line
($n = 100$) is above the green line ($n = 20$).
Power analysis is often used by researchers to determine how much data
to collect. Each additional observation increases power but costs money,
so it is important to spend enough to get clear results but not
much more than that.
:::
::: fyi
**P values**
The convention of always using a 5\% significance level for
hypothesis tests is somewhat arbitrary and has some negative
unintended consequences:
1. Sometimes a test statistic falls just below or just above the
critical value, and small changes in the analysis can change
a result from reject to cannot-reject.
2. In many fields, unsophisticated researchers and journal
editors misinterpret "cannot reject the null" as "the null is true."
One common response to these issues is to report what is called
the ***p-value*** of a test. The p-value of a test is defined
as the significance level at which one would switch from rejecting
to not-rejecting the null. For example:
- If the p-value is 0.43 (43\%) we would not reject the null at 10\%,
5\% or 1\%.
- If the p-value is 0.06 (6\%) we would reject the null at 10\% but not
at 5\% or 1\%.
- If the p-value is 0.02 (2\%) we would reject the null at 10\% and 5\%
but not at 1\%.
- If the p-value is 0.001 (0.1\%) we would reject the null at 10\%, 5\%,
and 1\%.
The p-value of a test is simple to calculate from the test statistic and
its distribution under the null. I won't go through that calculation
here.
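For the curious, here is a minimal sketch of that calculation for the
roulette test (doubling the smaller tail, as done here, is one common
convention for two-sided tests with a discrete statistic, not the only
one):
```{r BinomialPValue, echo=TRUE}
# Two-sided p-value when red wins t_obs = 40 of 100 games,
# under the null distribution Binomial(100, 18/37)
t_obs <- 40
p_lo <- pbinom(t_obs, 100, 18/37)          # Pr(t_n <= 40) under H0
p_hi <- 1 - pbinom(t_obs - 1, 100, 18/37)  # Pr(t_n >= 40) under H0
pval <- min(1, 2 * min(p_lo, p_hi))        # cap at 1
cat("p-value =", round(pval, 3), "\n")
```
Since 40 wins is inside the critical range $[39,58]$, this p-value comes
out above 0.05, matching the decision not to reject at 5\% significance.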
:::
## Application: the mean
Having described the general framework and a single example, we now
move on to the most common application: constructing hypothesis
tests and confidence intervals on the mean in a random sample.
Let $D = (x_1,\ldots,x_n)$ be a random sample of size $n$
on some random variable $x_i$ with unknown mean $E(x_i) = \mu_x$
and variance $var(x_i) = \sigma_x^2$.
Let the sample average be:
$$\bar{x}_n = \frac{1}{n} \sum_{i=1}^n x_i$$
and the sample variance:
$$s_x^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2$$
Both of these statistics are easily calculated from the data,
and we have previously discussed their properties in detail.
### The null and alternative hypotheses
Suppose that you want to test the null hypothesis:
$$H_0: \mu_x = 1$$
against the alternative hypothesis
$$H_1: \mu_x \neq 1$$
Having stated our null and alternative hypotheses, we need to
construct a test statistic.
Remember that our test statistic needs to have a *known* distribution
under the null, and a *different* distribution under the alternative.
### The T statistic
The typical test statistic we use in this setting is called the
***T statistic***, and takes the form:
$$t_n = \frac{\bar{x}_n - 1}{s_x/\sqrt{n}}$$
The idea here is that we take our estimate of the parameter ($\bar{x}$),
subtract its expected value under the null ($1$), and divide
by an estimate of its standard deviation ($s_x/\sqrt{n}$). We can
add and subtract the unknown true mean $\mu_x$ to get:
\begin{align}
t_n &= \frac{\bar{x}_n - \mu_x + \mu_x - 1}{s_x/\sqrt{n}} \\
&= \frac{\bar{x}_n - \mu_x}{s_x/\sqrt{n}} + \frac{\mu_x - 1}{s_x/\sqrt{n}}
\end{align}
The first part of this expression is a random variable with a mean of zero
and a variance of (about) one. The second part of the expression is
exactly zero when $H_0$ is true, and not exactly zero when it is false.
Recall that we need the probability distribution of $t_n$ to be known when
$H_0$ is true, and different when it is false. The second criterion is
clearly met, and the first criterion is met if we can find the
probability distribution of $\frac{\bar{x}_n - \mu_x}{s_x/\sqrt{n}}$.
Unfortunately, if we don't know the exact probability distribution
of $x_i$ we don't know the exact probability distribution of statistics
calculated from it. Once we have a potential test statistic, there
are two standard solutions to this problem:
1. Assume a specific probability distribution (usually a normal
distribution) for $x_i$. We can (or at least a
professional statistician can) then mathematically derive
the distribution of any test statistic from this distribution.
2. Use the central limit theorem to get an approximate probability
distribution.
We will explore both of those options.
### Asymptotic critical values
We will start with the asymptotic solution to the problem. As we learned in
Chapter \@ref(statistics), the Central Limit Theorem tells us that:
$$\frac{\bar{x}_n - \mu_x}{\sigma_x/\sqrt{n}} \rightarrow N(0,1)$$
Under the null our test statistic looks just like this, but with the sample
standard deviation $s_x$ in place of the population standard deviation
$\sigma_x$. It turns out that Slutsky's theorem allows us to make
this substitution, and it can be proved that:
$$\frac{\bar{x}_n - \mu_x}{s_x/\sqrt{n}} \rightarrow N(0,1)$$
Therefore, under the null:
$$t_n \rightarrow N(0,1)$$
In other words, while we do not know the exact (finite-sample) distribution
of $t_n$ we know that $N(0,1)$ provides a useful asymptotic approximation
to that distribution.
```{r NormalDistribution, fig.cap = "*Asymptotic distribution of t_n under the null*"}
simdata <- tibble(x = seq(-3,3,length.out = 100),
tinf = dnorm(x))
ggplot(data=simdata,
mapping = aes(x= x,
y= tinf)) +
geom_line() +
xlab("value") +
ylab("PDF of t_n") +
labs(title = "Asymptotic distribution of t_n under null",
subtitle = "",
caption = "",
tag = "")
```
Therefore, if we want a test that has the ***asymptotic size*** of 5\%, we
can use Excel or R to calculate critical values. In Excel, the function
would be `NORM.INV` or `NORM.S.INV`, and the formulas would be:
- $c_L$: `=NORM.S.INV(0.025)` or `=NORM.INV(0.025,0,1)`.
- $c_H$: `=NORM.S.INV(0.975)` or `=NORM.INV(0.975,0,1)`.
The calculations below were done in R:
```{r AsymptoticCriticalValues}
cat("cL = 2.5 percentile of N(0,1) = ",
round(qnorm(0.025),3),
"\n")
cat("cH = 97.5 percentile of N(0,1) = ",
round(qnorm(0.975),3),
"\n")
```
These particular critical values are so commonly used that I want you
to remember them.
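Putting the pieces together, here is a minimal sketch of the asymptotic
test (the function name and the made-up sample are my own):
```{r AsymptoticMeanTest, echo=TRUE}
# T statistic for H0: mu_x = mu0 in a random sample x
t_stat <- function(x, mu0) {
  sqrt(length(x)) * (mean(x) - mu0) / sd(x)
}
# Made-up sample; reject at 5% if |t_n| > 1.96
x <- c(0.9, 1.4, 0.7, 1.2, 1.1, 0.8, 1.3, 1.0)
tn <- t_stat(x, mu0 = 1)
cat("t_n =", round(tn, 3),
    "; reject at 5% =", abs(tn) > qnorm(0.975), "\n")
```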
### Exact critical values
Most economic data comes in sufficiently large samples that the asymptotic
distribution of $t_n$ is a reasonable approximation and the asymptotic
test works well. But occasionally we have samples that are small enough
that it doesn't.
Another option is to assume that the $x_i$ variables are normally
distributed:
$$x_i \sim N(\mu_x,\sigma_x^2)$$
where $\mu_x$ and $\sigma_x^2$ are unknown parameters.
Now at this point it is important to remind you: many interesting
variables are *not* normally distributed (for example, our roulette outcome
is discrete uniform, and the result of a given bet is Bernoulli)
and so this assumption may very well be incorrect.
If this were a more advanced course, I would derive the
distribution of $t_n$ under the null. But for ECON 233, I will
just ask you to understand that it *can* be derived once we
assume normality of the $x_i$.
The null distribution of this particular test statistic under these
particular assumptions was derived in the 1920s by William Sealy Gosset,
a statistician working at the Guinness brewery. To avoid getting in
trouble at work (Guinness did not want to give away trade secrets),
Gosset published under the pseudonym "Student". As a result, the family
of distributions he derived is called "Student's T distribution".
When the null is true, the test statistic $t_n$ as described above
has Student's T distribution with $n-1$ degrees of freedom:
$$t_n \sim T_{n-1}$$
As always, it has a different distribution (sometimes called the "noncentral
T distribution") when the null is false.
The $T_{n-1}$ distribution looks a lot like the $N(0,1)$
distribution, but has slightly higher probability of extreme
positive or negative values. As $n$ increases the
$T_{n-1}$ distribution converges to the
$N(0,1)$ distribution, just as predicted by the central
limit theorem.
```{r TDistribution, fig.cap = "*Exact distribution of t_n under the null*"}
simdata <- tibble(x = seq(-3,3,length.out = 100),
t5 = dt(x,df=4),
t10 = dt(x,df=9),
t30 = dt(x,df=29),
tinf = dnorm(x))
ggplot(data = simdata,
mapping = aes(x = x)) +
geom_line(aes(y = t5), col = "green") +
geom_line(aes(y = t10), col = "blue") +
geom_line(aes(y = t30), col = "red") +
geom_line(aes(y = tinf), col = "black") +
geom_text(label="n = infinity",x=0.75,y=0.4,col="black") +
geom_text(label="n = 5",x=2.5,y=0.1,col="green") +
geom_text(label="n = 10",x=1.7,y=0.2,col="blue") +
geom_text(label="n = 30",x=1.2,y=0.3,col="red") +
xlab("value") +
ylab("PDF") +
labs(title = "Exact distribution of t_n under null",
subtitle = "",
caption = "(x assumed to be normally distributed)",
tag = "")
```
Having found our test statistic and its distribution under the null,
we can calculate our critical values:
$$c_L = 2.5 \textrm{ percentile of } T_{n-1}$$
$$c_H = 97.5 \textrm{ percentile of } T_{n-1}$$
We can obtain these percentiles using Excel or R. In Excel, the
relevant function is `T.INV`. For example, if we have 5 observations, then:
- We would calculate $c_L$ by the formula `=T.INV(0.025,4)`.
- We would calculate $c_H$ by the formula `=T.INV(0.975,4)`.
The results (calculated below using R) would be:
```{r T4CriticalValues}
cat("cL = 2.5 percentile of T_4 = ",
round(qt(0.025,df=4),3),
"\n")
cat("cH = 97.5 percentile of T_4 = ",
round(qt(0.975,df=4),3),
"\n")
```
In contrast, if we have 30 observations, then:
- We would calculate $c_L$ by the formula `=T.INV(0.025,29)`.
- We would calculate $c_H$ by the formula `=T.INV(0.975,29)`.
The results (calculated below using R) would be:
```{r T29CriticalValues}
cat("cL = 2.5 percentile of T_29 = ",
round(qt(0.025,df=29),3),
"\n")
cat("cH = 97.5 percentile of T_29 = ",
round(qt(0.975,df=29),3),
"\n")
```
and if we have 1,000 observations:
- We would calculate $c_L$ by the formula `=T.INV(0.025,999)`.
- We would calculate $c_H$ by the formula `=T.INV(0.975,999)`.
The results (calculated below using R) would be:
```{r T999CriticalValues}
cat("cL = 2.5 percentile of T_999 = ",
round(qt(0.025,df=999),3),
"\n")
cat("cH = 97.5 percentile of T_999 = ",
round(qt(0.975,df=999),3),
"\n")
```
Notice that the Student's T test is more ***conservative*** (less likely
to reject) than the asymptotic test for smaller sample sizes. But
at some point (around $n = 30$) the difference between the two tests
becomes negligible.
In practice, most data sets in economics have well over 30 observations so
economists tend to use asymptotic tests unless they have a very small
sample.
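In R, the exact test is built in as `t.test()`. A quick sketch on a
made-up small sample (the data here are my own invention):
```{r TTestExample, echo=TRUE}
# R's built-in t.test() implements the exact T test of
# H0: mu_x = 1 against H1: mu_x != 1
x <- c(0.8, 1.5, 0.9, 1.3, 1.1)
test <- t.test(x, mu = 1)
cat("T statistic =", round(test$statistic, 3),
    "; p-value =", round(test$p.value, 3), "\n")
```
The returned object also contains the matching 95\% confidence interval
in `test$conf.int`.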
## Confidence intervals
Hypothesis tests have one very important limitation: although they
allow us to rule out $\theta = \theta_0$ for a single value of $\theta_0$,
they say nothing about other values very close to $\theta_0$.
For example, suppose you are a medical researcher trying to measure the
effect of a particular treatment, let $\theta$ be the treatment effect, and
suppose that you have tested the null hypothesis that the treatment
has no effect ($\theta = 0$).
- If you reject this null, you have concluded that the treatment has *some*
effect. However, that does not rule out the possibility that
the effect of the treatment is *very small*.
- If you fail to reject this null, you cannot rule out the possibility
that the treatment has *no* effect. However, this does not rule out
the possibility that the effect is *very large*.
The solution to this would be to do a hypothesis test for every possible
value of $\theta$, and classify them into values that were rejected
and not rejected. This is the idea of a ***confidence interval***.
A confidence interval for the parameter $\theta$ with coverage
rate $CP$ is an interval with lower bound $CI_L$ and
upper bound $CI_H$ constructed from the data in such a way that
$$\Pr(CI_L < \theta < CI_H) = CP$$
In economics and most other social sciences, the convention is to report
confidence intervals with a coverage rate of 95\%:
$$\Pr(CI_L < \theta < CI_H) = 0.95$$
Note that $\theta$ is a fixed (but unknown) parameter, while $CI_L$
and $CI_H$ are statistics calculated from the data.
How do we calculate confidence intervals? It turns out to be
entirely straightforward: confidence intervals can be constructed
by inverting hypothesis tests:
- The 95\% confidence interval is all values that cannot be
rejected at a 5\% level of significance.
- The 90\% confidence interval is all values that cannot be
rejected at a 10\% level of significance.
- It is *narrower* than the 95\% confidence interval.
- The 99\% confidence interval is all values that cannot be
rejected at a 1\% level of significance.
- It is *wider* than the 95\% confidence interval.
::: example
**A confidence interval for the win probability**
Calculating a confidence interval for $p_{red}$ is somewhat
tricky to do by hand, but easy to do on a computer:
1. Construct a grid of many values between 0 and 1.
2. For each value $p_0$ in the grid, test the null hypothesis
$H_0: p_{red} = p_0$ against the alternative hypothesis
$H_1: p_{red} \neq p_0$.
3. The confidence interval is the range of values for $p_0$
that are not rejected.
For example, suppose that red wins on 40 of the 100 games. Then
a 95\% confidence interval for $p_{red}$ is:
```{r RouletteCI}
theta <- seq(0,1,length.out=101)
cL <- qbinom(0.025,100,theta)
cH <- qbinom(0.975,100,theta)
thetaCI <- range(theta[cL <= 40 & cH >= 40])
cat(thetaCI[1]," to ",thetaCI[2],"\n")
```
Notice that the confidence interval includes the fair value of $0.486$
but it also includes some very unfair values. In other words, while
we are unable to rule out the possibility that we have a fair game,
the evidence that we have a fair game is not very strong.
:::
### Confidence intervals for the mean
Confidence intervals for the mean are very easy to calculate. Again
we construct them by inverting the hypothesis test.
Pick any $\mu_0$. To test the null
$$H_0: \mu_x = \mu_0$$
our test statistic is:
$$t_n = \sqrt{n}\frac{\bar{x}-\mu_0}{s_x}$$
and we fail to reject the null if
$$c_L < t_n < c_H$$
where $c_L$ and $c_H$ are our critical values.
Plugging $t_n$ into this expression, we fail to reject the null whenever:
$$c_L < \sqrt{n}\frac{\bar{x}-\mu_0}{s_x} < c_H$$
Solving for $\mu_0$ we fail to reject whenever:
$$\bar{x} - c_H s_x/\sqrt{n} < \mu_0 < \bar{x} - c_L s_x/\sqrt{n}$$
All that remains is to choose a confidence/size level, and
decide whether to use an asymptotic or finite sample test.
If we are using the asymptotic approximation to construct
a 95\% confidence interval, then the 5\% asymptotic critical values
are $c_L \approx -1.96$ and $c_H \approx 1.96$, and the
confidence interval is:
$$CI = \bar{x} \pm 1.96 s_x/\sqrt{n}$$
In other words, the 95\% confidence interval for $\mu_x$
is just the point estimate plus or minus roughly 2 standard
errors.
If we have a small sample, and choose to assume normality rather
than using the asymptotic approximation, then we need to use the
slightly larger critical values from the $T_{n-1}$ distribution.
For example, if $n=5$, then $c_L \approx -2.78$,
$c_H \approx 2.78$ and the 95\% confidence interval is:
$$CI = \bar{x} \pm 2.78 s_x/\sqrt{n}$$
As with hypothesis tests, finite sample confidence intervals
are typically more conservative (wider) than their asymptotic
cousins, but the difference becomes negligible as
the sample size increases.
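These formulas are straightforward to implement. Here is a small sketch
(the function name and sample are my own) covering both cases:
```{r MeanCI, echo=TRUE}
# 95% confidence interval for the mean: asymptotic (normal critical
# values) or exact (T critical values, assuming normally distributed data)
mean_ci <- function(x, exact = FALSE) {
  n <- length(x)
  cH <- if (exact) qt(0.975, df = n - 1) else qnorm(0.975)
  mean(x) + c(-1, 1) * cH * sd(x) / sqrt(n)
}
x <- c(0.8, 1.5, 0.9, 1.3, 1.1)
cat("asymptotic 95% CI:", round(mean_ci(x), 3), "\n")
cat("exact 95% CI:     ", round(mean_ci(x, exact = TRUE), 3), "\n")
```
As expected, the exact interval is wider than the asymptotic one for
this small sample.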
## Chapter review {#review-inference}
To be added
### Practice problems {#problems-inference}
To be added.