# Statistical inference {#statistical-inference}
```{r setup9, include=FALSE}
knitr::opts_chunk$set(echo = FALSE,
prompt = FALSE,
tidy = TRUE,
collapse = TRUE)
library("tidyverse")
```
In Chapter \@ref(statistics), we learned about estimation: the use of data and
statistics to construct the best possible guess at the value of some parameter
$\theta$.
In this chapter, we will pursue a different goal. Instead of estimating
the value of $\theta$, we will ask what values of $\theta$ can be confidently
ruled out given the evidence available in our data:
- A hypothesis test determines whether a *particular* value *can* be
ruled out.
- A confidence interval determines a *range* of values that *cannot*
be ruled out.
The set of procedures for constructing confidence intervals and
hypothesis tests is called statistical ***inference***.
::: goals
**Chapter goals**
In this chapter we will learn how to:
- Construct and perform a simple hypothesis test
- Construct a confidence interval
- Correctly interpret both hypothesis tests and confidence intervals
:::
## Principles of inference
### Evidence
The purpose of statistical inference is to systematically account for the
uncertainty associated with limited evidence. That is, there are important
aspects of the data generating process we do not know. The data provide
some evidence about those unknown aspects, but the evidence they provide
may not be strong. Statistical inference asks us what statements about
the data generating process can be made with confidence based on the data.
::: example
**Roulette**
Suppose you work as a casino regulator for the BCLC (British Columbia
Lottery Corporation, the crown corporation that regulates all commercial
gambling in B.C.). You have been given data with recent roulette
results from a particular casino and are tasked with determining whether
the casino is running a fair game.
Before getting caught up in math, let's think about how we might assess
evidence:
- If we have data from many games, and the win rate for gamblers is about
the same as the win probability in a fair game, then we would conclude there
is strong evidence the game is *fair or close to fair*.
- If we have data from many games, and the win rate for gamblers is much
higher or lower than the win probability in a fair game, then we would
conclude there is strong evidence that the game is *not fair*.
- If we have no data, or data from only a few games, we probably *can't tell*
whether the game is fair or unfair.
In this chapter we will formalize these basic ideas about evidence.
:::
### A basic framework
For the remainder of this chapter, suppose we have a data set
$D_n = (x_1,x_2,\ldots,x_n)$ of size $n$. The data comes from an
unknown data generating process that includes an unknown
parameter of interest $\theta$.
::: example
**DGP and parameter of interest for roulette**
Let $D_n = (x_1,\ldots,x_n)$ be a data set of results from
$n = 100$ games of roulette at a local casino. More specifically,
let
$$x_i = I(\textrm{Red wins})$$
Our parameter of interest is the probability that
red wins:
$$p_{red} = \Pr(x_i = 1) = E(x_i)$$
We know that red wins in a fair game with probability
$p_{red} = 18/37 \approx 0.486$.
:::
## Hypothesis tests
We will start with hypothesis tests. The idea of a hypothesis test
is to determine whether the data rule out or ***reject*** a specific
value of the unknown parameter $\theta$.
Intuitively, if we have no (useful) data we cannot rule anything
out, but as we obtain more data, we can rule out more values.
### The null and alternative hypotheses
The first step in a hypothesis test is to define the
***null hypothesis***. The null hypothesis is a statement
about our parameter $\theta$ that takes the form:
$$H_0: \theta = \theta_0$$
where $\theta_0$ is a specific number. This is the value of
$\theta$ we are interested in ruling out.
The next step is to define the ***alternative hypothesis***. The
alternative hypothesis defines every other value of $\theta$
we are allowing, and is usually written as:
$$H_1: \theta \neq \theta_0$$
where $\theta_0$ is the same number as used in the null.
::: example
**Null and alternative for $p_{red}$**
In our roulette example, our null hypothesis is
that the game is fair:
$$H_0: p_{red} = 18/37 \approx 0.486$$
and the alternative hypothesis is that it is not fair:
$$H_1: p_{red} \neq 18/37$$
:::
Notice that there is something of an asymmetry between the null and
alternative hypotheses: the null is typically (though not necessarily)
a single value and the alternative is every other possible value.
:::fyi
**What null hypothesis to choose?**
Our framework here assumes that you already know what null hypothesis you
wish to test, but we might briefly consider how we might choose a null
hypothesis to test.
In some applications there are null hypotheses that are of clear interest
for that specific case:
- In our roulette example, the natural null to test is whether the
win probability matches that of a fair game ($p = p_{fair}$).
- When measuring the effect $\beta$ of one variable on another, the natural
null to test is "no effect at all" ($\beta = 0$).
- When comparing the mean of some characteristic or outcome across two
groups (for example, average wages of men and women), the natural null
to test is that they are the same ($\mu_M = \mu_W$).
- In epidemiology, a contagious disease will tend to spread if its
reproduction rate $R$ is greater than one, and decline if it
is less than one, so $R = 1$ is a natural null to test.
If there is no obvious null hypothesis, it may make sense to test many
null hypotheses and report all of the results. There is nothing
wrong with doing that.
:::
### The test statistic
Our next step is to construct a ***test statistic*** that can be
calculated from our data. A valid test statistic for a given null
hypothesis is a statistic $t_n$ that has the following two
properties:
1. The probability distribution of $t_n$ ***under the null***
(i.e., when $H_0$ is true) is *known*.
2. The probability distribution of $t_n$ ***under the alternative***
(i.e., when $H_1$ is true) is *different* from its
probability distribution under the null.
The test statistic is usually based on an estimator of the parameter,
and is usually constructed so that it is typically close to zero
when the null is true, and far from zero when the null is false. But
it does not need to be.
::: example
***A test statistic for roulette***
A natural test statistic for the win probability of a bet on red
would be the corresponding win frequency in our data. We could
use the relative win frequency (which also happens to
be the sample average):
$$\hat{f}_{red} = \bar{x}_n = \frac{1}{n} \sum_{i=1}^n x_i$$
but it will be more convenient to use the absolute win frequency:
$$t_n = n\hat{f}_{red} = n\bar{x}_n =\sum_{i=1}^n x_i$$
Next we need to find the probability distribution of $t_n$
under the null, and under the alternative.
In general, since $x_i \sim Bernoulli(p_{red})$ we have:
$$t_n \sim Binomial(n,p_{red})$$
Remember that the $Binomial(n,p)$ distribution is the distribution corresponding
to the number of times an event with probability $p$ happens in $n$ independent
trials.
Under the null (when $H_0$ is true), $p_{red} = 18/37$ and so:
$$t_n \sim Binomial(100,18/37)$$
Since this distribution does not involve any unknown parameters,
our test statistic satisfies the requirement of having a known
distribution under the null.
Under the alternative (when $H_1$ is true), $p_{red}$ can take on any value
*other* than $18/37$. The sample size is still $n=100$, so the
distribution of the test statistic is:
$$t_n \sim Binomial(100,p_{red}) \textrm{ where $p_{red} \neq 18/37$ }$$
Notice that the distribution of our test statistic under the alternative
is not known, since $p_{red}$ is not known. But the distribution is
*different* under the alternative, and that is what we require from
our test statistic.
:::
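The null distribution of $t_n$ can also be checked by simulation. The
sketch below (the chunk name and simulation size are my own choices)
draws many samples of $n = 100$ games with the fair win probability and
tabulates the resulting win counts:
```{r SimulateNullDistribution, echo=TRUE}
# Simulate the null distribution of t_n: play n = 100 games
# with the fair win probability 18/37, many times over
set.seed(1)
sims <- rbinom(10000, size = 100, prob = 18/37)
# The simulated mean should be close to n*p = 100*(18/37) = 48.65
cat("simulated mean of t_n =", mean(sims), "\n")
```
A histogram of `sims` would trace out the $Binomial(100,18/37)$
probabilities.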
### Significance and critical values
After choosing a test statistic $t_n$, the next step is to choose
***critical values***. The critical values are two numbers
$c_L$ and $c_H$ (where $c_L < c_H$) such that
- $t_n$ has a *high* probability of being between $c_L$ and $c_H$ when
the null is true.
- $t_n$ has a *lower* probability of being between $c_L$ and $c_H$ when
the alternative is true.
The range of values from $c_L$ to $c_H$ is called the ***critical range***
of our test.
Given the test statistic and critical values:
- We ***reject the null*** if $t_n$ is outside of the critical range.
- This means we have strong evidence that $H_0$ is false.
- The reason we reject here is that we know we would be unlikely to
observe such a value of $t_n$ if $H_0$ were true.
- We ***fail to reject the null*** if $t_n$ is inside of the critical
range.
- This means we do not have strong evidence that $H_0$ is false.
- This does not mean we have strong evidence that $H_0$ is true. We
may just not have enough evidence to reach a conclusion.
Notice that there is an asymmetry here: in the absence of evidence,
we will not reject *any* null hypotheses.
![Critical values and rejecting the null hypothesis](bin/reject.png)
How do we choose critical values? You can think of critical values as setting
a standard of evidence, so we need to balance two considerations:
- The probability of rejecting a false null is called the ***power***
of the test.
- We want our test to reject the null when it is false, so power is good.
- The probability of rejecting a true null is called the ***size***
or ***significance*** of a test.
- We do not want our test to reject the null when it is true, so size
is bad.
- There is always a trade-off between power and size:
- A *narrower* critical range (higher $c_L$ or lower $c_H$) will increase
the rejection rate, increasing both power (good) and size (bad).
- A *wider* critical range (lower $c_L$ or higher $c_H$) will reduce
the rejection rate, reducing both power (bad) and size (good).
Given this trade-off between power and size, we might construct some
criterion that includes both (just like MSE includes both variance and
bias) and choose critical values to maximize that criterion. In practice,
we do not typically do that.
Instead, we follow a simple convention:
1. Set the size to a fixed value $\alpha$.
- In general, the conventional size varies by field, typically
reflecting how much data is available in that field.
- In economics and most other social sciences, the usual convention
is to use a size of 5\% ($\alpha = 0.05$).
- We sometimes see 1\% ($\alpha = 0.01$) when working with larger data sets
or 10\% ($\alpha = 0.10$) when working with small data sets.
- In physics or genetics, where data sets are much larger, the
conventional size is much lower.
2. Calculate critical values that imply the desired size.
- With a size of 5\% $(\alpha = 0.05)$, we would:
- set $c_L$ to the 2.5 percentile (0.025 quantile) of the null distribution
- set $c_H$ to the 97.5 percentile (0.975 quantile) of the null distribution
- With a size of 10\% $(\alpha = 0.10)$, we would:
- set $c_L$ to the 5 percentile (0.05 quantile) of the null distribution
- set $c_H$ to the 95 percentile (0.95 quantile) of the null distribution
- With a size of $\alpha$, we would:
- set $c_L$ to the $\alpha/2$ quantile of the null distribution
- set $c_H$ to the $1-\alpha/2$ quantile of the null distribution
In other words, we set size equal to a conventional value, and let the
power be whatever is implied by that.
::: example
**Critical values for roulette**
We earlier showed that the distribution of $t_n$ under the null is:
$$t_n \sim Binomial(100,18/37)$$
We can get a size of 5\% by choosing:
$$c_L = 2.5 \textrm{ percentile of } Binomial(100,18/37)$$
$$c_H = 97.5 \textrm{ percentile of } Binomial(100,18/37)$$
We can then use Excel or R to calculate these critical values. In Excel,
the function you would use is `BINOM.INV()`
- The formula to calculate $c_L$ is `=BINOM.INV(100,18/37,0.025)`
- The formula to calculate $c_H$ is `=BINOM.INV(100,18/37,0.975)`
The calculations below were done in R:
```{r BinomialCriticalValues}
cat("2.5 percentile of binomial(100,18/37) =",
qbinom(0.025,100,18/37),
"\n")
cat("97.5 percentile of binomial(100,18/37) =",
qbinom(0.975,100,18/37),
"\n")
```
In other words we reject the null (at 5\% significance) that
the roulette wheel is fair if red wins fewer than 39 games
or more than 58 games.
:::
::: fyi
**A general test for a single probability**
We can generalize the test we have constructed so far to the case
of the probability of any event:
| Test component | Roulette example |General case |
|------------------------|--------------------------------------|---------------------------|
| Parameter | $p_{red} = \Pr(\textrm{Red wins})$ | $p = \Pr(\textrm{event})$ |
| Null hypothesis | $H_0:p_{red} = 18/37$ | $H_0:p = p_0$ |
| Alternative hypothesis | $H_1: p_{red} \neq 18/37$ |$H_1: p \neq p_0$ |
| Test statistic | $t = n\hat{f}_{RED}$ | $t = n\hat{f}_{\textrm{event}}$ |
| Null distribution | $Binomial(100,18/37)$ |$Binomial(n,p_0)$ |
| Critical value $c_L$ | 39 | 2.5 percentile of $Binomial(n,p_0)$ |
| Critical value $c_H$ | 58 | 97.5 percentile of $Binomial(n,p_0)$ |
| Decision | Reject if $t \notin [39,58]$ | Reject if $t \notin [c_L,c_H]$ |
:::
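As a sketch of how the decision rule in this table might be coded (the
function `binom_reject` and the chunk name are my own, not from any
package):
```{r GeneralBinomialTest, echo=TRUE}
# Test H0: p = p0 against H1: p != p0 at significance level alpha,
# where t is the number of times the event occurred in n trials
binom_reject <- function(t, n, p0, alpha = 0.05) {
  cL <- qbinom(alpha / 2, n, p0)       # lower critical value
  cH <- qbinom(1 - alpha / 2, n, p0)   # upper critical value
  (t < cL) | (t > cH)                  # TRUE means reject the null
}
# Roulette example: red wins 45 of 100 games (inside [39,58])
binom_reject(45, 100, 18/37)
# Red wins only 38 of 100 games (outside [39,58])
binom_reject(38, 100, 18/37)
```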
### The power of a test
As mentioned above, the power of a test is defined as the probability
of rejecting the null when it is false, and the alternative is true.
The size of a test is a number, since the distribution of the test statistic
is known under the null. Since the alternative typically allows
more than one value of the parameter $\theta$, the power
of a test is not a number but a *function* of the unknown true value
of $\theta$ (and sometimes other unknown features of the DGP):
$$power(\theta) = \Pr(\textrm{reject $H_0$})$$
In some cases we can actually calculate this function.
::: example
**The power curve for roulette**
Power curves can be tricky to calculate, and I will not ask you to calculate
them for this course. But they can be calculated, and it is useful to
see what they look like.
Figure \@ref(fig:PowerCurves) below depicts the power curve for the roulette test we have
just constructed; that is, we are testing the null that $p_{red} = 18/37$
at a 5\% size. The blue line depicts the power curve for $n=100$
as in our example, while the green line depicts the power curve for
$n=20$.
```{r PowerCurves, fig.cap = "*Power curves for the roulette example*"}
PowerCurveData <- tibble(theta = seq(0,1,length.out=380),
power20 = pbinom(qbinom(0.025,
20,
18/37),
20,
theta) +
(1 - pbinom(qbinom(0.975,
20,
18/37),
20,
theta)),
power100 = pbinom(qbinom(0.025,
100,
18/37),
100,
theta) +
(1 - pbinom(qbinom(0.975,
100,
18/37),
100,
theta)))
ggplot(data = PowerCurveData,
mapping = aes(x = theta,y = power20)) +
geom_line(col="green") +
geom_line(aes(y = power100), col="blue") +
xlab("true probability of winning") +
ylab("power") +
geom_text(label="n=100",x=0.7,y=0.8,col="blue") +
geom_text(label="n=20",x=0.9,y=0.8,col="green") +
geom_hline(yintercept = 0.05,col="gray") +
geom_vline(xintercept = 18/37,col="gray") +
labs(title = "Power curve for fair roulette wheel",
subtitle = "",
caption = "H0: Pr(red wins) = 18/37, significance = 0.05",
tag = "")
```
There are a few features I would like you to notice, all of which
are common to most regularly used tests:
- The power curve reaches its lowest value at the point where the gray
lines cross: $(18/37,0.05)$. Note that $18/37$ is the parameter value
under the null, and $0.05$ is the size of the test. In other words:
- The power is always at least as big as the size, and is usually bigger.
- We are more likely to reject the null when it is false than when it
is true. That's good!
- When a test has this desirable property, we call
it an ***unbiased*** test.
- The power increases as $\theta$ gets further from the null.
- That is, we are more likely to detect unfairness in a game
that is *very* unfair than in one that is *a little* unfair.
- Power also increases with the sample size; the blue line
($n = 100$) is above the green line ($n = 20$).
Power analysis is often used by researchers to determine how much data
to collect. Each additional observation increases power but costs money,
so it is important to spend enough to get clear results but not
much more than that.
:::
::: fyi
**P values**
The convention of always using a 5\% significance level for
hypothesis tests is somewhat arbitrary and has some negative
unintended consequences:
1. Sometimes a test statistic falls just below or just above the
critical value, and small changes in the analysis can change
a result from reject to cannot-reject.
2. In many fields, unsophisticated researchers and journal
editors misinterpret "cannot reject the null" as "the null is true."
One common response to these issues is to report what is called
the ***p-value*** of a test. The p-value of a test is defined
as the significance level at which one would switch from rejecting
to not-rejecting the null. For example:
- If the p-value is 0.43 (43\%) we would not reject the null at 10\%,
5\% or 1\%.
- If the p-value is 0.06 (6\%) we would reject the null at 10\% but not
at 5\% or 1\%.
- If the p-value is 0.02 (2\%) we would reject the null at 10\% and 5\%
but not at 1\%.
- If the p-value is 0.001 (0.1\%) we would reject the null at 10\%, 5\%,
and 1\%.
The p-value of a test is simple to calculate from the test statistic and
its distribution under the null. I won't go through that calculation
here.
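For the curious, here is a minimal sketch of that calculation for the
roulette test (doubling the smaller tail, as done here, is one common
convention for two-sided tests with a discrete statistic, not the only
one):
```{r BinomialPValue, echo=TRUE}
# Two-sided p-value when red wins t_obs = 40 of 100 games,
# under the null distribution Binomial(100, 18/37)
t_obs <- 40
p_lo <- pbinom(t_obs, 100, 18/37)          # Pr(t_n <= 40) under H0
p_hi <- 1 - pbinom(t_obs - 1, 100, 18/37)  # Pr(t_n >= 40) under H0
pval <- min(1, 2 * min(p_lo, p_hi))        # cap at 1
cat("p-value =", round(pval, 3), "\n")
```
Since 40 wins is inside the critical range $[39,58]$, this p-value comes
out above 0.05, matching the decision not to reject at 5\% significance.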
:::
## Application: the mean
Having described the general framework and a single example, we now
move on to the most common application: constructing hypothesis
tests and confidence intervals on the mean in a random sample.
Let $D = (x_1,\ldots,x_n)$ be a random sample of size $n$
on some random variable $x_i$ with unknown mean $E(x_i) = \mu_x$
and variance $var(x_i) = \sigma_x^2$.
Let the sample average be:
$$\bar{x}_n = \frac{1}{n} \sum_{i=1}^n x_i$$
and the sample variance:
$$s_x^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2$$
Both of these statistics are easily calculated from the data,
and we have previously discussed their properties in detail.
### The null and alternative hypotheses
Suppose that you want to test the null hypothesis:
$$H_0: \mu_x = 1$$
against the alternative hypothesis
$$H_1: \mu_x \neq 1$$
Having stated our null and alternative hypotheses, we need to
construct a test statistic.
Remember that our test statistic needs to have a *known* distribution
under the null, and a *different* distribution under the alternative.
### The T statistic
The typical test statistic we use in this setting is called the
***T statistic***, and takes the form:
$$t_n = \frac{\bar{x}_n - 1}{s_x/\sqrt{n}}$$
The idea here is that we take our estimate of the parameter ($\bar{x}$),
subtract its expected value under the null ($1$), and divide
by an estimate of its standard deviation ($s_x/\sqrt{n}$). We can
add and subtract the unknown true mean $\mu_x$ to get:
\begin{align}
t_n &= \frac{\bar{x}_n - \mu_x + \mu_x - 1}{s_x/\sqrt{n}} \\
&= \frac{\bar{x}_n - \mu_x}{s_x/\sqrt{n}} + \frac{\mu_x - 1}{s_x/\sqrt{n}}
\end{align}
The first part of this expression is a random variable with a mean of zero
and a variance of (about) one. The second part of the expression is
exactly zero when $H_0$ is true, and not exactly zero when it is false.
Recall that we need the probability distribution of $t_n$ to be known when
$H_0$ is true, and different when it is false. The second criterion is
clearly met, and the first criterion is met if we can find the
probability distribution of $\frac{\bar{x}_n - \mu_x}{s_x/\sqrt{n}}$.
Unfortunately, if we don't know the exact probability distribution
of $x_i$ we don't know the exact probability distribution of statistics
calculated from it. Once we have a potential test statistic, there
are two standard solutions to this problem:
1. Assume a specific probability distribution (usually a normal
distribution) for $x_i$. We can (or at least a
professional statistician can) then mathematically derive
the distribution of any test statistic from this distribution.
2. Use the central limit theorem to get an approximate probability
distribution.
We will explore both of those options.
### Asymptotic critical values
We will start with the asymptotic solution to the problem. As we learned in
Chapter \@ref(statistics), the Central Limit Theorem tells us that:
$$\frac{\bar{x}_n - \mu_x}{\sigma_x/\sqrt{n}} \rightarrow N(0,1)$$
Under the null our test statistic looks just like this, but with the sample
standard deviation $s_x$ in place of the population standard deviation
$\sigma_x$. It turns out that Slutsky's theorem allows us to make
this substitution, and it can be proved that:
$$\frac{\bar{x}_n - \mu_x}{s_x/\sqrt{n}} \rightarrow N(0,1)$$
Therefore, under the null:
$$t_n \rightarrow N(0,1)$$
In other words, while we do not know the exact (finite-sample) distribution
of $t_n$ we know that $N(0,1)$ provides a useful asymptotic approximation
to that distribution.
```{r NormalDistribution, fig.cap = "*Asymptotic distribution of t_n under the null*"}
simdata <- tibble(x = seq(-3,3,length.out = 100),
tinf = dnorm(x))
ggplot(data=simdata,
mapping = aes(x= x,
y= tinf)) +
geom_line() +
xlab("value") +
ylab("PDF of t_n") +
labs(title = "Asymptotic distribution of t_n under null",
subtitle = "",
caption = "",
tag = "")
```
Therefore, if we want a test that has the ***asymptotic size*** of 5\%, we
can use Excel or R to calculate critical values. In Excel, the function
would be `NORM.INV` or `NORM.S.INV`, and the formulas would be:
- $c_L$: `=NORM.S.INV(0.025)` or `=NORM.INV(0.025,0,1)`.
- $c_H$: `=NORM.S.INV(0.975)` or `=NORM.INV(0.975,0,1)`.
The calculations below were done in R:
```{r AsymptoticCriticalValues}
cat("cL = 2.5 percentile of N(0,1) = ",
round(qnorm(0.025),3),
"\n")
cat("cH = 97.5 percentile of N(0,1) = ",
round(qnorm(0.975),3),
"\n")
```
These particular critical values are so commonly used that I want you
to remember them.
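Putting the pieces together, here is a minimal sketch of the asymptotic
test (the function name and the made-up sample are my own):
```{r AsymptoticMeanTest, echo=TRUE}
# T statistic for H0: mu_x = mu0 in a random sample x
t_stat <- function(x, mu0) {
  sqrt(length(x)) * (mean(x) - mu0) / sd(x)
}
# Made-up sample; reject at 5% if |t_n| > 1.96
x <- c(0.9, 1.4, 0.7, 1.2, 1.1, 0.8, 1.3, 1.0)
tn <- t_stat(x, mu0 = 1)
cat("t_n =", round(tn, 3),
    "; reject at 5% =", abs(tn) > qnorm(0.975), "\n")
```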
### Exact critical values
Most economic data comes in sufficiently large samples that the asymptotic
distribution of $t_n$ is a reasonable approximation and the asymptotic
test works well. But occasionally we have samples that are small enough
that it doesn't.
Another option is to assume that the $x_i$ variables are normally
distributed:
$$x_i \sim N(\mu_x,\sigma_x^2)$$
where $\mu_x$ and $\sigma_x^2$ are unknown parameters.
Now at this point it is important to remind you: many interesting
variables are *not* normally distributed (for example, our roulette outcome
is discrete uniform, and the result of a given bet is Bernoulli)
and so this assumption may very well be incorrect.
If this were a more advanced course, I would derive the
distribution of $t_n$ under the null. But for ECON 233, I will
just ask you to understand that it *can* be derived once we
assume normality of the $x_i$.
The null distribution of this particular test statistic under these
particular assumptions was derived in the 1920s by William Sealy Gosset,
a statistician working at the Guinness brewery. To avoid getting in
trouble at work (Guinness did not want to give away trade secrets),
Gosset published under the pseudonym "Student". As a result, the family
of distributions he derived is called "Student's T distribution".
When the null is true, the test statistic $t_n$ as described above
has Student's T distribution with $n-1$ degrees of freedom:
$$t_n \sim T_{n-1}$$
As always, it has a different distribution (sometimes called the "noncentral
T distribution") when the null is false.
The $T_{n-1}$ distribution looks a lot like the $N(0,1)$
distribution, but has slightly higher probability of extreme
positive or negative values. As $n$ increases the
$T_{n-1}$ distribution converges to the
$N(0,1)$ distribution, just as predicted by the central
limit theorem.
```{r TDistribution, fig.cap = "*Exact distribution of t_n under the null*"}
simdata <- tibble(x = seq(-3,3,length.out = 100),
t5 = dt(x,df=4),
t10 = dt(x,df=9),
t30 = dt(x,df=29),
tinf = dnorm(x))
ggplot(data = simdata,
mapping = aes(x = x)) +
geom_line(aes(y = t5), col = "green") +
geom_line(aes(y = t10), col = "blue") +
geom_line(aes(y = t30), col = "red") +
geom_line(aes(y = tinf), col = "black") +
geom_text(label="n = infinity",x=0.75,y=0.4,col="black") +
geom_text(label="n = 5",x=2.5,y=0.1,col="green") +
geom_text(label="n = 10",x=1.7,y=0.2,col="blue") +
geom_text(label="n = 30",x=1.2,y=0.3,col="red") +
xlab("value") +
ylab("PDF") +
labs(title = "Exact distribution of t_n under null",
subtitle = "",
caption = "(x assumed to be normally distributed)",
tag = "")
```
Having found our test statistic and its distribution under the null,
we can calculate our critical values:
$$c_L = 2.5 \textrm{ percentile of } T_{n-1}$$
$$c_H = 97.5 \textrm{ percentile of } T_{n-1}$$
We can obtain these percentiles using Excel or R. In Excel, the
relevant function is `T.INV`. For example, if we have 5 observations, then:
- We would calculate $c_L$ by the formula `=T.INV(0.025,4)`.
- We would calculate $c_H$ by the formula `=T.INV(0.975,4)`.
The results (calculated below using R) would be:
```{r T4CriticalValues}
cat("cL = 2.5 percentile of T_4 = ",
round(qt(0.025,df=4),3),
"\n")
cat("cH = 97.5 percentile of T_4 = ",
round(qt(0.975,df=4),3),
"\n")
```
In contrast, if we have 30 observations, then:
- We would calculate $c_L$ by the formula `=T.INV(0.025,29)`.
- We would calculate $c_H$ by the formula `=T.INV(0.975,29)`.
The results (calculated below using R) would be:
```{r T29CriticalValues}
cat("cL = 2.5 percentile of T_29 = ",
round(qt(0.025,df=29),3),
"\n")
cat("cH = 97.5 percentile of T_29 = ",
round(qt(0.975,df=29),3),
"\n")
```
and if we have 1,000 observations:
- We would calculate $c_L$ by the formula `=T.INV(0.025,999)`.
- We would calculate $c_H$ by the formula `=T.INV(0.975,999)`.
The results (calculated below using R) would be:
```{r T999CriticalValues}
cat("cL = 2.5 percentile of T_999 = ",
round(qt(0.025,df=999),3),
"\n")
cat("cH = 97.5 percentile of T_999 = ",
round(qt(0.975,df=999),3),
"\n")
```
Notice that the Student's T test is more ***conservative*** (less likely
to reject) than the asymptotic test for smaller sample sizes. But
at some point (around $n = 30$) the difference between the two tests
becomes negligible.
In practice, most data sets in economics have well over 30 observations so
economists tend to use asymptotic tests unless they have a very small
sample.
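In R, the exact test is built in as `t.test()`. A quick sketch on a
made-up small sample (the data here are my own invention):
```{r TTestExample, echo=TRUE}
# R's built-in t.test() implements the exact T test of
# H0: mu_x = 1 against H1: mu_x != 1
x <- c(0.8, 1.5, 0.9, 1.3, 1.1)
test <- t.test(x, mu = 1)
cat("T statistic =", round(test$statistic, 3),
    "; p-value =", round(test$p.value, 3), "\n")
```
The returned object also contains the matching 95\% confidence interval
in `test$conf.int`.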
## Confidence intervals
Hypothesis tests have one very important limitation: although they
allow us to rule out $\theta = \theta_0$ for a single value of $\theta_0$,
they say nothing about other values very close to $\theta_0$.
For example, suppose you are a medical researcher trying to measure the
effect of a particular treatment, let $\theta$ be the treatment effect, and
suppose that you have tested the null hypothesis that the treatment
has no effect ($\theta = 0$).
- If you reject this null, you have concluded that the treatment has *some*
effect. However, that does not rule out the possibility that
the effect of the treatment is *very small*.
- If you fail to reject this null, you cannot rule out the possibility
that the treatment has *no* effect. However, this does not rule out
the possibility that the effect is *very large*.
The solution to this would be to do a hypothesis test for every possible
value of $\theta$, and classify them into values that were rejected
and not rejected. This is the idea of a ***confidence interval***.
A confidence interval for the parameter $\theta$ with coverage
rate $CP$ is an interval with lower bound $CI_L$ and
upper bound $CI_H$ constructed from the data in such a way that
$$\Pr(CI_L < \theta < CI_H) = CP$$
In economics and most other social sciences, the convention is to report
confidence intervals with a coverage rate of 95\%:
$$\Pr(CI_L < \theta < CI_H) = 0.95$$
Note that $\theta$ is a fixed (but unknown) parameter, while $CI_L$
and $CI_H$ are statistics calculated from the data.
How do we calculate confidence intervals? It turns out to be
entirely straightforward: confidence intervals can be constructed
by inverting hypothesis tests:
- The 95\% confidence interval is all values that cannot be
rejected at a 5\% level of significance.
- The 90\% confidence interval is all values that cannot be
rejected at a 10\% level of significance.
- It is *narrower* than the 95\% confidence interval.
- The 99\% confidence interval is all values that cannot be
rejected at a 1\% level of significance.
- It is *wider* than the 95\% confidence interval.
::: example
**A confidence interval for the win probability**
Calculating a confidence interval for $p_{red}$ is somewhat
tricky to do by hand, but easy to do on a computer:
1. Construct a grid of many values between 0 and 1.
2. For each value $p_0$ in the grid, test the null hypothesis
$H_0: p_{red} = p_0$ against the alternative hypothesis
$H_1: p_{red} \neq p_0$.
3. The confidence interval is the range of values for $p_0$
that are not rejected.
For example, suppose that red wins on 40 of the 100 games. Then
a 95\% confidence interval for $p_{red}$ is:
```{r RouletteCI}
theta <- seq(0,1,length.out=101)
cL <- qbinom(0.025,100,theta)
cH <- qbinom(0.975,100,theta)
thetaCI <- range(theta[cL <= 40 & cH >= 40])
cat(thetaCI[1]," to ",thetaCI[2],"\n")
```
Notice that the confidence interval includes the fair value of $0.486$
but it also includes some very unfair values. In other words, while
we are unable to rule out the possibility that we have a fair game,
the evidence that we have a fair game is not very strong.
:::
### Confidence intervals for the mean
Confidence intervals for the mean are very easy to calculate. Again
we construct them by inverting the hypothesis test.
Pick any $\mu_0$. To test the null
$$H_0: \mu_x = \mu_0$$
our test statistic is:
$$t_n = \sqrt{n}\frac{\bar{x}-\mu_0}{s_x}$$
and we fail to reject the null if
$$c_L < t_n < c_H$$
where $c_L$ and $c_H$ are our critical values.
Plugging $t_n$ into this expression, we fail to reject the null whenever:
$$c_L < \sqrt{n}\frac{\bar{x}-\mu_0}{s_x} < c_H$$
Solving for $\mu_0$ we fail to reject whenever:
$$\bar{x} - c_H s_x/\sqrt{n} < \mu_0 < \bar{x} - c_L s_x/\sqrt{n}$$
All that remains is to choose a confidence/size level, and
decide whether to use an asymptotic or finite sample test.
If we are using the asymptotic approximation to construct
a 95\% confidence interval, then the 5\% asymptotic critical values
are $c_L \approx -1.96$ and $c_H \approx 1.96$, and the
confidence interval is:
$$CI = \bar{x} \pm 1.96 s_x/\sqrt{n}$$
In other words, the 95\% confidence interval for $\mu_x$
is just the point estimate plus or minus roughly 2 standard
errors.
If we have a small sample, and choose to assume normality rather
than using the asymptotic approximation, then we need to use the
slightly larger critical values from the $T_{n-1}$ distribution.
For example, if $n=5$, then $c_L \approx -2.78$,
$c_H \approx 2.78$ and the 95\% confidence interval is:
$$CI = \bar{x} \pm 2.78 s_x/\sqrt{n}$$
As with hypothesis tests, finite sample confidence intervals
are typically more conservative (wider) than their asymptotic
cousins, but the difference becomes negligible as
the sample size increases.
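These formulas are straightforward to implement. Here is a small sketch
(the function name and sample are my own) covering both cases:
```{r MeanCI, echo=TRUE}
# 95% confidence interval for the mean: asymptotic (normal critical
# values) or exact (T critical values, assuming normally distributed data)
mean_ci <- function(x, exact = FALSE) {
  n <- length(x)
  cH <- if (exact) qt(0.975, df = n - 1) else qnorm(0.975)
  mean(x) + c(-1, 1) * cH * sd(x) / sqrt(n)
}
x <- c(0.8, 1.5, 0.9, 1.3, 1.1)
cat("asymptotic 95% CI:", round(mean_ci(x), 3), "\n")
cat("exact 95% CI:     ", round(mean_ci(x, exact = TRUE), 3), "\n")
```
As expected, the exact interval is wider than the asymptotic one for
this small sample.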
## Chapter review {#review-inference}
To be added
### Practice problems {#problems-inference}
To be added.