# Logistic regression {#sec-model-logistic}
```{r}
#| include: false
source("_common.R")
```
\vspace{-10mm}
::: {.chapterintro data-latex=""}
In this chapter we introduce **logistic regression**\index{logistic regression} as a tool for building models when there is a categorical response variable with two levels, e.g., yes and no.
Logistic regression is a type of **generalized linear model (GLM)**\index{generalized linear model} for response variables where regular multiple regression does not work very well.
GLMs can be thought of as a two-stage modeling approach.
We first model the response variable using a probability distribution, such as the binomial or Poisson distribution.
Second, we model the parameter of the distribution using a collection of predictors and a special form of multiple regression.
Ultimately, the application of a GLM will feel very similar to multiple regression, even if some of the details are different.
:::
```{r}
#| include: false
terms_chp_09 <- c("logistic regression", "generalized linear model")
```
\vspace{-8mm}
## Discrimination in hiring
We will consider experimental data from a study that sought to understand the effect of race and sex on job application callback rates [@bertrand2003].
To evaluate which factors were important, the researchers identified job postings in Boston and Chicago and created many fake resumes to send to these jobs to see which would elicit a callback.[^09-model-logistic-1]
The researchers enumerated important characteristics, such as years of experience and education details, and they used these characteristics to randomly generate fake resumes.
Finally, they randomly assigned a name to each resume, where the name would imply the applicant's sex and race.
[^09-model-logistic-1]: We omit discussion of some structure in the data for the analysis presented: the experimental design included blocking, where typically four resumes were sent to each job, one for each race/sex combination (as inferred from the first name).
We did not account for the blocking, since doing so would *reduce* the standard error without notably changing the point estimates for the `race` and `sex` variables relative to the analysis performed in this section.
That is, the most interesting conclusions in the study are unaffected even when completing a more sophisticated analysis.
::: {.data data-latex=""}
The [`resume`](http://openintrostat.github.io/openintro/reference/resume.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.
:::
The first names that were used and randomly assigned in the experiment were selected so that they would predominantly be recognized as belonging to Black or White individuals; other races were not considered in the study.
While no name would definitively be inferred as belonging to a Black individual or to a White individual, the researchers conducted a survey to check the racial associations of the names; names that did not pass the survey check were excluded from the experiment.
You can find the full set of names that did pass the survey test and were ultimately used in the study in @tbl-resume-names.
For example, Lakisha was a name that their survey indicated would be interpreted as a Black woman, while Greg was a name that would generally be interpreted to be associated with a White male.
\clearpage
```{r}
#| label: tbl-resume-names
#| tbl-cap: |
#| List of all 36 unique names along with the commonly inferred
#| race and sex associated with these names.
#| tbl-pos: H
resume_names_full <- resume |>
select(firstname, race, gender) |>
distinct(firstname, .keep_all = TRUE) |>
arrange(firstname) |>
rownames_to_column() |>
mutate(
rowname = as.numeric(rowname),
column = cut(rowname, breaks = c(0, 12, 24, 36)),
race = str_to_title(race),
sex = if_else(gender == "f", "female", "male"),
column = as.numeric(column)
) |>
select(-rowname, -gender) |>
relocate(column)
resume_names_1 <- resume_names_full |>
filter(column == 1) |>
select(-column)
resume_names_2 <- resume_names_full |>
filter(column == 2) |>
select(-column)
resume_names_3 <- resume_names_full |>
filter(column == 3) |>
select(-column)
resume_names_1 |>
bind_cols(resume_names_2) |>
bind_cols(resume_names_3) |>
kbl(
linesep = "", booktabs = TRUE, align = "lllllllll",
col.names = c(
"first_name", "race", "sex",
"first_name", "race", "sex",
"first_name", "race", "sex"
)
) |>
kable_styling(
bootstrap_options = c("striped", "condensed"),
latex_options = c("striped"), full_width = FALSE
) |>
column_spec(4, border_left = T) |>
column_spec(7, border_left = T)
```
```{r}
#| label: resume-data-prep
resume <- resume |>
rename(sex = gender) |>
mutate(
sex = if_else(sex == "m", "man", "woman"),
sex = fct_relevel(sex, "woman", "man"),
received_callback = as.factor(received_callback),
college_degree = as.factor(college_degree),
honors = as.factor(honors),
military = as.factor(military),
has_email_address = as.factor(has_email_address),
race = if_else(race == "black", "Black", "White")
) |>
select(
received_callback, job_city, college_degree, years_experience,
honors, military, has_email_address, race, sex
)
```
The response variable of interest is whether the applicant received a callback from the employer, and we will consider 8 randomly assigned attributes, with special interest in the race and sex variables.
Race and sex are protected classes in the United States, meaning they are not legally permitted factors for hiring or employment decisions.
The full set of attributes considered is provided in @tbl-resume-variables.
```{r}
#| label: tbl-resume-variables
#| tbl-cap: |
#| Descriptions of nine variables from the `resume` dataset. Many
#| of the variables are indicator variables, meaning they take
#| the value 1 if the specified characteristic is present and
#| 0 otherwise.
#| tbl-pos: H
resume_variables <- tribble(
~variable, ~description,
"received_callback", "Specifies whether the employer called the applicant following submission of the application for the job.",
"job_city", "City where the job was located: Boston or Chicago.",
"college_degree", "An indicator for whether the resume listed a college degree.",
"years_experience", "Number of years of experience listed on the resume.",
"honors", "Indicator for the resume listing some sort of honors, e.g. employee of the month.",
"military", "Indicator for if the resume listed any military experience.",
"has_email_address", "Indicator for if the resume listed an email address for the applicant.",
"race", "Race of the applicant, implied by their first name listed on the resume.",
"sex", "Sex of the applicant (limited to only man and woman), implied by the first name listed on the resume."
)
resume_variables |>
kbl(linesep = "", booktabs = TRUE) |>
kable_styling(
bootstrap_options = c("striped", "condensed"),
full_width = FALSE
) |>
column_spec(1, monospace = TRUE) |>
column_spec(2, width = "30em")
```
All of the attributes listed on each resume were randomly assigned, which means that no attributes that might be favorable or detrimental to employment would favor one demographic over another on these resumes.
Importantly, due to the experimental nature of the study, we can infer causation between these variables and the callback rate, if substantial differences are found.
Our analysis will allow us to compare the practical importance of each of the variables relative to each other.
\clearpage
## Modeling the probability of an event {#sec-modelingTheProbabilityOfAnEvent}
Logistic regression is a generalized linear model where the outcome is a two-level categorical variable.
The outcome, $Y_i$, takes the value 1 (in our application, the outcome represents a callback for the resume) with probability $p_i$ and the value 0 with probability $1 - p_i$.
Because each observation has a slightly different context, e.g., different education level or a different number of years of experience, the probability $p_i$ will differ for each observation.
Ultimately, it is the **probability**\index{probability of an event} of the outcome taking the value 1 (i.e., being a "success") that we model in relation to the predictor variables: we will examine which resume characteristics correspond to higher or lower callback rates.
::: {.important data-latex=""}
**Notation for a logistic regression model.**
The outcome variable for a GLM is denoted by $Y_i$, where the index $i$ is used to represent observation $i$.
In the resume application, $Y_i$ will be used to represent whether resume $i$ received a callback ($Y_i=1$) or not ($Y_i=0$).
:::
```{r}
#| include: false
terms_chp_09 <- c(terms_chp_09, "probability of an event")
```
The predictor variables are represented as follows: $x_{1,i}$ is the value of variable 1 for observation $i$, $x_{2,i}$ is the value of variable 2 for observation $i$, and so on.
$$
transformation(p_i) = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \cdots + \beta_k x_{k,i}
$$
We want to choose a **transformation**\index{transformation} in the equation that makes practical and mathematical sense.
For example, we want a transformation that makes the range of possibilities on the left-hand side of the equation equal to the range of possibilities for the right-hand side; if there were no transformation in the equation, the left-hand side could only take values between 0 and 1, but the right-hand side could take values well outside of the range from 0 to 1.
```{r}
#| include: false
terms_chp_09 <- c(terms_chp_09, "transformation")
```
A common transformation for $p_i$ is the **logit transformation**\index{logit transformation}, which may be written as
$$
logit(p_i) = \log_{e}\left( \frac{p_i}{1-p_i} \right)
$$
The **logit transformation**\index{logit transformation} is shown in @fig-logit-transformation.
Below, we rewrite the equation relating $Y_i$ to its predictors using the logit transformation of $p_i$:
```{r}
#| include: false
terms_chp_09 <- c(terms_chp_09, "logit transformation")
```
$$
\log_{e}\left( \frac{p_i}{1-p_i} \right) = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \cdots + \beta_k x_{k,i}
$$
```{r}
#| label: fig-logit-transformation
#| fig-cap: Values of $p_i$ against values of $logit(p_i)$.
#| fig-alt: |
#|   A curve showing values of logit(p) from -5 to 6 on the x-axis and the
#|   corresponding values of p, between 0 and 1, on the y-axis, with
#|   labeled points at integer values of logit(p).
#| out-width: 100%
#| fig-asp: 0.45
logit_df_line <- tibble(
p = seq(0.0001, 0.9999, 0.0001),
lp = log(p / (1 - p))
)
logit_df_point <- tibble(
lp = -5:6,
p = round(exp(lp) / (exp(lp) + 1), 3)
) |>
mutate(
label = glue::glue("({lp}, {p})"),
label_pos = case_when(
lp %in% c(-5, -3, 4, 6) ~ "above",
lp %in% c(-4, 3, 5) ~ "below",
lp == -2 ~ "right",
lp %in% c(-1, 0, 1, 2) ~ "left",
)
)
ggplot(logit_df_line, aes(x = lp, y = p)) +
geom_hline(yintercept = 0, linetype = "dashed", color = IMSCOL["blue", "full"]) +
geom_hline(yintercept = 1, linetype = "dashed", color = IMSCOL["blue", "full"]) +
geom_line(linewidth = 0.5) +
coord_cartesian(
xlim = c(-5.5, 6.15),
ylim = c(-0.05, 1.1)
) +
geom_point(
data = logit_df_point, shape = "circle open",
size = 3, color = IMSCOL["red", "full"], stroke = 2
) +
geom_text(
data = logit_df_point |> filter(label_pos == "above"),
aes(label = label), vjust = -2
) +
geom_text(
data = logit_df_point |> filter(label_pos == "below"),
aes(label = label), vjust = 2.5
) +
geom_text(
data = logit_df_point |> filter(label_pos == "right"),
aes(label = label), hjust = -0.25
) +
geom_text(
data = logit_df_point |> filter(label_pos == "left"),
aes(label = label), hjust = 1.25
) +
labs(
x = expression(logit(p[i])),
y = expression(p[i])
)
```
In our resume example, there are 8 predictor variables, so $k = 8$.
While the precise choice of the logit function isn't intuitive, it is based on theory that underpins generalized linear models, which is beyond the scope of this book.
Fortunately, once we fit a model using software, it will start to feel like we are back in the multiple regression context, even if the interpretation of the coefficients is more complex.
To convert from values on the logistic regression scale to the probability scale, we need to back transform and then solve for $p_i$:
$$
\begin{aligned}
\log_{e}\left( \frac{p_i}{1-p_i} \right) &= \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i} \\
\frac{p_i}{1-p_i} &= e^{\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}} \\
p_i &= \left( 1 - p_i \right) e^{\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}} \\
p_i &= e^{\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}} - p_i \times e^{\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}} \\
p_i + p_i \text{ } e^{\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}} &= e^{\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}} \\
p_i(1 + e^{\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}}) &= e^{\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}} \\
p_i &= \frac{e^{\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}}}{1 + e^{\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}}}
\end{aligned}
$$
As with most applied data problems, we substitute in the point estimates (the observed $b_i$) to calculate relevant probabilities.
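The last line of this derivation is the logistic (inverse-logit) function, which maps any value on the logit scale back to a probability between 0 and 1. As a minimal sketch in R (the function name `inv_logit` below is ours, chosen only for illustration), the transformation can be written directly or computed with the built-in `plogis()`, while `qlogis()` gives the logit itself:
```{r}
#| eval: false
# Inverse-logit (logistic) function: maps any real number into (0, 1)
inv_logit <- function(x) exp(x) / (1 + exp(x))

inv_logit(0)      # a logit of 0 corresponds to a probability of 0.5
plogis(0)         # built-in equivalent of the inverse-logit
qlogis(0.5)       # built-in logit: log(0.5 / (1 - 0.5)) = 0
```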
::: {.workedexample data-latex=""}
We start by fitting a model with a single predictor: `honors`.
This variable indicates whether the applicant had any type of honors listed on their resume, such as employee of the month.
A logistic regression model was fit using statistical software and the following model was found:
$$\log_e \left( \frac{\widehat{p}_i}{1-\widehat{p}_i} \right) = -2.4998 + 0.8668 \times {\texttt{honors}}$$
a. If a resume is randomly selected from the study and it does not have any honors listed, what is the probability it resulted in a callback?
b. What would the probability be if the resume did list some honors?
------------------------------------------------------------------------
a. If a randomly chosen resume from those sent out is considered, and it does not list honors, then `honors` takes the value of 0 and the right side of the model equation equals -2.4998. Solving for $p_i$: $\frac{e^{-2.4998}}{1 + e^{-2.4998}} = 0.076$. Just as we labeled a fitted value of $y_i$ with a "hat" in single-variable and multiple regression, we do the same for this probability: $\hat{p}_i = 0.076{}$.
b. If the resume had listed some honors, then the right side of the model equation is $-2.4998 + 0.8668 \times 1 = -1.6330$, which corresponds to a probability $\hat{p}_i = 0.163$. Notice that we could examine -2.4998 and -1.6330 in @fig-logit-transformation to estimate the probability before formally calculating the value.
:::
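The model in this worked example can be reproduced with the same tidymodels workflow used in the chunks throughout this chapter; the sketch below assumes the `resume` data frame prepared earlier is available and that the packages loaded via `_common.R` are attached.
```{r}
#| eval: false
# Fit the single-predictor logistic regression of callback on honors
honors_fit <- logistic_reg() |>
  fit(received_callback ~ honors, data = resume)

# Coefficients are reported on the logit (log-odds) scale
tidy(honors_fit)

# Back-transform the two fitted logit values to probabilities
plogis(-2.4998)             # no honors listed: about 0.076
plogis(-2.4998 + 0.8668)    # honors listed: about 0.163
```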
While knowing whether a resume listed honors provides some signal for predicting whether the employer would call back, we would like to account for many different variables at once to understand how each of the resume characteristics affected the chance of a callback.
\clearpage
## Logistic model with many variables
We used statistical software to fit the logistic regression model with all 8 predictors described in @tbl-resume-variables.
Like multiple regression, the result may be presented in a summary table, which is shown in @tbl-resume-full-fit.
```{r}
#| label: tbl-resume-full-fit
#| tbl-cap: |
#| Summary table for the full logistic regression model for the
#| resume callback example.
#| tbl-pos: H
resume_full_fit <- logistic_reg() |>
fit(received_callback ~ job_city + college_degree + years_experience + honors + military + has_email_address + race + sex, data = resume, family = "binomial")
resume_full_fit |>
tidy() |>
mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) |>
kbl(
linesep = "", booktabs = TRUE,
digits = 2, align = "lrrrrr"
) |>
kable_styling(
bootstrap_options = c("striped", "condensed"),
latex_options = c("striped")
) |>
column_spec(1, width = "10em", monospace = TRUE) |>
column_spec(2:5, width = "5em")
```
Just like multiple regression, we could trim some variables from the model.
Here we'll use a statistic called **Akaike information criterion (AIC)**\index{AIC}\index{Akaike information criterion}, which is analogous to how we used adjusted $R^2$ in multiple regression.
AIC is a popular model selection method used in many disciplines, and is praised for its emphasis on model uncertainty and parsimony.
AIC selects a "best" model by ranking models from best to worst according to their AIC values.
In the calculation of a model's AIC, a penalty is given for including additional variables.
The penalty for added model complexity attempts to strike a balance between underfitting (too few variables in the model) and overfitting (too many variables in the model).
When using AIC for model selection, models with a lower AIC value are considered to be "better." Remember that when using adjusted $R^2$ we select models with higher values instead.
It is important to note that AIC provides information about the quality of a model relative to other models, but does not provide information about the overall quality of a model.
@tbl-resume-full-fit-aic provides the AIC and the number of observations used to fit the model. We also know from @tbl-resume-full-fit that eight variables (with nine coefficients, including the intercept) were fit.
```{r}
#| label: tbl-resume-full-fit-aic
#| tbl-cap: |
#| AIC for the full logistic regression model fit to the full
#| resume callback example.
#| tbl-pos: H
resume_full_fit <- logistic_reg() |>
fit(received_callback ~ job_city + college_degree + years_experience + honors + military + has_email_address + race + sex, data = resume, family = "binomial")
resume_full_fit |>
glance() |>
select(AIC, number_observations = nobs) |>
kbl(
linesep = "", booktabs = TRUE,
digits = 2, align = "rr"
) |>
kable_styling(
bootstrap_options = c("striped", "condensed"),
latex_options = c("striped"),
full_width = FALSE
) |>
column_spec(1:2, width = "10em", monospace = TRUE)
```
We will look for models with a lower AIC using a backward elimination strategy. @tbl-resume-fit-aic provides the AIC value for the model with variables as given in @tbl-resume-fit. Notice that the same number of observations is used, but with one fewer variable (`college_degree` is dropped from the model).
```{r}
#| label: tbl-resume-fit-aic
#| tbl-cap: |
#| AIC for the logistic regression model fit to the
#| resume callback example without `college_degree`.
#| tbl-pos: H
resume_fit <- logistic_reg() |>
fit(received_callback ~ job_city + years_experience + honors + military + has_email_address + race + sex, data = resume, family = "binomial")
resume_fit |>
glance() |>
select(AIC, number_observations = nobs) |>
kbl(
linesep = "", booktabs = TRUE,
digits = 2, align = "rrr"
) |>
kable_styling(
bootstrap_options = c("striped", "condensed"),
latex_options = c("striped"),
full_width = FALSE
) |>
column_spec(1:2, width = "10em", monospace = TRUE)
```
After applying the AIC criterion, the variable `college_degree` is eliminated (the AIC value without `college_degree` is smaller than the AIC value for the full model), giving the model with fewer variables summarized in @tbl-resume-fit, which is what we'll rely on for the remainder of the section.
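Backward elimination by AIC can also be automated. The following sketch uses base R's `glm()` and `step()` rather than the tidymodels interface used elsewhere in this chapter; it assumes the `resume` data frame prepared earlier is available, and it drops one variable at a time whenever doing so lowers the AIC.
```{r}
#| eval: false
# Full model fit with base R's glm()
full_glm <- glm(
  received_callback ~ job_city + college_degree + years_experience +
    honors + military + has_email_address + race + sex,
  data = resume, family = binomial
)

# step() removes, one at a time, the variable whose removal lowers the AIC
# the most, and stops when no single removal lowers the AIC further
reduced_glm <- step(full_glm, direction = "backward")
summary(reduced_glm)
```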
```{r}
#| include: false
terms_chp_09 <- c(terms_chp_09, "Akaike information criterion", "AIC")
```
```{r}
#| label: tbl-resume-fit
#| tbl-cap: |
#| Summary table for the logistic regression model for the resume
#| callback example, where variable selection has been performed
#| using AIC and `college_degree` has been dropped from the model.
#| tbl-pos: H
resume_fit <- logistic_reg() |>
fit(received_callback ~ job_city + years_experience + honors + military + has_email_address + race + sex, data = resume, family = "binomial")
resume_fit |>
tidy() |>
mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) |>
kbl(
linesep = "", booktabs = TRUE,
digits = 2, align = "lrrrrr"
) |>
kable_styling(
bootstrap_options = c("striped", "condensed"),
latex_options = c("striped")
) |>
column_spec(1, width = "10em", monospace = TRUE) |>
column_spec(2:5, width = "5em")
```
::: {.workedexample data-latex=""}
The `race` variable had taken only two levels: `Black` and `White`.
Based on the model results, what does the coefficient of the `race` variable say about callback decisions?
------------------------------------------------------------------------
The coefficient shown corresponds to the level of `White`, and it is positive.
The positive coefficient reflects a positive gain in callback rate for resumes where the candidate's first name implied they were White.
The model results suggest that prospective employers favor resumes where the first name is typically interpreted to be White.
:::
The coefficient of $\texttt{race}_{\texttt{White}}$ in the full model in @tbl-resume-full-fit is nearly identical to its coefficient in the model shown in @tbl-resume-fit.
The predictors in the experiment were thoughtfully laid out so that the coefficient estimates would typically not be much influenced by which other predictors were in the model, which aligned with the motivation of the study to tease out which effects were important to getting a callback.
In most observational data, it's common for point estimates to change a little, and sometimes a lot, depending on which other variables are included in the model.
::: {.workedexample data-latex=""}
Use the model summarized in @tbl-resume-fit to estimate the probability of receiving a callback for a job in Chicago where the candidate lists 14 years experience, no honors, no military experience, includes an email address, and has a first name that implies they are a White male.
------------------------------------------------------------------------
We can start by writing out the equation using the coefficients from the model:
$$
\begin{aligned}
log_e \left(\frac{\widehat{p}}{1 - \widehat{p}}\right) = -2.7162 &- 0.4364 \times \texttt{job\_city}_{\texttt{Chicago}} + 0.0206 \times \texttt{years\_experience} \\
&+ 0.7634 \times \texttt{honors} - 0.3443 \times \texttt{military} + 0.2221 \times \texttt{email} \\
&+ 0.4429 \times \texttt{race}_{\texttt{White}} - 0.1959 \times \texttt{sex}_{\texttt{man}}
\end{aligned}
$$
Now we can add in the corresponding values of each variable for the individual of interest:
$$
\begin{aligned}
log_e \left(\frac{\widehat{p}}{1 - \widehat{p}}\right) = - 2.7162 &- 0.4364 \times 1 + 0.0206 \times 14 \\
&+ 0.7634 \times 0 - 0.3443 \times 0 + 0.2221 \times 1 \\
&+ 0.4429 \times 1 - 0.1959 \times 1 = - 2.3955
\end{aligned}
$$
We can now back-solve for $\widehat{p}$: the chance such an individual will receive a callback is about $\frac{e^{-2.3955}}{1 + e^{-2.3955}} = 0.0835.$
:::
\clearpage
::: {.workedexample data-latex=""}
Compute the probability of a callback for an individual with a name commonly inferred to be from a Black male but who otherwise has the same characteristics as the one described in the previous example.
------------------------------------------------------------------------
We can complete the same steps for an individual with the same characteristics who is Black, where the only difference in the calculation is that the indicator variable $\texttt{race}_{\texttt{White}}$ will take a value of 0.
Doing so yields a probability of 0.0553.
Let's compare the results with those of the previous example.
In practical terms, an individual perceived as White based on their first name would need to apply to $\frac{1}{0.0835} \approx 12$ jobs on average to receive a callback, while an individual perceived as Black based on their first name would need to apply to $\frac{1}{0.0553} \approx 18$ jobs on average to receive a callback.
That is, applicants who are perceived as Black need to apply to 50% more employers to receive a callback than someone who is perceived as White based on their first name for jobs like those in the study.
:::
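The probabilities in the last two worked examples can be recomputed in R directly from the rounded coefficients in @tbl-resume-fit; the sketch below is just the same arithmetic, using `plogis()` for the back-transformation, so small rounding differences from the worked examples are expected.
```{r}
#| eval: false
# Logit-scale prediction for the candidate described above (Chicago job,
# 14 years experience, no honors, no military, email listed, White man)
logit_white <- -2.7162 - 0.4364 * 1 + 0.0206 * 14 + 0.7634 * 0 -
  0.3443 * 0 + 0.2221 * 1 + 0.4429 * 1 - 0.1959 * 1

# The same candidate with a name implying a Black man: only the race
# indicator changes from 1 to 0
logit_black <- logit_white - 0.4429

plogis(logit_white)   # about 0.0835
plogis(logit_black)   # about 0.0553
```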
What we have quantified in the current section is alarming and disturbing.
However, one aspect that makes the racism so difficult to address is that the experiment, as well designed as it is, provides little information about which particular employers are discriminating.
It is only possible to say that discrimination is happening, even if we cannot say which particular callbacks --- or non-callbacks --- represent discrimination.
Finding strong evidence of racism for individual cases is a persistent challenge in enforcing anti-discrimination laws.
## Groups of different sizes
Any form of discrimination is concerning, which is why we decided it was so important to discuss the topic using data.
The resume study also examined discrimination in only a single aspect: whether a prospective employer would call a candidate who submitted their resume.
Even in that single aspect, there was a 50% higher barrier simply because the candidate had a first name perceived to belong to a Black individual.
It's unlikely that discrimination would stop there.
::: {.workedexample data-latex=""}
Let's consider a sex-imbalanced company that consists of 20% women and 80% men, and we'll suppose that the company is very large, consisting of perhaps 20,000 employees.
(A more deliberate example would include more inclusive gender identities.) Suppose when someone goes up for promotion at the company, 5 of their colleagues are randomly chosen to provide feedback on their work.
Now let's imagine that 10% of the people in the company are prejudiced against the other sex.
That is, 10% of men are prejudiced against women, and similarly, 10% of women are prejudiced against men. Who is discriminated against more at the company, men or women?
------------------------------------------------------------------------
Let's suppose we took 100 men who have gone up for promotion in the past few years.
For these men, $5 \times 100 = 500$ random colleagues will be tapped for their feedback, of which about 20% will be women (100 women).
Of these 100 women, 10 are expected to be biased against the man they are reviewing.
That is, about 2% of the 500 colleagues reviewing these men (10 out of 500) are expected to be biased against them when they go up for promotion.
Let's do a similar calculation for 100 women who have gone up for promotion in the last few years.
They will also have 500 random colleagues providing feedback, of which about 400 (80%) will be men.
Of these 400 men, about 40 (10%) hold a bias against women.
Of the 500 colleagues providing feedback on the promotion packet for these women, 8% of the colleagues hold a bias against the women.
:::
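The arithmetic in this example is simple enough to check directly; the short sketch below recomputes the two rates under the stated assumptions (20% women, 10% of each group prejudiced against the other, 5 reviewers for each of 100 promotion cases), with the variable names chosen only for illustration.
```{r}
#| eval: false
prop_women      <- 0.20     # proportion of women in the company
prop_prejudiced <- 0.10     # proportion of each group biased against the other
reviewers       <- 5 * 100  # 5 reviewers for each of 100 promotion cases

# Rate of biased reviewers faced by men (bias comes from women reviewers)
(reviewers * prop_women * prop_prejudiced) / reviewers          # 0.02

# Rate of biased reviewers faced by women (bias comes from men reviewers)
(reviewers * (1 - prop_women) * prop_prejudiced) / reviewers    # 0.08
```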
\clearpage
The example highlights something profound: even in a hypothetical setting where each demographic has the same degree of prejudice against the other demographic, the smaller group experiences the negative effects more frequently.
Additionally, if we were to complete a handful of examples like the one above with different numbers, we would learn that the greater the imbalance in the population groups, the more disproportionately the smaller group is impacted.[^09-model-logistic-2]
[^09-model-logistic-2]: If a proportion $p$ of a company are women and the rest of the company consists of men, then under the hypothetical situation the ratio of the rates of discrimination against women versus men would be given by $(1 - p) / p,$ a ratio that is always greater than 1 when $p < 0.5$. For example, with $p = 0.2$ as in the example above, the ratio is $0.8 / 0.2 = 4$, matching the 8% versus 2% rates computed there.
Of course, there are other considerable real-world omissions from the hypothetical example.
For example, studies have found instances where people from an oppressed group also discriminate against others within their own oppressed group.
As another example, there are also instances where a majority group can be oppressed, with apartheid in South Africa being one such historic example.
Ultimately, discrimination is complex, and there are many factors at play beyond the mathematical property we observed in the previous example.
We close the chapter on the serious topic of discrimination, and we hope it inspires you to think about the power of reasoning with data.
Whether it is with a formal statistical model or by using critical thinking skills to structure a problem, we hope the ideas you have learned will help you do more and do better in life.
\vfill
## Chapter review {#sec-chp9-review}
### Summary
Logistic and linear regression models have many similarities, the strongest of which is the linear combination of explanatory variables used to form predictions related to the response variable.
However, with logistic regression the response variable is binary, so the prediction describes the probability of a successful event.
Logistic regression model fitting and variable selection can be carried out in ways similar to multiple linear regression.
### Terms
The terms introduced in this chapter are presented in @tbl-terms-chp-09.
If you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.
You should be able to easily spot them as **bolded text**.
```{r}
#| label: tbl-terms-chp-09
#| tbl-cap: Terms introduced in this chapter.
#| tbl-pos: H
make_terms_table(terms_chp_09)
```
\vfill
\clearpage
## Exercises {#sec-chp09-exercises}
Answers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-09].
::: {.exercises data-latex=""}
{{< include exercises/_09-ex-model-logistic.qmd >}}
:::