---
jupyter:
  orphan: true
  jupytext:
    notebook_metadata_filter: all,-language_info
    split_at_heading: true
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
      jupytext_version: 1.15.2
  kernelspec:
    display_name: Python 3 (ipykernel)
    language: python
    name: python3
---
# Logistic regression
```{python tags=c("hide-cell")}
import numpy as np
import pandas as pd
# Safe setting for Pandas.
pd.set_option('mode.copy_on_write', True)
import matplotlib.pyplot as plt
from jupyprint import arraytex, jupyprint
# Optimization function
from scipy.optimize import minimize
# For interactive widgets.
from ipywidgets import interact
```
In this page we will look at another regression technique: logistic regression.
We use logistic regression when we want to use a *binary categorical*
variable as our *outcome* variable. This contrasts with using a binary categorical
variable *as* a predictor variable, as we've seen on the other [textbook](https://lisds.github.io/dsip/linear_regression_categorical_predictors.html) [pages](https://lisds.github.io/dsip/multiple_predictors_statistical_adjustment.html).
As we know, a binary categorical variable is one where an observation can fall
into one of only two categories. We give each observation a label corresponding to its
category. Some examples are:
* Did a patient die or did they survive through 6 months of treatment? The
patient can be in only one of the two categories. In some column of our
data table, patients that died might have the label "died", and those who
survived might have the label "survived".
* Did a person experience more than one episode of psychosis in the last 5
years ("yes" or "no")?
* Did a person with a conviction for one offense offend again ("yes" or "no")?
For this tutorial, we return to the [chronic kidney disease
dataset]( https://lisds.github.io/textbook/data/chronic_kidney_disease), which
you have seen previously.
Each observational unit in this dataset is one patient.
For each patient, the doctors recorded whether or not the patient had chronic
kidney disease. This is a *binary categorical variable*; you can see the
values in the "Class" column. A value of 1 means the patient *did* have CKD; a
value of 0 means they *did not*. Remember that these numbers are not true numbers:
they are labels that denote category membership, but by using numbers as labels
we can "trick" a regression model into including the variable.
Many of the rest of the columns in the dataframe are measurements from blood tests and urine tests.
```{python}
# read in the data
df = pd.read_csv('data/ckd_clean.csv')
df.head(20)
```
There are actually a large number of binary categorical variables in this
dataset. For example, the "Hypertension" column has labels for the two
categories "yes" (the patient had persistently high blood pressure) or "no".
The categorical variable we are interested in here is "Appetite". This has
the label "good" for patients with good appetite, and "poor" for those with
poor appetite. Poor appetite is a [symptom of chronic kidney
disease](https://www.sciencedirect.com/science/article/abs/pii/S0270929508001666). In our case, we wonder whether the extent of kidney damage does a convincing job in predicting whether the patient has a "good" appetite.
The CKD dataset has a column "Hemoglobin" that has the
concentration of hemoglobin from a blood sample. Hemoglobin is the molecule
that carries oxygen in the blood; it is the molecule that makes the red blood
cells red. Damaged kidneys produce lower concentrations of the hormone that
stimulates red blood cell production,
[erythropoietin](https://en.wikipedia.org/wiki/Erythropoietin), so CKD
patients often have fewer red blood cells, and lower concentrations of
Hemoglobin. We will take lower "Hemoglobin" as an index of kidney damage.
Therefore, we predict that patients with lower "Hemoglobin" values are more
likely to have `poor` "Appetite" values, and, conversely, patients with higher
"Hemoglobin" values are more likely to have `good` "Appetite" values.
First we make a new data frame that just has the two columns we are interested
in:
```{python}
# isolate only the variables of interest
hgb_app = df.loc[:, ['Hemoglobin', 'Appetite']]
# rename the columns for easier typing
hgb_app = hgb_app.rename(columns = {'Hemoglobin': 'hemoglobin',
'Appetite' : 'appetite'})
hgb_app.head(20)
```
## Dummy Variables
The model we want to fit here is:
`appetite` ~ `hemoglobin`
Which we read as "we want to model `appetite` as a function of `hemoglobin`".
We will soon find ourselves wanting to do calculations on the values in the
`appetite` column, and we cannot easily do that with the current string values
of "good" and "poor", for the reasons we've seen previously, on the categorical
predictors [page](https://lisds.github.io/dsip/linear_regression_categorical_predictors.html).
Just as when we use a categorical variable as a predictor,
our next step is to recode the string values to
numbers, ready for our calculations. We use 1 to represent "good" and 0 to
represent "poor". This kind of recoding, where we replace category labels
with 1 and 0 values, is often called *dummy coding*. The notation for the current
dummy-coding scheme is:
`appetite_dummy` $ = \begin{cases} 1, & \text{if the appetite of observation $i$ is "good"} \\ 0, & \text{if the appetite of observation $i$ is "poor"} \end{cases} $
```{python}
# 'replace' replaces the values in the first argument with the corresponding
# values in the second argument.
hgb_app['appetite_dummy'] = hgb_app['appetite'].replace(['poor', 'good'],
[0, 1])
hgb_app.head(20)
```
*Note*: When you are doing this, be sure to keep track of which label you have
coded as 1. Normally this would be the more interesting outcome. In this
case, "good" has the code 1. The label corresponding to 1 matters, because it
will affect the interpretation of the regression coefficients.
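As a quick check, we can cross-tabulate the original labels against the new dummy codes, to confirm which label got which code:
```{python}
# Check the label-to-code mapping: "good" should map to 1, "poor" to 0.
pd.crosstab(hgb_app['appetite'], hgb_app['appetite_dummy'])
```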
Now we have the dummy (1 or 0) variable, let us use a scatter plot to look at
the relationship between hemoglobin concentration and whether a patient has a
good appetite.
```{python tags=c("remove-input")}
# We will use this plotting code several times, so put into a function for
# later use.
def plot_hgb_app():
# Build plot, add custom labels.
colors = hgb_app['appetite'].replace(['poor', 'good'], ['red', 'blue'])
hgb_app.plot.scatter('hemoglobin', 'appetite_dummy', c=colors)
plt.ylabel('Appetite\n0 = poor, 1 = good')
plt.yticks([0,1]); # Just label 0 and 1 on the y axis.
# Put a custom legend on the plot. This code is a little obscure.
plt.scatter([], [], c='blue', label='good')
plt.scatter([], [], c='red', label='poor')
# Do the plot
plot_hgb_app()
# Show the legend
plt.legend();
```
From the plot, it does look as if the patients with lower hemoglobin are more
likely to have poor appetite (`appetite_dummy` values of 0), whereas patients
with higher hemoglobin tend to have good appetite (`appetite_dummy` values of
1).
Now we start to get more formal, and develop a model with which we predict the
`appetite_dummy` values from the `Hemoglobin` values.
## How about linear regression?
Remember that, in linear regression, we predict scores on the *outcome*
variable (or vector) using a straight-line relationship of the *predictor*
variable (or vector).
Here are the predictor and outcome variables in our case - we'll store these
as separate variables to save some typing:
```{python}
# The x (predictor) and y (outcome) variables.
hemoglobin = hgb_app['hemoglobin']
appetite_d = hgb_app['appetite_dummy']
```
Why not use linear regression for the present data? After all, our *outcome variable* (`appetite_d`) and our *predictor variable* (`hemoglobin`) here are both sequences of numbers, with one value for each observational unit (row) in the dataset. The linear model here is shown below, using the first 20 values of each vector (to stop the printout being unwieldy):
```{python}
# do not worry about this code, it just generates the notation below
jupyprint("Here is the our model ($ \\vec{y} = b \\vec{x} + \\text{c} + \\vec{\\varepsilon} $), showing the first 10 actual values within the `hemoglobin` and `appetite_d` vectors:")
jupyprint(f"${arraytex(np.atleast_2d(appetite_d[:20]).T)} = b * {arraytex(np.atleast_2d(hemoglobin[:20]).T)} + c +" +" \\vec{\\varepsilon}$")
```
Let's compare the values in the vectors above to the first 20 values of each variable in the `hgb_app` dataframe, just to
assure ourselves they are the same:
```{python}
# for comparison with the vectors above
hgb_app[['appetite_dummy', 'hemoglobin']].iloc[:20]
```
This looks like something our linear regression machinery will be able to work with.
In contrast to the vector printout above, we actually have many more than 20 observational units:
```{python}
jupyprint(f'Rows in CKD data: {len(hgb_app)}')
```
We are familiar now with performing linear regression to
find the value of the slope and intercept of the line which gives the smallest
sum of the squared prediction errors.
Recall that, in linear regression:
$ \text{(predicted scores)} = b * \text{(predictor variable)} + c $
Or, in mathematical notation:
$\Large \vec{\hat{y}} = b \vec{x} + \text{c}$
As we know, there will be one prediction ($\hat{y}$ value) for each score on the
predictor variable (`hemoglobin`), so the predictions and the predictor scores
have the same length. By contrast, the slope and intercept are single values, defining
the line.
We used `minimize` to find the values of the slope and intercept which give the
"best" predictions. So far, we have almost invariably defined the *best*
values for slope and intercept as the values that give the smallest sum of the
squared prediction errors.
$\text{(prediction errors)} = \text{(actual scores) - (predicted scores)}$
Or, in mathematical notation:
$$
\vec{\varepsilon}= \begin{bmatrix}
\varepsilon_1 \\
\varepsilon_2 \\
\vdots \\
\varepsilon_n
\end{bmatrix} = \begin{bmatrix}
y_1 - \hat{y}_1 \\
y_2 - \hat{y}_2 \\
\vdots \\
y_n - \hat{y}_n
\end{bmatrix} =
\begin{bmatrix}
y_1 - (b x_1 + c)\\
y_2 - (b x_2 + c) \\
\vdots \\
y_n - (b x_n + c)
\end{bmatrix}
$$
What would happen if we tried to use linear regression to predict the
appetite scores, based on hemoglobin concentrations?
Let us start by grabbing our familiar `ss_any_line` function from the
[Using minimize
page](https://lisds.github.io/textbook/mean-slopes/using_minimize).
```{python}
def ss_any_line(intercept_and_slope, x_values, y_values):
    # intercept_and_slope is a list containing two elements, an intercept and a slope.
    intercept, slope = intercept_and_slope
    # Values predicted from these x_values, using this intercept and slope.
    predicted = intercept + x_values * slope
    # Difference of prediction from the actual y values.
    error = y_values - predicted
    # Sum of squared error.
    return np.sum(error ** 2)
```
As we know, the sum of the squared prediction error, in linear regression, is our *cost*
function. When we have a good pair of (intercept, slope) in `intercept_and_slope`, our function
is *cheap* - i.e. the returned value is small. When we have a bad pair in
`intercept_and_slope`, our function is *expensive* - the returned value is large.
If the value from `ss_any_line` is large, it means the line we are fitting
does not fit the data well. The purpose of linear regression is to find the
line which leads to the smallest cost. In our case, the cost is the sum of the
squared prediction errors.
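To make this concrete, here is a small illustration. The two (intercept, slope) pairs are arbitrary choices for demonstration: one roughly plausible given the plot above, one clearly bad.
```{python}
# A plausible pair gives a lower (cheaper) cost than an obviously bad pair.
# The specific values here are arbitrary, chosen only for demonstration.
plausible_cost = ss_any_line([-1, 0.1], hemoglobin, appetite_d)
bad_cost = ss_any_line([5, -2], hemoglobin, appetite_d)
print('Cost for plausible pair:', plausible_cost)
print('Cost for bad pair:', bad_cost)
```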
Let's use linear regression on the current example. From looking at our plot
above, we start with a guess of -1 for the intercept, and 0.1 for the slope.
```{python}
# Use minimize to find the least sum of squares solution.
min_lin_reg = minimize(ss_any_line, [-1, 0.1], args=(hemoglobin, appetite_d))
# Show the results that came back from minimize.
min_lin_reg
```
OK, so that looks hopeful. Using linear regression with `minimize` we found that the sum of squared prediction errors was smallest for a line with:
```{python}
# Unpack the slope and intercept estimates from the results object.
lin_reg_intercept, lin_reg_slope = min_lin_reg.x
# Show them.
jupyprint(f'Best linear regression intercept <b> {lin_reg_intercept.round(2)}</b>')
jupyprint(f'Best linear regression slope <b> {lin_reg_slope.round(2)} </b>')
```
Let's calculate the predicted scores, from the now familiar formula:
$\vec{\hat{y}} = b \vec{x} + \text{c}$
Which, in python terms, in the current case, is:
```
predicted_scores = lin_reg_slope * hemoglobin + lin_reg_intercept
```
```{python}
# calculate the predicted values
predicted_lin_reg = lin_reg_slope * hemoglobin + lin_reg_intercept
predicted_lin_reg
```
Let's plot our predictions, alongside the actual data. We plot the predictions
from linear regression in orange.
```{python tags=c("remove-input")}
# Do the base plot of the hemoglobin and appetite_d.
plot_hgb_app()
# A new plot on top of the old.
plt.scatter(hemoglobin, predicted_lin_reg,
label='Linear regression prediction $(\hat{y}$)',
color='orange')
# Another plot, to show the underlying line
fine_x = np.linspace(np.min(hemoglobin), np.max(hemoglobin), 1000)
fine_y = lin_reg_intercept + lin_reg_slope * fine_x
plt.plot(fine_x, fine_y, linewidth=1, linestyle=':')
# Show the legend.
plt.legend(loc=(1.1, 0.5));
```
<i> <b> Question: </b> can anyone see any issues with our predictions here? </i>
The linear regression line looks plausible, as far as it goes, but it has
several unhappy features for our task of predicting the `appetite_d` 0 or 1
values.
One thing to like about the line is that the predictions are right to suggest
that the value of `appetite_d` is more likely to be 1 (meaning "good") at
higher values of `hemoglobin`. Also, the prediction line slopes upward as
`hemoglobin` gets higher, indicating that the probability of good appetite gets
higher as the hemoglobin concentration rises, across patients.
However, when the `hemoglobin` gets higher than about 15.5, linear regression
starts to predict a value for `appetite_d` that is greater than 1 - which, of
course, cannot occur in the `appetite_d` values, which are restricted to 0 or
1.
In fact, if we make predictions from this model for sufficiently low hemoglobin
values, we find that our predictions for `appetite_d` also fall *below* 0, which,
again, is not a possible value of `appetite_d`:
```{python}
# Do the base plot of the hemoglobin and appetite_d.
plot_hgb_app()
# A new plot on top of the old.
plt.scatter(np.linspace(0, hemoglobin.max()),
            lin_reg_intercept + lin_reg_slope * np.linspace(0, hemoglobin.max()),
            label=r'Linear regression prediction ($\hat{y}$)',
            color='orange')
# Another plot, to show the underlying line
fine_x = np.linspace(0, np.max(hemoglobin), 1000)
fine_y = lin_reg_intercept + lin_reg_slope * fine_x
plt.plot(fine_x, fine_y, linewidth=1, linestyle=':')
# Show the legend.
plt.legend(loc=(1.1, 0.5));
```
The problem here is that linear regression can produce predicted values ($\hat{y}$) ranging from negative infinity to positive infinity. One way of saying this is that the predictions from linear regression are unconstrained or unbounded.
However, the outcome variable we are trying to predict (`appetite_d`) *is* constrained/bounded: it can only take values of either 0 or 1.
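We can see this numerically by plugging a few extreme, hypothetical hemoglobin values into the fitted line, and observing that the predictions stray outside the 0-to-1 range:
```{python}
# Hypothetical extreme hemoglobin values, to show the unbounded predictions.
extreme_hgb = np.array([0, 5, 20, 40])
lin_reg_intercept + lin_reg_slope * extreme_hgb
```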
The plot below shows this problem, by showing the linear regression line for very large and very small values of `hemoglobin`, based on the model parameters:
```{python}
# Do the base plot of the hemoglobin and appetite_d.
plot_hgb_app()
# A new plot on top of the old.
plt.scatter(np.linspace(0, hemoglobin.max()),
            lin_reg_intercept + lin_reg_slope * np.linspace(0, hemoglobin.max()),
            label=r'Linear regression prediction ($\hat{y}$)',
            color='orange')
# Another plot, to show the underlying line
fine_x_big = np.linspace(-40, 50, 1000)
fine_y_big = lin_reg_intercept + lin_reg_slope * fine_x_big
plt.plot(fine_x_big, fine_y_big, linewidth=1, linestyle=':')
# Show the legend.
plt.legend(loc=(1.1, 0.5));
```
These reflections might make us wonder whether we should be using something other
than a simple, unconstrained straight line for our predictions.
## Another prediction line
Here's another prediction line we might use for `appetite_d`, with the
predicted values. For the moment, let's not worry about how we came by this line; we will come
on to that soon. (If it wasn't obvious, this is the sort of prediction line we get from
logistic regression!)
We will come to the mechanics of how we get such a prediction line later,
but for now, we will only say it involves a different cost function
to the one we use in linear regression.
For now, let's appreciate why the new line might work better than the linear regression line, for the
current data.
The new prediction line is in gold.
```{python tags=c("hide-cell")}
# This is the machinery for making the sigmoid line of the plots below. We
# will come on that machinery soon. For now please ignore this code, and
# concentrate on the plots below.
def inverse_logit(y):
""" Reverse logit transformation
"""
odds = np.exp(y) # Reverse the log operation.
return odds / (odds + 1) # Reverse odds operation.
def params_to_predicted_probabilities(intercept, slope, x):
""" Calculate predicted probability of of `dummy_variable == 1` for each observation.
"""
# Predicted log odds of being in class 1.
predicted_log_odds = intercept + slope * x
return inverse_logit(predicted_log_odds)
# Some plausible values for intercept and slope.
nice_intercept, nice_slope = -7, 0.8
predictions_new = params_to_predicted_probabilities(nice_intercept, nice_slope, hemoglobin)
# Do the base plot of the hemoglobin and appetite_d.
plot_hgb_app()
# A new plot on top of the old.
plt.scatter(hemoglobin, predictions_new,
label='New prediction \n(from logistic regression)',
color='gold')
# Another plot, to show the underlying line
fine_y_sigmoid = params_to_predicted_probabilities(nice_intercept, nice_slope, fine_x)
plt.plot(fine_x, fine_y_sigmoid, linewidth=1, linestyle=':')
# Show the legend.
plt.legend(loc=(1.1, 0.5))
plt.title('A different prediction line');
```
The new not-straight line - from now on let's call it a curve - seems to have much to recommend it. This shape of
curve is called "sigmoid", from "sigma", the name of the Greek letter "s".
One nice feature of the sigmoid curve is that the sigmoid
prediction here never goes above 1 or below 0, so its values are always in the
range of the `appetite_d` data it is trying to predict.
The sigmoid curve climbs steeply to a prediction of 1, and plateaus there, as we get to the
threshold of hemoglobin around 12.5, at which every patient does seem to have
"good" appetite (`appetite_d` of 1). This is a nice feature, as the predictions
at higher values of `hemoglobin` fit the data better than the linear regression line did.
We can think of the values from the sigmoid curve as being *predicted
probabilities* of an observational unit falling in whichever category we
have dummy coded as 1.
So in the present case, the curve shows us the predicted probability of having
a good appetite, for each observational unit (patient), given that patient's
`hemoglobin` score.
For example, at a `hemoglobin` value of 10, the curve gives a
predicted y (`appetite_d`) value of about 0.73. We can interpret this
prediction as saying that, with a hemoglobin value of 10, there is a
*probability* of about 0.73 that the corresponding patient will have a "good"
appetite (`appetite_d` value of 1).
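We can check this number directly, using the curve-drawing function and the illustrative intercept and slope from the hidden cell above:
```{python}
# Predicted probability of good appetite at a hemoglobin value of 10,
# using the same illustrative intercept and slope as the gold curve above.
params_to_predicted_probabilities(nice_intercept, nice_slope, 10)
```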
Red dashed lines have been added to the plot, to show this prediction:
```{python}
# Do the base plot of the hemoglobin and appetite_d.
plot_hgb_app()
plt.yticks(np.linspace(0, 1, 6).round(2))
plt.ylabel('Predicted Probability \n (`appetite_d == 1`)')
# A new plot on top of the old.
plt.scatter(hemoglobin, predictions_new,
label='New prediction',
color='gold')
# Another plot, to show the underlying line
fine_y_sigmoid = params_to_predicted_probabilities(nice_intercept, nice_slope, fine_x)
plt.plot(fine_x, fine_y_sigmoid, linewidth=1, linestyle=':')
plt.plot([10, 10],
[0, params_to_predicted_probabilities(nice_intercept, nice_slope, 10)],
color='red', linestyle='--')
plt.plot([10, 0],
[params_to_predicted_probabilities(nice_intercept, nice_slope, 10),
params_to_predicted_probabilities(nice_intercept, nice_slope, 10)],
color='red', linestyle='--')
# Show the legend.
plt.legend();
```
The sigmoid curve, which we get via logistic regression, is exploiting the fact that probabilities range between 0 and 1,
and that our outcome variable is also constrained to be either 0 or 1. This
means we can meaningfully show the prediction line (showing the predicted
probabilities) and the outcome variable scores (0 or 1) on the same plot. (We will come
on to the mechanics of the fitting process later in the page).
Because of these appealing features, let's say we do want to use this kind of
sigmoid line to predict `appetite_d`. So far, we have only asked `minimize`
to predict directly from a straight line - for example, in the `ss_any_line` function.
How can we get `minimize` to predict from a family of sigmoid curves, as here?
We'd like to keep the machinery here as familiar as possible. So, ideally we'd like
to send a slope and intercept pair to `minimize`. How might we do that for a sigmoid curve,
given that a slope and intercept denote a straight line, rather than a curve?
We would need a conversion to transform a sigmoid curve like the one here, with y values from 0 to 1,
into a straight line, where the y values can vary from a large negative number to a large
positive number. Doing such a conversion would give us a slope and
intercept that `minimize` can work with easily.
Fortunately, our mathematician friends can assure us that we can in fact go from our sigmoid curve (ranging
between 0 and 1) to a straight line with unconstrained y values, in two fairly simple steps. The
next sections will cover those steps. The two steps are:
* Convert the 0-to-1 *probability* predictions to
predictions of the *odds*. The odds can vary from 0 to a very
large positive number.
* Apply the *logarithm* function to convert the 0-to-very-large-positive
odds predictions to *log* odds predictions, which can vary from a
very large negative number to a very large positive number.
(Do not worry if these steps don't make sense yet - we'll go over them in more detail below).
These two transformations together are called the *log-odds* or
[logit](https://en.wikipedia.org/wiki/Logit) transformation. *Logistic
regression* is regression using the *logit* transform. Applying the logit
transform converts the sigmoid curve to a straight line.
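As a small numeric preview, take the curve's prediction of roughly 0.73 at a hemoglobin value of 10 (from the illustrative intercept of -7 and slope of 0.8 above): converting that probability to odds and then taking the log recovers the straight-line value $-7 + 0.8 \times 10 = 1$, give or take rounding.
```{python}
# Numeric preview of the two logit steps, for one illustrative probability.
p_example = 0.7311    # roughly the curve's prediction at hemoglobin = 10
odds_example = p_example / (1 - p_example)   # step 1: probability -> odds
log_odds_example = np.log(odds_example)      # step 2: odds -> log odds
odds_example, log_odds_example
```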
We will explain more about the two stages of the transform below, but for now, here are the two stages in action, shown graphically.
This is the original sigmoid curve above, with the predictions, in its own
plot:
```{python tags=c("remove-input")}
# show a plot of the probability prediction curve
plt.scatter(hemoglobin, predictions_new, color='gold')
plt.plot(fine_x, fine_y_sigmoid, linewidth=1, linestyle=':')
plt.title('Sigmoid probability prediction')
plt.xlabel('Hemoglobin')
plt.ylabel('Probability prediction');
```
Next we apply the conversion which converts probability to odds:
```{python tags=c("remove-input")}
# show the predictions, on the odds scale
predictions_or = predictions_new / (1 - predictions_new)
plt.scatter(hemoglobin, predictions_or, color='gold')
fine_y_or = fine_y_sigmoid / (1 - fine_y_sigmoid)
plt.plot(fine_x, fine_y_or, linewidth=1, linestyle=':')
plt.title('Odds prediction')
plt.xlabel('Hemoglobin')
plt.ylabel('odds');
```
Notice that this is an *exponential* graph, where the y values increase more
and more steeply as the x values increase. We can turn exponential lines like
this one into straight lines, using the *logarithm* function, the next stage of
the logit transformation:
```{python}
# show the predictions on the log odds scale
predictions_or_log = np.log(predictions_or)
plt.scatter(hemoglobin, predictions_or_log, color='gold')
fine_y_or_log = np.log(fine_y_or)
plt.plot(fine_x, fine_y_or_log, linewidth=1, linestyle=':')
plt.title('Logit prediction')
plt.xlabel('Hemoglobin')
plt.ylabel('Log odds');
```
We can do this conversion *in either direction*:
* straight line -> sigmoid curve
* sigmoid curve -> straight line
**This has the important consequence that** we can get `minimize` to find us a straight line,
then we can use our conversion to go back to the sigmoid curve. The sigmoid curve is on the original scale
of the data, where the outcome variable ranges between 0 and 1. We can then calculate error values
from the sigmoid curve and the original data points.
Pause here and re-read the last paragraph. It is very important to understand
*why* we are doing these transformations, so as not to get lost in the detail.
To reiterate, these are the steps (a rough sketch of this loop in code is shown below):
- we (or `minimize`) generate a straight line (via a slope/intercept pairing)
- we transform this straight line into a sigmoid curve
- we use a goodness-of-fit metric (like error scores) to assess the fit of the sigmoid curve
- we repeat this process until we (or `minimize`) find the best-fitting sigmoid curve
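Here is that loop as a rough sketch only. To keep things familiar, the sketch uses the sum of squared errors as the goodness-of-fit metric; the cost function actually used for logistic regression is different, and comes later in the page.
```{python}
# Rough sketch of the fitting loop, for illustration only.
# We use sum of squared error here just to show the structure of the loop;
# the cost function actually used for logistic regression comes later.
def sketch_cost(intercept_and_slope, x_values, y_values):
    intercept, slope = intercept_and_slope
    # Straight line -> sigmoid curve of predicted probabilities
    # (using the function from the hidden cell above).
    sigmoid_predictions = params_to_predicted_probabilities(intercept, slope, x_values)
    # Assess the fit of the sigmoid curve to the 0 / 1 data.
    error = y_values - sigmoid_predictions
    return np.sum(error ** 2)

# minimize repeats the generate / transform / assess steps to find the
# best-fitting sigmoid curve (under this illustrative cost).
minimize(sketch_cost, [-7, 0.8], args=(hemoglobin, appetite_d))
```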
The next few sections go into more detail on the odds and logarithm transformation steps.
## Probability and Odds
Because logistic regression uses probability and odds, we will go over these
foundational concepts in this section.
Specifically, for logistic regression, in contrast to linear regression, we are interested in
predicting the *probability of an observation falling into a particular outcome
category* (0 or 1).
In our case `0 == poor appetite` and `1 == good appetite`.
So, we are interested in the probability of a patient having `good`
appetite, predicted from the patient's `hemoglobin` score.
We can think of probability as the *proportion of times* we expect to see a
particular outcome.
For example, there are 139 patients with `good` appetite in this data frame,
and 158 patients in total. If you were to repeatedly draw a single patient at
random from the data frame, and record their `appetite`, then we expect the proportion of "good" values, in the long run, to be $\Large \frac{139}{158}$ --- which is about $0.88$. That is the same as saying there is a probability of 0.88 that a randomly-drawn patient will have a `good` appetite.
```{python}
# the probability of having a `good` appetite
hgb_app['appetite_dummy'].value_counts()/len(hgb_app)
```
Because the patient's appetite can only be `good` or `poor`, and because the
probabilities of all possible options have to add up to 1, the probability of
the patient having a `poor` appetite is $1 - 0.88$, which is about $0.12$.
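We can confirm this from the data:
```{python}
# Probability of poor appetite is one minus the probability of good appetite.
1 - (hgb_app['appetite'] == 'good').mean()
```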
So, the probability can express the *proportion of times* we expect to see some
*event of interest* - in our case, the event of interest is "good" in the
`appetite` column:
$\Large \text{probability} = \frac{\text{number of things of interest}}{\text{total number of things}}$
We can think of this same information as an *odds*.
We often express probabilities as odds. To break this down, for clarity let's think of a scenario with
some nice, clean probabilities.
Imagine there are 100 people standing in a room. 20 of them are civil servants. In this scenario, our *event of interest* is "being a civil servant".
If we picked a person at random, the probability they will be a civil servant is:
$\text{Prob(Civil Servant}) = \Large \frac{20}{100}$.
By contrast, the odds is the number of times we expect to see the event of interest
(e.g. being a civil servant) for every time we expect to see the event of no interest (NOT
being a civil servant):
$ \Large \text{odds} = \frac{\text{number of things of interest}}{\text{number of things NOT of interest}}$
In the civil servant example, the *odds* of being a civil servant, if we pick a person at random, are:
$\text{Odds(Civil Servant}) = \Large \frac{20}{80}$.
This is just a different way of expressing a probability, and we can convert easily between probabilities and odds.
To convert from a probability to an odds, we remember that the odds is the number of times we expect to see the event of interest for every time we expect to see the event of no interest. This is the probability (proportion) for the event of interest, divided by the probability (proportion) of the event of NO interest.
We get the probability of events of no interest by subtracting the probability of an event of interest from 1.
Let's look at the probability $p$:
```{python}
# Our probability of interest
p = 20/100
p
```
In mathematical notation:
$ \Large \text{probability} = \frac{\text{number of things of interest}}{\text{total number of things}} = \frac{20}{100} = 0.2$
The equivalent odds here is:
$\Large \text{odds} = \frac{\text{number of things of interest}}{\text{number of things NOT of interest}} = \frac{20}{80} = 0.25$
We can also calculate the odds using this formula:
$\Large \text{odds} = \frac{\text{Prob(Event of Interest)}}{1 - \text{Prob(Event of Interest)}}$
```{python}
# odds is proportion of interest, divided by proportion of no interest.
odds = p / (1 - p)
odds
```
$(1 - p)$ gives the probability of events which are not of interest (in this case, this is people who are NOT civil servants):
$\text{Prob(Civil Servant}) = \Large \frac{20}{100}$
$\text{Prob(NOT Civil Servant}) = \Large \frac{80}{100}$
Let's demonstrate this in python:
```{python}
(1 - p) == 80/100
```
When you are interpreting odds:
**Odds of 1 indicate that the event of interest is equally as likely as events NOT of interest.**
**Odds of less than 1 indicate that the event of interest is less likely to happen than
events NOT of interest.**
**Odds of greater than 1 indicate that the event of interest is more likely to
happen than events NOT of interest.**
**If the probability of an event of interest is 0, then the odds are 0.**
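For example (probabilities chosen just for illustration), probabilities of 0.25, 0.5 and 0.75 give odds below, at, and above 1, respectively:
```{python}
# Illustrative probabilities, and the corresponding odds.
example_p = np.array([0.25, 0.5, 0.75])
example_p / (1 - example_p)
```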
You can use the slider in the output of the cell below to set the probability of a hypothetical event of interest.
(If you want a probability of 50%, set the slider to 50, if you want a probability of 30%, set the slider to 30 and so on).
The printout will then allow you to relate the probability to the odds of the event of interest:
```{python}
def odds_interact(probability_percent=50):
    assert (probability_percent >= 0) & (probability_percent <= 99.99), "You must enter a percentage between 0 and 99.99!"
    p = probability_percent / 100
    jupyprint(f"If a probability of an event of interest is **{probability_percent}%**:")
    jupyprint("$\\text{Prob(Event of Interest)} = \\frac{" + f"{probability_percent}" + "}{100}$")
    jupyprint(f"Then the odds of the event of interest are **{round(p / (1 - p), 4)}**")
    jupyprint("$\\text{Odds(Event of Interest)} = \\frac{" + f"{probability_percent}" + "}{" + f"{100 - probability_percent}" + "}$")

interact(odds_interact, probability_percent=(0, 99, 1))
```
As the probability of an event of interest gets very close to 1, the odds get very large - this means the event of interest is many, many times more likely than events of no interest:
```{python}
p_approaching_1 = 0.9999999999999999
large_odds = p_approaching_1 / (1 - p_approaching_1)
large_odds
```
Remember the example above where the probability of being a civil servant was
$\frac{20}{100}$? We can also convert from odds to probabilities. Remember the odds is
the number of times we expect to see the event of interest, divided by the
number of times we expect to see the event of no interest. The probability is
the proportion of events of interest out of all events.
We can convert back to the probability from the odds, using this formula:
$ \Large p = \frac{\text{odds}}{1 + \text{odds}} = \frac{0.25}{1.25} = 0.2 $
```{python}
# convert from odds to probability
p_from_odds = odds / (1 + odds)
p_from_odds
```
In summary, we can convert probabilities to odds with:
```{python}
# convert from probability to odds
odds = p / (1 - p)
odds
```
And we can convert odds to probabilities with:
```{python}
# convert odds to probabilities
p = odds / (odds + 1)
p
```
As you've seen, for the current data, and for the sigmoid probability curve we showed you above, when we convert the probability
values to odds, we get the following (this is the graphical perspective on the probability-to-odds transformation just shown above):
```{python tags=c("remove-input")}
# show a plot of the probability prediction curve
plt.figure(figsize=(8, 4))
plt.subplot(1, 2, 1)
plt.scatter(hemoglobin, predictions_new, color='gold')
plt.plot(fine_x, fine_y_sigmoid, linewidth=1, linestyle=':')
plt.title('Sigmoid probability prediction')
plt.xlabel('Hemoglobin')
plt.ylabel('Probability prediction')
# converting from probabilities to odds
plt.subplot(1, 2, 2)
predictions_or = predictions_new / (1 - predictions_new)
plt.scatter(hemoglobin, predictions_or,
color='gold')
fine_y_or = fine_y_sigmoid / (1 - fine_y_sigmoid)
plt.plot(fine_x, fine_y_or, linewidth=1, linestyle=':')
plt.title('odds prediction')
plt.xlabel('Hemoglobin')
plt.ylabel('odds');
```
Notice that the odds we found vary from very close to 0 (for probability
predictions very close to 0) to very large (for probability
predictions very close to 1).
Notice too that our graph looks exponential, and we want it to be a straight
line (so we can pass a slope/intercept pair to `minimize`).
Our next step is to apply a *logarithm* transformation.
## The logarithm transform
See the [logarithm refresher](https://lisds.github.io/textbook/more-regression/logarithms_refreshed) page for
more background on logarithms.
For now, **the only thing you need to know about logarithms is that they are
transformations that convert an exponential curve into a straight line**.
However, a very brief description is given just below. *We strongly recommend reading
the rest of the page first (skip to the next graph we show), and then come back
and read this brief explanation of logarithms*.
Essentially, a logarithm is an answer to the question "given a number, what power
would I have to raise a specific other number to, in order to get the first number?".
So, let's call that other number (the base) $a$, raise it to the power $n$, and call the result
$y$:
$\Large a^n = y$
Here's an example:
```{python}
a = 10
n = 2
y = a**n
y
```
A logarithm tells us, given our value of $a$, what power we would have to raise $a$ to, to get $y$.
We can write this as:
$\Large \log_{a}(y) = n$
You can read this as "if you give me the number $y$, what power would I have to raise the number $a$ to, in order to get $y$?".
As you can imagine, `numpy` functions exist to calculate logs:
```{python}
np.log10(y)
```
```{python}
np.log10(y) == n
```
The most important thing to understand is that logarithms transform an exponential graph into a straight line:
```{python tags=c("remove-input")}
n = np.linspace(0, 10, 1000)
y = a ** n
fig, axes = plt.subplots(1, 2)
ax0, ax1 = axes
ax0.plot(n, y)
ax0.set_xlabel('n')
ax0.set_ylabel('y')
ax0.set_title('Exponential Graph \n $y=10^n$')
ax1.set_xlabel('n')
ax1.set_ylabel('log(y)')
ax1.plot(n, np.log10(y))
ax1.set_title(r'Log of $10^n$');
```
You have already seen logs in action transforming the odds predictions.
```{python tags=c("remove-input")}
fig, axes = plt.subplots(1, 2)
ax0, ax1 = axes
ax0.scatter(hemoglobin, predictions_or, color='gold')
ax0.plot(fine_x, fine_y_or, linewidth=1, linestyle=':')
ax0.set_title('Odds')
ax0.set_xlabel('Hemoglobin')
ax0.set_ylabel('odds');
ax1.scatter(hemoglobin, predictions_or_log, color='gold')
ax1.plot(fine_x, fine_y_or_log, linewidth=1, linestyle=':')
ax1.set_title('Log odds')
ax1.set_xlabel('Hemoglobin')
ax1.set_ylabel('Log odds');
```
## The logit transform and its inverse
The logit transformation from the sigmoid curve to the straight line consists
of two steps:
* Convert probability to odds.
* Take the log of the result.
The full logit transformation is therefore:
```{python}
def logit(p):
    """ Apply logit transformation to array of probabilities `p`
    """
    odds = p / (1 - p)
    return np.log(odds)
```
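For instance, a probability of 0.5 corresponds to odds of 1, and so to log odds of 0; probabilities above 0.5 give positive log odds, and probabilities below 0.5 give negative log odds:
```{python}
# Quick check of logit at a few probabilities.
logit(np.array([0.25, 0.5, 0.75]))
```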
Here are the original sigmoid predictions and the predictions with the `logit` transform applied:
```{python tags=c("remove-input")}
fig, axes = plt.subplots(1, 2)
ax0, ax1 = axes
ax0.scatter(hemoglobin, predictions_new, color='gold')
ax0.plot(fine_x, fine_y_sigmoid, linewidth=1, linestyle=':')
ax0.set_title('Sigmoid p')
ax0.set_xlabel('Hemoglobin')
ax0.set_ylabel('Sigmoid p values');
ax1.scatter(hemoglobin, logit(predictions_new), color='gold')
ax1.plot(fine_x, logit(fine_y_sigmoid), linewidth=1, linestyle=':')
ax1.set_title('Logit on sigmoid')
ax1.set_xlabel('Hemoglobin')
ax1.set_ylabel('Log odds');
```
We also want to be able to go backwards, from the straight-line predictions, to
the sigmoid predictions.
`np.exp` reverses (inverts) the `np.log` transformation (see the [logarithm
refresher](https://lisds.github.io/textbook/more-regression/logarithms_refreshed)
page):
```{python}
# np.exp reverses the effect of np.log.
# let's take some values...
some_values = np.array([1, 0.5, 3, 6, 0.1])
some_values
```
```{python}
# let's find the logarithm of these values...
logged_values = np.log(some_values)
logged_values
```
```{python}
# np.exp reverses the effect of np.log
values_back = np.exp(np.log(some_values))
values_back
```
You have seen above that there is a simple formula to go from odds to
probabilities. The transformation that *reverses* (inverts) the logit
transform is therefore:
```{python}
def inverse_logit(v):
    """ Reverse logit transformation on array `v`
    """
    odds = np.exp(v)  # Reverse the log operation.
    return odds / (odds + 1)  # Reverse odds operation.
```
`inverse_logit` takes points on a straight line, and converts them to points on a
sigmoid.
First we convince ourselves that `inverse_logit` does indeed reverse the `logit`
transform:
```{python}
# some probability values
some_p_values = np.array([0.01, 0.05, 0.1, 0.5, 0.9, 0.95, 0.99])
some_p_values
```
```{python}
# we apply the logit transformation
some_log_odds = logit(some_p_values)
some_log_odds
```
```{python}
# we use the inverse logit transformation to go back to the original values
back_again = inverse_logit(some_log_odds)
print('Logit, then inverse_logit returns the original data:')