-
Notifications
You must be signed in to change notification settings - Fork 11
/
11-missing-data.Rmd
441 lines (329 loc) · 24 KB
/
11-missing-data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
# Missing data {#c11-missing-data}
\index{Missing data|(}
```{r}
#| label: missing-styler
#| include: false
knitr::opts_chunk$set(tidy = 'styler')
```
::: {.prereqbox-header}
`r if (knitr:::is_html_output()) '### Prerequisites {- #prereq11}'`
:::
::: {.prereqbox data-latex="{Prerequisites}"}
For this chapter, load the following packages:
```{r}
#| label: missing-setup
#| error: FALSE
#| warning: FALSE
#| message: FALSE
library(tidyverse)
library(survey)
library(srvyr)
library(srvyrexploR)
library(naniar)
library(haven)
library(gt)
```
We are using data from ANES and RECS described in Chapter \@ref(c04-getting-started). As a reminder, here is the code to create the design objects for each to use throughout this chapter. For ANES, we need to adjust the weight so it sums to the population instead of the sample (see the ANES documentation and Chapter \@ref(c04-getting-started) for more information).
```{r}
#| label: missing-anes-des
#| eval: FALSE
targetpop <- 231592693
anes_adjwgt <- anes_2020 %>%
mutate(Weight = Weight / sum(Weight) * targetpop)
anes_des <- anes_adjwgt %>%
as_survey_design(
weights = Weight,
strata = Stratum,
ids = VarUnit,
nest = TRUE
)
```
For RECS, details are included in the RECS documentation and Chapter \@ref(c10-sample-designs-replicate-weights).
```{r}
#| label: missing-recs-des
#| eval: FALSE
recs_des <- recs_2020 %>%
as_survey_rep(
weights = NWEIGHT,
repweights = NWEIGHT1:NWEIGHT60,
type = "JK1",
scale = 59/60,
mse = TRUE
)
```
:::
## Introduction
Missing data in surveys refer to situations where participants do not provide complete responses to survey questions. Respondents may not have seen a question by design. Or, they may not respond to a question for various other reasons, such as not wanting to answer a particular question, not understanding the question, or simply forgetting to answer. Missing data are important to consider and account for, as they can introduce bias and reduce the representativeness of the data. This chapter provides an overview of the types of missing data, how to assess missing data in surveys, and how to conduct analysis when missing data are present. Understanding this complex topic can help ensure accurate reporting of survey results and provide insight into potential changes to the survey design for the future.
## Missing data mechanisms
\index{Item nonresponse|(}There are two main categories that missing data typically fall into: missing by design and unintentional missing data. Missing by design is part of the survey plan and can be more easily incorporated into weights and analyses. Unintentional missing data, on the other hand, can lead to bias in survey estimates if not correctly accounted for. Below we provide more information on the types of missing data.
1. Missing by design/questionnaire skip logic: This type of missingness occurs when certain respondents are intentionally directed to skip specific questions based on their previous responses or characteristics. For example, in a survey about employment, if a respondent indicates that they are not employed, they may be directed to skip questions related to their job responsibilities. Additionally, some surveys randomize questions or modules so that not all participants respond to all questions. In these instances, respondents would have missing data for the modules not randomly assigned to them.
2. Unintentional missing data: This type of missingness occurs when researchers do not intend for there to be missing data on a particular question, for example, if respondents did not finish the survey or refused to answer individual questions. There are three main types of unintentional missing data that each should be considered and handled differently [@mack; @Schafer2002]:
a. Missing completely at random (MCAR): The missing data are unrelated to both observed and unobserved data, and the probability of being missing is the same across all cases. For example, if a respondent missed a question because they had to leave the survey early due to an emergency.
b. Missing at random (MAR): The missing data are related to observed data but not unobserved data, and the probability of being missing is the same within groups. For example, we know the respondents' ages and older respondents choose not to answer specific questions but younger respondents do answer them.
c. Missing not at random (MNAR): The missing data are related to unobserved data, and the probability of being missing varies for reasons we are not measuring. For example, if respondents with depression do not answer a question about depression severity.
## Assessing missing data
Before beginning an analysis, we should explore the data to determine if there is missing data and what types of missing data are present. Conducting descriptive analysis can help with the analysis and reporting of survey data and can inform the survey design in future studies. For example, large amounts of unexpected missing data may indicate the questions were unclear or difficult to recall. There are several ways to explore missing data, which we walk through below. When assessing the missing data, we recommend using a data.frame object and not the survey object, as most of the analysis is about patterns of records, and weights are not necessary.
### Summarize data
\index{American National Election Studies (ANES)|(}
A very rudimentary first exploration is to use the `summary()` function to summarize the data, which illuminates `NA` values in the data. Let's look at a few analytic variables on the ANES 2020 data using `summary()`:
```{r}
#| label: missing-anes-summary
anes_2020 %>%
select(V202051:EarlyVote2020) %>%
summary()
```
We see that there are `NA` values in several of the derived variables (those not beginning with "V") and negative values in the original variables (those beginning with "V"). We can also use the `count()` function to get an understanding of the different types of missing data on the original variables. For example, let's look at the count of data for `V202072`, which corresponds to our `VotedPres2020` variable.
```{r}
#| label: missing-anes-count
anes_2020 %>%
count(VotedPres2020,V202072)
```
Here, we can see that there are three types of missing data, and the majority of them fall under the "Inapplicable" category. This is usually a term associated with data missing due to skip patterns and is considered to be missing data by design. Based on the documentation from ANES [@debell], we can see that this question was only asked to respondents who voted in the election.
### Visualization of missing data
It can be challenging to look at tables for every variable and instead may be more efficient to view missing data in a graphical format to help narrow in on patterns or unique variables. The {naniar} package is very useful in exploring missing data visually. We can use the `vis_miss()` function available in both {visdat} and {naniar} packages to view the amount of missing data by variable (see Figure \@ref(fig:missing-anes-vismiss)) [@visdattierney; @naniar2023].
```{r}
#| label: missing-anes-vismiss
#| warning: FALSE
#| message: FALSE
#| fig.cap: "Visual depiction of missing data in the ANES 2020 data"
#| fig.alt: "This chart shows a the missingness of the selected variables where missing is highlighted in a dark color. Each row of the plot is an observation and each column is a variable. There are some patterns observed such as a large block of missing for `VotedPres2016_selection` and many of the same respondents also having missing for `VotedPres2020_selection`."
anes_2020_derived<-anes_2020 %>%
select(
-starts_with("V2"), -CaseID, -InterviewMode,
-Weight, -Stratum, -VarUnit)
anes_2020_derived %>%
vis_miss(cluster= TRUE, show_perc = FALSE) +
scale_fill_manual(values = book_colors[c(3,1)],
labels = c("Present","Missing"),
name = "") +
theme(
plot.margin=margin(5.5,30,5.5,5.5, "pt"),
axis.text.x=element_text(angle=70))
```
From the visualization in Figure \@ref(fig:missing-anes-vismiss), we can start to get a picture of what questions may be connected in terms of missing data. Even if we did not have the informative variable names, we could deduce that `VotedPres2020`, `VotedPres2020_selection`, and `EarlyVote2020` are likely connected since their missing data patterns are similar.
Additionally, we can also look at `VotedPres2016_selection` and see that there are a lot of missing data in that variable. The missing data are likely due to a skip pattern, and we can look at other graphics to see how they relate to other variables. The {naniar} package has multiple visualization functions that can help dive deeper, such as the `gg_miss_fct()` function, which looks at missing data for all variables by levels of another variable (see Figure \@ref(fig:missing-anes-ggmissfct)).
```{r}
#| label: missing-anes-ggmissfct
#| warning: FALSE
#| message: FALSE
#| fig.cap: Missingness in variables for each level of 'VotedPres2016,' in the ANES 2020 data
#| fig.alt: "This chart has x-axis 'Voted for President in 2016' with labels Yes, No and NA and has y-axis 'Variable' with labels Age, AgeGroup, CampaignInterest, EarlyVote2020, Education, Gender, Income, Income7, PartyID, RaceEth, TrustGovernment, TrustPeople, VotedPres2016_selection, VotedPres2020 and VotedPres2020_selection. There is a legend indicating fill is used to show pct_miss, ranging from 0 represented by fill very pale blue to 100 shown as fill dark blue. Among those that voted for president in 2016, they had little missing for other variables (light color) but those that did not vote have more missing data in their 2020 voting patterns and their 2016 president selection."
anes_2020_derived %>%
gg_miss_fct(VotedPres2016) +
scale_fill_gradientn(
guide = "colorbar",
name = "% Miss",
colors = book_colors[c(3, 2, 1)]
) +
ylab("Variable") +
xlab("Voted for President in 2016")
```
In Figure \@ref(fig:missing-anes-ggmissfct), we can see that if respondents did not vote for president in 2016 or did not answer that question, then they were not asked about who they voted for in 2016 (the percentage of missing data is 100%). Additionally, we can see with Figure \@ref(fig:missing-anes-ggmissfct) that there are more missing data across all questions if they did not provide an answer to `VotedPres2016`.
\index{American National Election Studies (ANES)|)}
\index{Residential Energy Consumption Survey (RECS)|(}
There are other visualizations that work well with numeric data. For example, in the RECS 2020 data, we can plot two continuous variables and the missing data associated with them to see if there are any patterns in the missingness. To do this, we can use the `bind_shadow()` function from the {naniar} package. This creates a nabular (combination of "na" with "tabular"), which features the original columns followed by the same number of columns with a specific `NA` format. These `NA` columns are indicators of whether the value in the original data is missing or not. The example printed below shows how most levels of `HeatingBehavior` are not missing (`!NA`) in the NA variable of `HeatingBehavior_NA`, but those missing in `HeatingBehavior` are also missing in `HeatingBehavior_NA`.
```{r}
#| label: missing-recs-shadow
recs_2020_shadow <- recs_2020 %>%
bind_shadow()
ncol(recs_2020)
ncol(recs_2020_shadow)
recs_2020_shadow %>%
count(HeatingBehavior,HeatingBehavior_NA)
```
We can then use these new variables to plot the missing data alongside the actual data. For example, let's plot a histogram of the total electric bill grouped by those missing and not missing by heating behavior (see Figure \@ref(fig:missing-recs-hist)).
```{r}
#| label: missing-recs-hist
#| fig.cap: "Histogram of energy cost by heating behavior missing data"
#| fig.alt: "This chart has title 'Histogram of Energy Cost by Heating Behavior Missing Data'. It has x-axis 'Total Energy Cost (Truncated at $5000)' with labels 0, 1000, 2000, 3000, 4000 and 5000. It has y-axis 'Number of Households' with labels 0, 500, 1000 and 1500. There is a legend indicating fill is used to show HeatingBehavior_NA, with 2 levels: !NA shown as very pale blue fill and NA shown as dark blue fill. The chart is a bar chart with 30 vertical bars. These are stacked, as sorted by HeatingBehavior_NA."
recs_2020_shadow %>%
filter(TOTALDOL < 5000) %>%
ggplot(aes(x = TOTALDOL, fill = HeatingBehavior_NA)) +
geom_histogram() +
scale_fill_manual(
values = book_colors[c(3, 1)],
labels = c("Present", "Missing"),
name = "Heating Behavior"
) +
theme_minimal() +
xlab("Total Energy Cost (Truncated at $5000)") +
ylab("Number of Households")
```
Figure \@ref(fig:missing-recs-hist) indicates that respondents who did not provide a response for the heating behavior question may have a different distribution of total energy cost compared to respondents who did provide a response. This view of the raw data and missingness could indicate some bias in the data. Researchers take these different bias aspects into account when calculating weights, and we need to make sure that we incorporate the weights when analyzing the data. \index{Residential Energy Consumption Survey (RECS)|)}
There are many other visualizations that can be helpful in reviewing the data, and we recommend reviewing the {naniar} documentation for more information [@naniar2023].
## Analysis with missing data
\index{Imputation|(}
Once we understand the types of missingness, we can begin the analysis of the data. Different missingness types may be handled in different ways. In most publicly available datasets, researchers have already calculated weights and imputed missing values if necessary. Often, there are imputation flags included in the data that indicate if each value in a given variable is imputed. For example, in the RECS data we may see a logical variable of `ZWinterTempNight`, where a value of `TRUE` means that the value of `WinterTempNight` for that respondent was imputed, and `FALSE` means that it was not imputed. We may use these imputation flags if we are interested in examining the nonresponse rates in the original data. For those interested in learning more about how to calculate weights and impute data for different missing data mechanisms, we recommend @Kim2021 and @Valliant2018weights.
Even with weights and imputation, missing data are most likely still present and need to be accounted for in analysis. This section provides an overview on how to recode missing data in R, and how to account for skip patterns in analysis.
\index{Imputation|)}
### Recoding missing data
\index{American National Election Studies (ANES)|(}
Even within a variable, there can be different reasons for missing data. In publicly released data, negative values are often present to provide different meanings for values. For example, in the ANES 2020 data, they have the following negative values to represent different types of missing data:
* --9: Refused
* --8: Don't Know
* --7: No post-election data, deleted due to incomplete interview
* --6: No post-election interview
* --5: Interview breakoff (sufficient partial IW)
* --4: Technical error
* --3: Restricted
* --2: Other missing reason (question specific)
* --1: Inapplicable
When we created the derived variables for use in this book, we coded all negative values as `NA` and proceeded to analyze the data. For most cases, this is an appropriate approach as long as we filter the data appropriately to account for skip patterns (see Section \@ref(missing-skip-patt)). However, the {naniar} package does have the option to code special missing values. For example, if we wanted to have two `NA` values, one that indicated the question was missing by design (e.g., due to skip patterns) and one for the other missing categories, we can use the `nabular` format to incorporate these with the `recode_shadow()` function.
```{r}
#| label: missing-anes-shadow-recode
anes_2020_shadow<-anes_2020 %>%
select(starts_with("V2")) %>%
mutate(across(everything(),~case_when(.x < -1 ~ NA,
TRUE~.x))) %>%
bind_shadow() %>%
recode_shadow(V201103 = .where(V201103==-1~"skip"))
anes_2020_shadow %>%
count(V201103,V201103_NA)
```
However, it is important to note that at the time of publication, there is no easy way to implement `recode_shadow()` to multiple variables at once (e.g., we cannot use the tidyverse feature of `across()`). The example code above only implements this for a single variable, so this would have to be done manually or in a loop for all variables of interest. \index{American National Election Studies (ANES)|)}
### Accounting for skip patterns {#missing-skip-patt}
When questions are skipped by design in a survey, it is meaningful that the data are later missing. For example, the RECS asks people how they control the heat in their home in the winter (`HeatingBehavior`). This is only among those who have heat in their home (`SpaceHeatingUsed`). If there is no heating equipment used, the value of `HeatingBehavior` is missing. One has several choices when analyzing these data which include: (1) only including those with a valid value of `HeatingBehavior` and specifying the universe as those with heat or (2) including those who do not have heat. It is important to specify what population an analysis generalizes to.
\index{Residential Energy Consumption Survey (RECS)|(}
Here is an example where we only include those with a valid value of `HeatingBehavior` (choice 1). Note that we use the design object (`recs_des`) and then \index{Functions in srvyr!filter|(}filter to those that are not missing on `HeatingBehavior`.\index{Functions in srvyr!survey\_prop} \index{Functions in srvyr!summarize|(}
```{r}
#| label: missing-recs-heatcc
heat_cntl_1 <- recs_des %>%
filter(!is.na(HeatingBehavior)) %>%
group_by(HeatingBehavior) %>%
summarize(
p=survey_prop()
)
heat_cntl_1
```
\index{Functions in srvyr!filter|)}
Here is an example where we include those who do not have heat (choice 2). To help understand what we are looking at, we have included the output to show both variables, `SpaceHeatingUsed` and `HeatingBehavior`. \index{Functions in srvyr!survey\_prop} \index{Functions in srvyr!interact|(}
```{r}
#| label: missing-recs-heatpop
heat_cntl_2 <- recs_des %>%
group_by(interact(SpaceHeatingUsed, HeatingBehavior)) %>%
summarize(
p=survey_prop()
)
heat_cntl_2
```
\index{Functions in srvyr!interact|)} \index{Functions in srvyr!summarize|)}
```{r}
#| label: missing-recs-heattext
#| echo: FALSE
pct_1 <- heat_cntl_1 %>%
filter(str_detect(HeatingBehavior, "Program")) %>%
mutate(p=round(p*100, 1)) %>%
pull(p)
pct_2 <- heat_cntl_2 %>%
filter(str_detect(HeatingBehavior, "Program")) %>%
mutate(p=round(p*100, 1)) %>%
pull(p)
```
If we ran the first analysis, we would say that `r pct_1`% of households with heat use a programmable or smart thermostat for heating their home. If we used the results from the second analysis, we would say that `r pct_2`% of households use a programmable or smart thermostat for heating their home. The distinction between the two statements is made bold for emphasis. Skip patterns often change the universe we are talking about and need to be carefully examined. \index{Residential Energy Consumption Survey (RECS)|)}
\index{American National Election Studies (ANES)|(}
Filtering to the correct universe is important when handling these types of missing data. The `nabular` we created above can also help with this. If we have `NA_skip` values in the shadow, we can make sure that we filter out all of these values and only include relevant missing values. To do this with survey data, we could first create the `nabular`, then create the \index{Functions in srvyr!as\_survey\_design|(} design object on that data, and then use the shadow variables to assist with filtering the data. Let's use the `nabular` we created above for ANES 2020 (`anes_2020_shadow`) to create the design object.
```{r}
#| label: missing-anes-shadow-des
#| warning: FALSE
anes_adjwgt_shadow <- anes_2020_shadow %>%
mutate(V200010b = V200010b/sum(V200010b)*targetpop)
anes_des_shadow <- anes_adjwgt_shadow %>%
as_survey_design(
weights = V200010b,
strata = V200010d,
ids = V200010c,
nest = TRUE
)
```
\index{Functions in srvyr!as\_survey\_design|)}
Then, we can use this design object to look at the percentage of the population who voted for each candidate in 2016 (`V201103`). First, let's look at the percentages without removing any cases: \index{Functions in srvyr!survey\_prop} \index{Functions in srvyr!summarize|(}
```{r}
#| label: missing-anes-shadow-ex1
pres16_select1<-anes_des_shadow %>%
group_by(V201103) %>%
summarize(
All_Missing=survey_prop()
)
pres16_select1
```
Next, we look at the percentages, removing only those missing due to skip patterns (i.e., they did not receive this question). \index{Functions in srvyr!survey\_prop} \index{Functions in srvyr!filter|(}
```{r}
#| label: missing-anes-shadow-ex2
pres16_select2<-anes_des_shadow %>%
filter(V201103_NA!="NA_skip") %>%
group_by(V201103) %>%
summarize(
No_Skip_Missing=survey_prop()
)
pres16_select2
```
Finally, we look at the percentages, removing all missing values both due to skip patterns and due to those who refused to answer the question. \index{Functions in srvyr!survey\_prop}
```{r}
#| label: missing-anes-shadow-ex3
pres16_select3<-anes_des_shadow %>%
filter(V201103_NA=="!NA") %>%
group_by(V201103) %>%
summarize(
No_Missing=survey_prop()
)
pres16_select3
```
\index{Functions in srvyr!filter|)} \index{Functions in srvyr!summarize|)}
```{r}
#| label: missing-anes-shadow-gt
#| echo: FALSE
pres16_select_gt<-pres16_select1 %>%
full_join(pres16_select2,by="V201103") %>%
full_join(pres16_select3,by="V201103") %>%
mutate(Candidate=case_when(V201103==-1~"Did Not Vote for President in 2016",
V201103==1~"Hillary Clinton",
V201103==2~"Donald Trump",
V201103==5~"Other Candidate",
TRUE~"Missing")) %>%
select(Candidate,everything()) %>%
select(-V201103) %>%
gt() %>%
cols_label(All_Missing = "%",
All_Missing_se = "s.e. (%)",
No_Skip_Missing = "%",
No_Skip_Missing_se = "s.e. (%)",
No_Missing = "%",
No_Missing_se = "s.e. (%)") %>%
tab_spanner(label = "Including All Missing Data",
columns = c(All_Missing, All_Missing_se)) %>%
tab_spanner(label = "Removing Skip Patterns Only",
columns = c(No_Skip_Missing, No_Skip_Missing_se)) %>%
tab_spanner(label = "Removing All Missing Data",
columns = c(No_Missing, No_Missing_se)) %>%
fmt_number(decimals = 1, scale_by=100)
```
(ref:missing-anes-shadow-tab) Percentage of votes by candidate for different missing data inclusions
```{r}
#| label: missing-anes-shadow-tab
#| echo: FALSE
#| warning: FALSE
pres16_select_gt %>%
print_gt_book(knitr::opts_current$get()[["label"]])
```
```{r}
#| label: missing-anes-shadow-tab-text
#| echo: FALSE
#| warning: FALSE
pres16_select1_1<-pres16_select1 %>%
filter(V201103==1) %>%
pull(All_Missing)
pres16_select1_2<-pres16_select1 %>%
filter(V201103==2) %>%
pull(All_Missing)
pres16_select1_out<-round(pres16_select1_1*100-pres16_select1_2*100,1)
pres16_select2_1<-pres16_select2 %>%
filter(V201103==1) %>%
pull(No_Skip_Missing)
pres16_select2_2<-pres16_select2 %>%
filter(V201103==2) %>%
pull(No_Skip_Missing)
pres16_select2_out<-round(pres16_select2_1*100-pres16_select2_2*100,1)
```
As Table \@ref(tab:missing-anes-shadow-tab) shows, the results can vary greatly depending on which type of missing data are removed. If we remove only the skip patterns, the margin between Clinton and Trump is `r pres16_select2_out` percentage points; but if we include all data, even those who did not vote in 2016, the margin is `r pres16_select1_out` percentage points. How we handle the different types of missing values is important for interpreting the data.
\index{Item nonresponse|)} \index{American National Election Studies (ANES)|)}
\index{Missing data|)}