@@ -25,9 +25,8 @@ source("_common.R")

## Getting data into `epi_archive` format

- An `epi_archive` object
- can be constructed from a data frame, data table, or tibble, provided that it
- has (at least) the following columns:
+ An `epi_archive` object can be constructed from a data frame, data table, or
+ tibble, provided that it has (at least) the following columns:

* `geo_value`: the geographic value associated with each row of measurements.
* `time_value`: the time value associated with each row of measurements.
@@ -55,10 +54,10 @@ class(x)
print(x)
```

- An `epi_archive` is special kind of class called an R6 class. Its primary field
- is a data table `DT`, which is of class `data.table` (from the `data.table`
- package), and has columns `geo_value`, `time_value`, `version`, as well as any
- number of additional columns.
+ An `epi_archive` is an S3 class. Its primary field is a data table `DT`, which
+ is of class `data.table` (from the `data.table` package), and has columns
+ `geo_value`, `time_value`, `version`, as well as any number of additional
+ columns.

```{r}
class(x$DT)
@@ -70,33 +69,18 @@ for the data table, as well as any other specified in the metadata (described
below). There can only be a single row per unique combination of key variables,
and therefore the key variables are critical for figuring out how to generate a
snapshot of data from the archive, as of a given version (also described below).
-
+
```{r, error=TRUE}
key(x$DT)
```
-
- In general, the last version of each observation is carried forward (LOCF) to
- fill in data between recorded versions. **A word of caution:** R6 objects,
- unlike most other objects in R, have reference semantics. An important
- consequence of this is that objects are not copied when modified.
-
- ```{r}
- original_value <- x$DT$percent_cli[1]
- y <- x # This DOES NOT make a copy of x
- y$DT$percent_cli[1] = 0
- head(y$DT)
- head(x$DT)
- x$DT$percent_cli[1] <- original_value
- ```

- To make a copy, we can use the `clone()` method for an R6 class, as in `y <-
- x$clone()`. You can read more about reference semantics in Hadley Wickham's
- [Advanced R](https://adv-r.hadley.nz/r6.html#r6-semantics) book.
+ In general, the last version of each observation is carried forward (LOCF) to
+ fill in data between recorded versions.
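
As a rough illustration of how the key variables and LOCF determine a snapshot, here is a minimal sketch in plain `data.table`. It is not the `epiprocess` implementation: the table `toy`, its values, and the helper `snapshot_as_of()` are hypothetical names introduced only for this example.

```r
library(data.table)

# Hypothetical versioned data: one row per (geo_value, time_value, version).
toy <- data.table(
  geo_value   = c("ca", "ca", "ca", "ca"),
  time_value  = as.Date(c("2021-05-28", "2021-05-28", "2021-05-29", "2021-05-29")),
  version     = as.Date(c("2021-05-30", "2021-06-02", "2021-06-01", "2021-06-03")),
  percent_cli = c(1.0, 1.2, 2.0, 2.5)
)
setkey(toy, geo_value, time_value, version)

# Hypothetical helper: keep rows published on or before `max_version`, then
# take the most recent remaining version within each (geo_value, time_value).
snapshot_as_of <- function(DT, max_version) {
  visible <- DT[version <= max_version]
  visible[, .SD[which.max(version)], by = .(geo_value, time_value)]
}

snap <- snapshot_as_of(toy, as.Date("2021-06-01"))
print(snap)
```

With a `max_version` of June 1, 2021, the revisions issued on June 2 and June 3 are ignored, so each observation falls back to the last value recorded on or before that date.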

## Some details on metadata

The following pieces of metadata are included as fields in an `epi_archive`
- object:
+ object:

* `geo_type`: the type for the geo values.
* `time_type`: the type for the time values.
@@ -112,10 +96,8 @@ call (as it did in the case above).

A key method of an `epi_archive` class is `as_of()`, which generates a snapshot
of the archive in `epi_df` format. This represents the most up-to-date values of
- the signal variables as of a given version. This can be accessed via `x$as_of()`
- for an `epi_archive` object `x`, but the package also provides a simple wrapper
- function `epix_as_of()` since this is likely a more familiar interface for users
- not familiar with R6 (or object-oriented programming).
+ the signal variables as of a given version. This can be accessed via
+ `epix_as_of()`.

```{r}
x_snapshot <- epix_as_of(x, max_version = as.Date("2021-06-01"))
@@ -125,7 +107,7 @@ max(x_snapshot$time_value)
attributes(x_snapshot)$metadata$as_of
```

- We can see that the max time value in the `epi_df` object `x_snapshot` that was
+ We can see that the max time value in the `epi_df` object `x_snapshot` that was
generated from the archive is May 29, 2021, even though the specified version
date was June 1, 2021. From this we can infer that the doctor's visits signal
was 2 days latent on June 1. Also, we can see that the metadata in the `epi_df`
@@ -134,7 +116,7 @@ object has the version date recorded in the `as_of` field.
By default, using the maximum of the `version` column in the underlying data table in an
`epi_archive` object itself generates a snapshot of the latest values of signal
variables in the entire archive. The `epix_as_of()` function issues a warning in
- this case, since updates to the current version may still come in at a later
+ this case, since updates to the current version may still come in at a later
point in time, due to various reasons, such as synchronization issues.

```{r}
@@ -143,15 +125,15 @@ x_latest <- epix_as_of(x, max_version = max(x$DT$version))

Below, we pull several snapshots from the archive, spaced one month apart. We
overlay the corresponding signal curves as colored lines, with the version dates
- marked by dotted vertical lines, and draw the latest curve in black (from the
+ marked by dotted vertical lines, and draw the latest curve in black (from the
latest snapshot `x_latest` that the archive can provide).

```{r, fig.width = 8, fig.height = 7}
self_max <- max(x$DT$version)
versions <- seq(as.Date("2020-06-01"), self_max - 1, by = "1 month")
snapshots <- map(
-   versions,
-   function(v) {
+   versions,
+   function(v) {
    epix_as_of(x, max_version = v) %>% mutate(version = v)
  }) %>%
  list_rbind() %>%
@@ -162,37 +144,35 @@ snapshots <- map(
```{r, fig.height=7}
#| code-fold: true
ggplot(snapshots %>% filter(!latest),
-   aes(x = time_value, y = percent_cli)) +
-   geom_line(aes(color = factor(version)), na.rm = TRUE) +
+   aes(x = time_value, y = percent_cli)) +
+   geom_line(aes(color = factor(version)), na.rm = TRUE) +
  geom_vline(aes(color = factor(version), xintercept = version), lty = 2) +
  facet_wrap(~ geo_value, scales = "free_y", ncol = 1) +
  scale_x_date(minor_breaks = "month", date_labels = "%b %Y") +
  scale_color_viridis_d(option = "A", end = .9) +
-   labs(x = "Date", y = "% of doctor's visits with CLI") +
+   labs(x = "Date", y = "% of doctor's visits with CLI") +
  theme(legend.position = "none") +
  geom_line(data = snapshots %>% filter(latest),
-     aes(x = time_value, y = percent_cli),
+     aes(x = time_value, y = percent_cli),
    inherit.aes = FALSE, color = "black", na.rm = TRUE)
```

We can see some interesting and highly nontrivial revision behavior: at some
points in time the provisional data snapshots grossly underestimate the latest
curve (look in particular at Florida close to the end of 2021), and at others
- they overestimate it (both states towards the beginning of 2021), though not
+ they overestimate it (both states towards the beginning of 2021), though not
quite as dramatically. Modeling the revision process, which is often called
*backfill modeling*, is an important statistical problem in and of itself.


- ## Merging `epi_archive` objects
+ ## Merging `epi_archive` objects

Now we demonstrate how to merge two `epi_archive` objects together, e.g., so
that grabbing data from multiple sources as of a particular version can be
- performed with a single `as_of` call. The `epi_archive` class provides a method
- `merge()` precisely for this purpose. The wrapper function is called
- `epix_merge()`; this wrapper avoids mutating its inputs, while `x$merge` will
- mutate `x`. Below we merge the working `epi_archive` of versioned percentage CLI
- from outpatient visits to another one of versioned COVID-19 case reporting data,
- which we fetch the from the [COVIDcast
+ performed with a single `as_of` call. The `epiprocess` package provides
+ `epix_merge()` for this purpose. Below we merge the working `epi_archive` of
+ versioned percentage CLI from outpatient visits to another one of versioned
+ COVID-19 case reporting data, which we fetch from the [COVIDcast
API](https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html/), on the
rate scale (counts per 100,000 people in the population).

@@ -209,7 +189,7 @@ When merging archives, unless the archives have identical data release patterns,
the other).

```{r, message = FALSE, warning = FALSE,eval=FALSE}
- # This code is for illustration and doesn't run.
+ # This code is for illustration and doesn't run.
# The result is saved/loaded in the (hidden) next chunk from `{epidatasets}`
y <- covidcast(
  data_source = "jhu-csse",
@@ -224,24 +204,13 @@ y <- covidcast(
  select(geo_value, time_value, version = issue, case_rate_7d_av = value) %>%
  as_epi_archive(compactify = TRUE)

- x$merge( y, sync = "locf", compactify = FALSE)
+ x <- epix_merge(x, y, sync = "locf", compactify = FALSE)
print(x)
head(x$DT)
```
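
The change from `x$merge(y, ...)` to `x <- epix_merge(x, y, ...)` follows the usual R pattern in which a function returns a new object and the caller chooses whether to overwrite `x`. The toy sketch below illustrates that pattern only; the class `toy_archive` and the helpers `new_toy_archive()` and `toy_merge()` are hypothetical, use plain data frames, and do none of the version alignment that `epix_merge()` performs via `sync`.

```r
# Hypothetical minimal S3 class wrapping a data frame in a `DT` field.
new_toy_archive <- function(DT) {
  structure(list(DT = DT), class = "toy_archive")
}

# Hypothetical merge: full join on the key columns, returned as a NEW object.
toy_merge <- function(x, y) {
  keys <- c("geo_value", "time_value", "version")
  new_toy_archive(merge(x$DT, y$DT, by = keys, all = TRUE))
}

x <- new_toy_archive(data.frame(
  geo_value = "ca",
  time_value = as.Date("2021-05-28"),
  version = as.Date("2021-05-30"),
  percent_cli = 1.0
))
y <- new_toy_archive(data.frame(
  geo_value = "ca",
  time_value = as.Date("2021-05-28"),
  version = as.Date("2021-05-30"),
  case_rate_7d_av = 10
))

xy <- toy_merge(x, y)

# The input is left as it was; only the returned object has the new column.
stopifnot(!"case_rate_7d_av" %in% names(x$DT))
stopifnot("case_rate_7d_av" %in% names(xy$DT))

# Overwriting `x` is an explicit choice made by the caller.
x <- toy_merge(x, y)
print(names(x$DT))
```

Keeping the result in a separate name such as `xy` leaves the original archive available for comparison, while reassigning to `x` mirrors the chunk above.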

- ```{r, echo=FALSE}
- x <- archive_cases_dv_subset
- print(x)
- head(x$DT)
- ```
-
- Importantly, see that `x$merge` mutated `x` to hold the result of the merge. We
- could also have used `xy = epix_merge(x, y)` to avoid mutating `x`. See the
- documentation for either for more detailed descriptions of what mutation,
- pointer aliasing, and pointer reseating is possible.
-
## Sliding version-aware computations
-
+
::: {.callout-note}
TODO: need a simple example here.
:::