-
Notifications
You must be signed in to change notification settings - Fork 169
/
Copy pathExercise_03_BigData.Rmd
355 lines (264 loc) · 21.1 KB
/
Exercise_03_BigData.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
Exercise 3: Tools for big data
========================================================
The objective of today's exercise is to provide a quick introduction to some common tools for dealing with big data. For each tool we are just using the most basic syntax and you are encouraged to go back and read the help for each at a later date. This exercise also focuses on "general purpose" tools. There are a multitude of R libraries available for accessing specific data sources and web services. A quick summary of some of these is available at http://cran.r-project.org/web/views/WebTechnologies.html. In addition, a Google search on many of the tools and topics covered in Chapters 3 and 4 will provide a lot of additional info on big data tools outside of R.
Note: The code in this exercise will download data off the web dynamically, which can take some time, so try to "knit" infrequently.
```{r,echo=FALSE}
## since libraries will be pulled, make sure repository is set
repos = "http://cran.us.r-project.org"
get.pkg <- function(pkg){
loaded <- do.call("require",list(package=pkg))
if(!loaded){
print(paste("trying to install",pkg))
install.packages(pkg,dependencies=TRUE,repos=repos)
loaded <- do.call("require",list(package=pkg))
if(loaded){
print(paste(pkg,"installed and loaded"))
}
else {
stop(paste("could not install",pkg))
}
}
}
get.pkg("RCurl")
get.pkg("readr")
get.pkg("XML")
get.pkg("arrow")
get.pkg("devtools")
get.pkg("MODISTools")
get.pkg("EML")
get.pkg("cronR")
get.pkg("tidyverse")
```
Pulling data directly off the web
---------------------------------
In the previous exercises we loaded data into R using functions like read.csv. However, it is also possible to read data into R directly off the web by passing a web address to the file name. For smaller files that are quick to load this approach can ensure that the script is always operating with the most up-to-date version of a data file.
```{r}
gflu = readr::read_csv("https://raw.githubusercontent.com/EcoForecast/EF_Activities/master/data/gflu_data.txt",skip = 11)
time = as.Date(gflu$Date)
plot(time,gflu$"Boston, MA",type='l')
```
That said, for publication purposes it is usually important to save the data that you used for an analysis, and that the date of access is recorded (and version number if available), as some datasets are subject to frequent revision.
In this example, the file in question has an extensive header, which we skip during the load of the data, but as with any dataset, this metadata is important to read before using the data.
```
Google Flu Trends - United States
Copyright 2013 Google Inc.
Exported data may be used for any purpose, subject to the Google Terms of Service (http://www.google.com/accounts/TOS?hl=en_US).
If you choose to use the data, please attribute it to Google as follows: "Data Source: Google Flu Trends (http://www.google.org/flutrends)".
Each week begins on the Sunday (Pacific Time) indicated for the row.
Data for the current week will be updated each day until Saturday (Pacific Time).
Note: To open these files in a spreadsheet application, we recommend you save each text file as a CSV spreadsheet.
For more information, please visit http://www.google.org/flutrends
```
**Question 1: [A]**
Make a simple time-series plot of the Harvard Forest (HARV) Phenocam phenology data (gcc_90), which EFI has preprocessed and saved here https://data.ecoforecast.org/neon4cast-targets/phenology/phenology-targets.csv.gz as part of the NEON forecasting challenge. Don't forget to filter this data by site and variable!
grep, system, RegExp
--------------------
## GREP
`grep` is a handy little _command prompt_ function that returns lines from a file that match a search string. I continue to use this 'old school' utility on a daily basis to help manage code and data because this simple little search continues to be able to perform actions that elude newer search software:
- `grep` looks within files, but is able to search across file and recursively within a directory structure. I use this constantly to follow variables or functions through complex code. For example, if I wanted to find uses of the term _phenology_ in my current directory and all subdirectories, I could type the following in my Terminal (not my R prompt)
```
grep -ir "phenology" .
### Note: Some Windows users won't have grep installed but
### will have a similar function, findstr
### system("cmd.exe",input='findstr "phenology" *.Rmd')
### findstr is more limited and it is OK to skip examples that don't run
```
here the -i means ignore case when searching, the -r means to search recursively through subdirectories, and the `.` means to start from the current directory. Used in this way grep can help you quickly find your way through new and complex code, iteratively hopping through the code from one search to another. It is also extremely helpful in debugging, when a program returns a cryptic error message and you want to find _where_ in the code that message was generated.
- `grep` returns the full lines/rows that match a search, allowing one to quickly and easily subset large datasets into smaller files and/or merge subsets across different files.
## RegExp
- `grep` supports **Regular Expressions**, both within the search itself and in the set of filenames searched. For example, if we wanted to find all lines that contain 'phenology', in all the `.Rmd` files in the current directory we could type
```
grep -ir 'phenology' *.Rmd
```
where the * means 'match zero or more occurances of any character', in this case preceeding .Rmd (the zero part means this would match a file just named .Rmd). If I just wanted to find instances where `phenology` is at the start of the line I could use the `^` to indicate the beginning of the line
```
grep -ir '^phenology' *.Rmd
```
If I instead wanted to broaden my search to instances where `pheno` is followed immediately by another letter I could use [a-z] to match just letters in the English alphabet, which would pick up phenological and phenocam.
```
grep -ir 'pheno[a-z]' *.Rmd
```
or I could be more specific an just look for specific letters, e.g. pheno[cl] would match phenoc and phenol but not phenom. A full description of regular expressions is beyond the scope of this tutorial, and RegExp statements for matching complex patterns can quickly become cryptic, so following up on this further is left to the reader.
## system()
There are often times when working in R that one needs to run another command, script, or piece of software that is external to R. If I'm working in an R script want the operating system to run a command I can do this with the `system` command
```{r}
system('grep -ir "pheno" *.Rmd')
```
Furthermore, often we want to capture the output of that command directly into R, which we can do using the `intern` flag:
```{r}
pheno.lines = system('grep -ir "pheno" *.Rmd',intern=TRUE)
pheno.lines[1:3]
```
## grep()
Finally, it is also worth mentioning that R has its own, internal, version of grep that can be useful for searching and subsetting data and which also supports RegExp. Unlike the command-line version of grep, this function returns the row numbers matching the search string. In the example below we use the function readLines to read unstructured text in as vector of strings, one corresponding to each row in a file. It also demonstrates the function `sub`, which is related to grep but which substitutes the matching string rather than just finding it.
```{r}
myCode = readLines("Exercise_03_BigData.Rmd") ## read unstructured text
x = grep("HARV",myCode) ## returns the line numbers that include the string 'HARV'
myCode[x]
sub("HARV","BART",myCode[x]) ## substitute FIRST: HARV for BART
gsub("HARV","BART",myCode[x]) ## substitute ALL: HARV for BART
```
**Question 2: [B]** Within the object myCode, find all the lines that begin with the comment character, #.
Using APIs
----------
In addition to data that are directly downloadable there are a number of places on the web where data are available though interactive, code-based webservices called Application Programming Interfaces (APIs). In this example we will access the NASA MODIS API, using a pre-existing R package called MODISTools, as a demonstration of one of the many dataset-specific R packages.
First, we'll query the MODIS server to see what data products are available and what variables (bands) are available within one of those data products. More details about each data product is available at https://modis.ornl.gov/sites/?id=us_massachusetts_harvard_forest_neon
where we see that the tool has been expanded to also include data products for VIIRS, SMAP, DAYMET, GEDI, ECOSTRESS and SIF as well.
```{r}
MODISTools::mt_products()
MODISTools::mt_bands(product="MOD13Q1") ## vegetation indices
```
Next, lets grab the data for a specific band (EVI) within a specific product (MOD13Q1). We'll focus on the NEON Harvard Forest (HARV) locationand look at one of the same years as we did with the phenology data from question 1. The argument Size defines the dimensions of the box grabbed in terms of distance (in kilometers) outward from the center. Note that in practice we would also want to query the QAQC data for this variable, `250m_16_days_VI_Quality`, as well and use it to screen the data.
```{r}
EVI_file = "MODIS.HARV.RData"
if(file.exists(EVI_file)){
load(EVI_file)
} else {
subset <- MODISTools::mt_subset(product = "MOD13Q1",
band = "250m_16_days_EVI",
lat=42.5369,
lon=-72.1727,
start="2021-01-01",
end="2021-12-31",
km_lr = 1,
km_ab = 1,
site_name = "HARV")
save(subset,file=EVI_file)
}
## let's look at the first few rows to get a feel for the structure
head(subset)
```
Here we extracted a 250m data products and looked +/ 1km in both directions, which gives us a 9x9 area and thus 81 pixels.
```{r}
unique(subset$pixel)
```
For this example lets average over the spatial data and just generate a time-series of EVI.
```{r}
## average EVI spatially & use 'scale' to set units
EVI = tapply(subset$value*as.numeric(subset$scale), subset$calendar_date, mean,na.rm=TRUE)
time = as.Date(names(EVI))
```
**Question 3: [B]** Plot EVI versus time and compare to the NEON phenocam observations.
FYI, APIs now exist for an enormous range of data sources that are available on the web, as well as a range of web-services that allow you to not just download data but to also upload data or push requests various cloud platforms.
Cloud-Native data
-----------------
One area of working with big data that has evolved rapidly since work began on the Ecological Forecasting book is the growth of cloud data storage and cloud-native file formats. Consistent with the ideas in the book there are RAM-limited and storage-limited problems, beyond which data do not fit in a single file or on a single computer, which necessitate _tiling_ data across multiple files that potentially live on multiple computers. Cloud-native data storage formats now make the generation and access of such tiled data seamless, such that a huge dataset that is spread across many individual files in the cloud might behave like a single dataframe. Standard operations (e.g. filtering) can then be done on such data without having to download individual files to your own computer, and with the remote files organized in a way that the system only has to touch the ones that actually contain the files of interest.
File formats that have increased in use in recent year include parquet for tabular data, cloud-native GeoTIFF for raster spatial data, and zarr for mixed-format and high-dimensional data (e.g. an array that is x, y, z, time, attributes).
R support for parquet comes via the Apache `arrow` package. For example, if we want to connect to the Ecological Forecasting Initiative's catalog of ecological forecasts, which are stored in a Amazon-style S3 bucket (a cloud native data store that itself abstracts away the concept of drives, folders, and file paths) we could do so using the syntax:
```{r}
if(file.exists("pheno.Rdata")){
load("pheno.Rdata")
} else{ ## to make knit faster, only download the data the first time
s3 <- arrow::s3_bucket(bucket = "neon4cast-scores/parquet/phenology",
endpoint_override = "data.ecoforecast.org",
anonymous=TRUE)
ds <- arrow::open_dataset(s3)
ds
df <- ds |>
filter(model_id == "climatology",
variable=="gcc_90",
site_id == "HARV") |>
collect()
save(df,file="pheno.Rdata")
} ## end download
head(df)
range(df$reference_datetime)
```
In the first part of the code, ds is just a connection to a cloud service, but it behaves like a dataframe (e.g., the line with just `ds` will display the list of variable names and variable types for the cloud-based dataframe). Next, we can use standard tidyverse syntax to process this file (e.g. filtering by model, variable, and location) and only when we call the command `collect()` does the processing actually execute in the cloud and send us the data we're requesting. For example, in the command above we request the phenology forecasts of the Phenocam greeness variable `gcc_90` from the `climatology` model for the NEON Harvard Forest site (HARV).
Data tiling is also a particularly useful feature for forecasting, as some such formats let you add new data without having to touch the previously written files. Thus one can append additional days of forecast inputs and outputs incrementally without having to load a large file into memory or rewrite it to storage.
For example, if we split the EFI phenology data into two sections, we can write each section out separately in parquet
```{r}
## split data
dfa <- df |> mutate(reference_datetime = lubridate::as_date(reference_datetime)) |>
filter(reference_datetime < lubridate::as_date("2022-01-01"))
dfb <- df |> mutate(reference_datetime = lubridate::as_date(reference_datetime)) |>
filter(reference_datetime >= lubridate::as_date("2022-01-01"))
## write out to the folder "parquetTest", tiling by reference_datetime
dir.create("parquetTest")
dfa |> group_by(reference_datetime) |> arrow::write_dataset("parquetTest",format="parquet")
dfb |> group_by(reference_datetime) |> arrow::write_dataset("parquetTest",format="parquet")
```
If you look in the folder you'll see that the files are organized by reference_datetime. We can now access these files as we did before, using `arrow::open_dataset` to open the whole _folder_. As before the resulting object behaves like a dataframe, even though it's actually pointing to a whole folder, and one can operate on this object without having to load all the individual files.
```{r}
dfc = arrow::open_dataset("parquetTest")
dfc
```
**Question 4: [A]** Use arrow to access the NOAA GEFS weather forecast and filter this data to the forecast issued yesterday at midnight for the NEON site_id "HARV". The EFI NEON challenge stores this data at the endpoint "data.ecoforecast.org" in a S3 bucket named "neon4cast-drivers/noaa/gefs-v12/stage2/parquet/0/2024-01-01". Note in this case that once could select a different data by changing the last folder referenced, or could access multiple dates by leaving of a date at the end of the bucket. Make a timeseries plot of the "air_temperature" data in this forecast that shows one curve for each ensemble member (there are 31 ensemble members total). Hint: make sure to filter all relevant variables before calling `collect`.
EML Metadata
------------
Data is usually only as the valuable as the meta-data associated with it. While some data is internally documented (e.g. netCDF) or highly standardized (e.g. MODIS), most data need to have external documentation.
A particularly common metadata standard in ecology is the Ecological Metadata Language (EML). This is also the standard that the Ecological Forecasting Initative built upon when developing a community standard for documenting forecasts outputs (yes, forecasts need metadata too!). In the example below we explore the structure and content of the metadata for a simple forecast.
```{r}
## load example metadata from the EFI standard github repo
md <- read_eml("https://raw.githubusercontent.com/eco4cast/EFIstandards/master/inst/extdata/forecast-eml.xml")
```
We'll then use `eml_get` to extract basic information about the forecast
```{r}
eml_get(md, "title")
eml_get(md, "abstract")
eml_get(md, "creator")
```
Next, we can learn about the spatial, temporal, and taxonomic coverage of the forecast
```{r}
eml_get(md, "coverage")
```
Next, let's look at the structure of the dataset itself
```{r}
dt_md <- eml_get(md, "dataset")
eml_get(dt_md, "physical")
get_attributes(dt_md$dataTable$attributeList)
```
The EFI standard also includes some additional Metadata fields related specifically to what uncertainties are included in a forecast
```{r}
eml_get(md$additionalMetadata, "forecast")
```
In the example above the `complexity` variables record the dimension of each uncertainty term (number of parameters, number of process error variances, etc.).
More information about the EFI metadata standard can be found here https://github.com/eco4cast/EFIstandards and in Dietze et al 2023 Ecosphere https://dx.doi.org/10.1002/ecs2.4686
**Question 5 [A]**
Based on the metadata above, what were the identities of the species_1 and species_2 columns in the forecast file, what units were used, and did this forecast propagate the initial condition uncertainty for these species? For full credit, include code that answers these questions rather than just looking for the answers in the output.
cron
----
The last topic I wanted to touch on isn't data processing per se, but is handy for scheduling the automatic execution of tasks, and thus is frequently used in dynamic big data problems where new data are arriving on a regular basis and analyses need to be updated. An obvious example in the context of this course would be a forecast that would be updated on a daily or weekly basis. [note: like grep, cron is a *nix utility, so will run on linux, unix, and Mac OS, but not Windows].
cron jobs are specified in the cron table using the function `crontab` with takes the arguments -l to list the current contents or -e to edit the contents. The file contains a header component that allows us to specify information such as the shell used (SHELL=), path variables (PATH=), who to email job status updates (MAILTO=), and the directory to start from (HOME=), each on a separate line. Below the header is the table of the cron jobs themselves. A cron job consists of two components, the scheduling information and the command/script/program to be run. Lets take a look at a simple cron table
```
MAILTO=dietze@bu.edu
55 */2 * * * /home/scratch/dietze_lab/NOMADS/get_sref.sh
```
The last part of this is the easiest to explain -- we're running a script called get_sref from the NOMADS folder. NOMADS is the NOAA met server and SREF is one of their weather forecast products, so it should come as no surprise that this script is grabbing the numerical weather forecast. The first part of the script is more cryptic, but the five values given correspond to:
```
minute This controls what minute of the hour the command will run on,
and is between '0' and '59'
hour This controls what hour the command will run on, and is specified in
the 24 hour clock, values must be between 0 and 23 (0 is midnight)
dom This is the Day of Month, that you want the command run on, e.g. to
run a command on the 19th of each month, the dom would be 19.
month This is the month a specified command will run on, it may be specified
numerically (0-12), or as the name of the month (e.g. May)
dow This is the Day of Week that you want a command to be run on, it can
also be numeric (0-7) or as the name of the day (e.g. sun).
```
Values that are not specified explicitly are filled in with a *. Also, it is possible to specify lists (e.g. 0,6,12,18) or to specify a repeat frequency using a /. Thus the above example is set to run every other hour (/2) at 55 min past the hour.
The `cronR` library in R provides an R-based interface to cron, including an intuitive GUI for configuring the automation of scripts
```
cronR::cron_rstudioaddin()
```
as well as a set of command line tools
```
## get the path to the helloworld.R script
f <- system.file(package = "cronR", "extdata", "helloworld.R")
## set up command line syntax
cmd <- cron_rscript(f)
cmd
## add the script to CRON
cron_add(command = cmd, frequency = 'daily', at='7AM', id = 'test2')
## list the number of cron jobs running
cron_njobs()
## list current cron configurations
cron_ls()
## remove a cron job based on id
cron_rm("test2")
```
**Question #6: [A]**
Imagine you are working with MODIS data, but are grabbing a large region (rather than a single pixel) and want to ensure that the data you are using is alwaysd up to date. However, the total size of the database is large and you don't want to completely delete and reinstall the database every day when only a small percentage of the data changes in any update.
* Write out the pseudocode/outline for how to keep the files up to date
* Write out what the cron table would look like to schedule this job (assume the update only needs to be done weekly)