
Commit aa53b01

initiating git for data processing repo


44 files changed, +5131 -0 lines

DataProcessing.Rproj

Lines changed: 13 additions & 0 deletions
```
Version: 1.0

RestoreWorkspace: No
SaveWorkspace: No
AlwaysSaveHistory: Default

EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 5
Encoding: UTF-8

RnwWeave: Sweave
LaTeX: pdfLaTeX
```

notebooks/.DS_Store

6 KB binary file, not shown.

notebooks/json-parsing-chicago.Rmd

Lines changed: 108 additions & 0 deletions

---
title: "JSON-Parsing Survey Tasks: Chicago"
output: html_notebook
---

This code flattens the Chicago Wildlife Watch data.

```{r}
library(tidyjson)
library(magrittr)
library(jsonlite)
library(dplyr)
library(stringr)
library(tidyr)

chicago_unfiltered <- read.csv("../data/chicago-wildlife-watch-classifications.csv", stringsAsFactors = F)
```

First, we need to limit the classification data to the final workflow version and, if necessary, split by task. T0 is clearly the only task we care about in this dataset (though note the changed format of the current-site field).

```{r}
# check which workflow version we want:
chicago_unfiltered %>% summarise(., n_distinct(subject_ids), n_distinct(classification_id), n_distinct(workflow_version))

quick_check <- chicago_unfiltered %>%
  select(., subject_ids, classification_id, workflow_version, annotations) %>%
  as.tbl_json(json.column = "annotations") %>%
  gather_array(column.name = "task_index") %>% # really important for joining later
  spread_values(task = jstring("task"), task_label = jstring("task_label"), value = jstring("value")) %>%
  gather_keys() %>%
  append_values_string()

quick_check %>% data.frame %>% group_by(., workflow_version, key, task) %>% summarise(., classification_count = n()) %>% print
```

So filter to the appropriate workflow and get going! Let's take a quick peek at the data.

```{r}
chicago <- chicago_unfiltered %>% filter(., workflow_version == 397.41)
chicago$annotations[1] %>% prettify()
```
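
For orientation, a survey-task annotation here has roughly the shape sketched below. The field names match the spread_values() calls in the following chunks; the values (and the task_label) are invented for illustration, so the real prettify() output will differ in detail.

```{r}
# a hand-written sketch of the annotation shape; values are invented
'[{"task":"T0","task_label":"...","value":[{"choice":"RACCOON","answers":{"HWMN":"1"}}]}]' %>%
  prettify()
```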

```{r}
# preliminary flat
basic_flat_with_values <- chicago %>%
  select(., subject_ids, classification_id, workflow_version, annotations) %>%
  as.tbl_json(json.column = "annotations") %>%
  gather_array(column.name = "task_index") %>% # really important for joining later
  spread_values(task = jstring("task"), task_label = jstring("task_label"), value = jstring("value"))

basic_flat_with_values %>% data.frame %>% head

chicago_summary <- basic_flat_with_values %>%
  gather_keys() %>%
  append_values_string()

chicago_summary %>% data.frame %>% head # this has every classification ID; if value is empty, the field is null. There are multiple rows per classification when multiple tasks were completed.

chicago_summary %>% data.frame %>% group_by(., workflow_version, key, task) %>% summarise(., n())

# quick check the filtered original data
chicago %>% summarise(., n_distinct(subject_ids), n_distinct(classification_id), n_distinct(workflow_version))
```

Now dive into the first nested object, the species choice. Note that if you have task types you haven't filtered out, or if you have null objects, this step might break or silently drop rows.

```{r}
# grab choices; append embedded array values just for tracking.
# Note that this will break if any of the tasks are simple questions; you would need to split by task before this point.
chicago_choices <- basic_flat_with_values %>%
  enter_object("value") %>% json_lengths(column.name = "total_species") %>%
  gather_array(column.name = "species_index") %>% # each classification is an array, so you need to gather up multiple arrays
  spread_values(choice = jstring("choice"), answers = jstring("answers")) # append the answers as characters just in case

# if multiple species are identified, there will be multiple rows and species_index will be > 1
chicago_choices %>% data.frame %>% head
chicago_choices %>% group_by(., classification_id) %>% summarise(., count = n(), max(species_index)) %>% arrange(., -count)
```
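
If your export does mix task types, the safest guard is to split by task before entering the value object, just as the Michigan notebook does with its survey task. A minimal sketch (T0 is the survey task here):

```{r}
# a sketch: keep only the survey task before flattening choices,
# mirroring the filter(., task == "T3") step in the Michigan notebook
survey_only <- basic_flat_with_values %>% filter(., task == "T0")
```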

Now dive into the second nested object, the subquestions. Since these aren't arrays, it's okay if they're empty: the rows are still kept.
```{r}
# grab answers - this keeps rows even when there are no answers (the answers field is an object, not an array)
# Note that this last bit is the part that would need to be customized per team.
chicago_answers <- chicago_choices %>%
  enter_object("answers") %>%
  spread_values(how_many = jstring("HWMN"), wow = jstring("CLCKWWFTHSSNWSMPHT"), off_leash = jstring("CLCKSFDGSFFLSH"))

chicago_answers %>% data.frame %>% head
# chicago_answers %>% group_by(classification_id) %>% summarise(., n())
```

Put everything back together, which is important if you've dropped rows because of empty arrays and the like.
```{r}
# in theory, you want to tie all of these back together just in case there are missing values
add_choices <- left_join(basic_flat_with_values, chicago_choices)
tot <- left_join(add_choices, chicago_answers)
flat_data <- tot %>% select(., -task_index, -task_label, -value, -answers)

flat_data %>% data.frame %>% head
```

Here's your file out!
```{r}
write.csv(flat_data, file = "../data/chicago-flattened.csv")
```

notebooks/json-parsing-chicago.nb.html

Lines changed: 439 additions & 0 deletions
Large diffs are not rendered by default.

notebooks/json-parsing-examples.Rmd

Lines changed: 114 additions & 0 deletions

---
title: "R Notebook"
output: html_notebook
---

```{r}
library(tidyjson)
library(magrittr)
library(jsonlite)
library(dplyr)
```

# JSON Parsing

Each classification is an array. Depending on the workflow and how it has changed over time, classification arrays may vary in structure within a single project. Empty arrays also seem to be problematic. Depending on the type of project, you probably want to split the data by workflow, and even limit the workflow version, prior to flattening.

---

#### Load example data
```{r load example data}
sas <- read.csv("../data/questions-SAS-1000.csv", stringsAsFactors = F)
kitteh <- read.csv("../data/kitteh-zoo-classifications.csv", stringsAsFactors = F)
wilde <- read.csv("../data/points-wildebeest.csv", stringsAsFactors = F)
chicago <- read.csv("../data/chicago-wildlife-watch-classifications.csv", stringsAsFactors = F)
```

#### Simple Yes or No Questions

```{r display example annotation formats}
sas$annotations[1] %>% prettify
```

#### Simple Point Marking
```{r}
wilde$annotations[2] %>% prettify()
```

#### Combination Question and Marking

Note that the format of the value array varies by task.
```{r}
kitteh$annotations[1] %>% prettify
```

# Flattening the Files

It's much easier to parse/flatten the JSON when everything is in a standard format, so you probably want to split your raw file by workflow and even by task ID. You also want to limit to only the workflow version(s) with actual data, because earlier versions, especially those with empty data, may structure the classification data differently, which is annoying and problematic.

Note: you may need to dig into your raw data a bit to identify which workflow and version you need. Some projects have many workflows and versions; others not so many.

```{r workflow_fun_definition}
fun_check_workflow <- function(data){
  data %>% group_by(workflow_id, workflow_version) %>%
    summarise(date = max(created_at), count = n()) %>%
    print
}
```

For example, these are the Snapshots at Sea classifications by workflow:

```{r}
sas %>% fun_check_workflow()
```

vs. those of the Wildebeest Marking Project:
```{r}
wilde %>% fun_check_workflow()
```

vs. Chicago Wildlife Watch:
```{r}
chicago %>% fun_check_workflow()
```

## Basic Flattening

With jsonlite, you can flatten all of the JSON data into a series of nested lists. This works really well for simple data, like questions, but marking tasks and more complex workflows get a bit complicated.

```{r flattening}
library(jsonlite)

# Basic flattening function: parse each row's annotation JSON, then bind everything together
basic_flattening <- function(jdata) {
  out <- list() # create a list to hold everything

  for (i in seq_len(nrow(jdata))) { # loop through one row of the dataset at a time
    classification_id <- jdata$classification_id[i]
    subject_id <- jdata$subject_ids[i]
    split_anno <- fromJSON(txt = jdata$annotations[i], simplifyDataFrame = T)
    out[[i]] <- cbind(classification_id, subject_id, split_anno)
  }

  # rbind will fail if the per-row annotation structures differ,
  # which is one more reason to filter to a single workflow version first
  do.call(what = rbind, args = out)
}
```

Single questions flatten alright:
```{r flatten sas}
flat_sas <- sas %>% basic_flattening()
str(flat_sas)
```

But more complex questions produce embedded lists inside the "value" column.

```{r}
flat_wilde <- wilde[1:10,] %>% basic_flattening()
str(flat_wilde, max.level = 2)
```

```{r}
flat_kitteh <- kitteh %>% basic_flattening()
str(flat_kitteh, max.level = 3)
```
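
Where basic flattening leaves list-columns like these, the tidyjson pipeline used in the other notebooks unpacks the nesting explicitly instead. A minimal sketch on the wildebeest markings: x, y, and tool are assumed field names for a point-marking task, not confirmed against this export.

```{r}
# a sketch: flatten point marks with tidyjson instead of jsonlite.
# x, y, and tool are assumed field names for a point-marking task.
wilde_points <- wilde[1:10,] %>%
  select(., subject_ids, classification_id, annotations) %>%
  as.tbl_json(json.column = "annotations") %>%
  gather_array(column.name = "task_index") %>%
  spread_values(task = jstring("task")) %>%
  enter_object("value") %>%
  gather_array(column.name = "mark_index") %>% # one row per mark
  spread_values(x = jnumber("x"), y = jnumber("y"), tool = jnumber("tool"))

wilde_points %>% data.frame %>% head
```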

notebooks/json-parsing-examples.nb.html

Lines changed: 501 additions & 0 deletions
Large diffs are not rendered by default.

notebooks/json-parsing-michigan.Rmd

Lines changed: 130 additions & 0 deletions

---
title: "JSON parsing: survey tasks with multiple-choice subquestions"
output: html_notebook
---

This project has two tasks in its workflow: a survey task and a follow-up question task asking about the weather. The survey task also has subquestions that ask the volunteer to select all that apply, meaning we have an extra step to flatten out the annotations.

```{r}
library(tidyjson)
library(magrittr)
library(jsonlite)
library(dplyr)
library(stringr)
library(tidyr)
library(lubridate)
```
```{r}
jdata_unfiltered <- read.csv(file = "../data/michigan-zoomin-classifications.csv", stringsAsFactors = F)

# you'd probably need to include multiple versions (as these likely differ only in minor text changes), but for this demo we'll choose 463.55
jdata_unfiltered %>% mutate(., created_at = ymd_hms(created_at)) %>%
  group_by(., workflow_id, workflow_version) %>% summarise(., max(created_at), n()) %>% head

jdata <- jdata_unfiltered %>% filter(., workflow_version == 463.55) %>% head(., n = 5000)
jdata %>% summarise(., n_distinct(subject_ids), n_distinct(classification_id), n_distinct(workflow_version))
```

Take a peek at the data structure. There are two tasks, and within the survey task, only some species have subquestions.
```{r}
############### SURVEY TASK
head(jdata)
for (i in 15:17) {
  jdata$annotations[i] %>% prettify %>% print
}
```
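
For reference, an annotation containing both tasks has roughly the shape sketched below. Field names come from the filter(), spread_values(), and enter_object() calls in the following chunks, while the values (including the T2 weather answer) are invented for illustration. The key difference from the Chicago data is that WHATISTHEANIMALSDOING holds an array, since volunteers can select several behaviors.

```{r}
# a hand-written sketch of the two-task annotation shape; values are invented
'[{"task":"T3","value":[{"choice":"DEER","answers":{"HOWMANYANIMALSDOYOUSEE":"1","WHATISTHEANIMALSDOING":["MOVING","EATING"]}}]},{"task":"T2","value":"sunny"}]' %>%
  prettify()
```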

```{r}
# preliminary flat
basic_flat_with_values <- jdata %>%
  select(., subject_ids, classification_id, workflow_version, annotations) %>%
  as.tbl_json(json.column = "annotations") %>%
  gather_array(column.name = "task_index") %>% # really important for joining later
  spread_values(task = jstring("task"), task_label = jstring("task_label"), value = jstring("value"))

basic_flat_with_values %>% data.frame %>% head

# summarise keys per task, as in the Chicago notebook
basic_summary <- basic_flat_with_values %>%
  gather_keys() %>%
  append_values_string()

basic_summary %>% data.frame %>% group_by(., workflow_version, key, task) %>% summarise(., n())
```

```{r}
#--------------------------------------------------------------------------------#
# split into survey vs. non-survey data frames. The question task is already flat and can be exported as a separate file now.
survey <- basic_flat_with_values %>% filter(., task == "T3")
question <- basic_flat_with_values %>% filter(., task == "T2")

###----------------------------### SURVEY FLATTENING ###----------------------------###

# grab choices; species_index counts through the species recorded in a given classification (usually maxes out at 2)
with_choices <- survey %>%
  enter_object("value") %>% json_lengths(column.name = "total_species") %>%
  gather_array(column.name = "species_index") %>% # each classification is an array, so you need to gather up multiple arrays
  spread_values(choice = jstring("choice"))

# if multiple species are identified, there will be multiple rows and species_index will be > 1
with_choices %>% data.frame %>% head
with_choices %>% summarise(., n_distinct(subject_ids), n_distinct(classification_id))
```

Let's start grabbing and flattening the nested data. Note that this section references the specific subquestion labels, so if they change over the life of your project, you MUST create a script to handle the revisions.
```{r}
# grab answers. Note that the spread_values() call needs to be customized per team and subquestion label.
with_answers <- with_choices %>%
  enter_object("answers") %>%
  spread_values(how_many = jstring("HOWMANYANIMALSDOYOUSEE")) %>%
  enter_object("WHATISTHEANIMALSDOING") %>% # enter into the list of behaviors
  gather_array("behavior_index") %>% # gather into one behavior per row
  append_values_string("behavior")

# note that behaviors come out in a "long" format, one row per behavior, which is probably unwieldy
with_answers %>% data.frame %>% head
```
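
If the labels were revised mid-project, one workable approach is to gather the answers by key and recode retired labels onto current ones before spreading. A sketch for the scalar subquestion, where the retired label "HOWMANYANIMALSCANYOUSEE" is invented for illustration:

```{r}
# a sketch for harmonizing revised subquestion labels; the old label is invented
answers_by_key <- with_choices %>%
  enter_object("answers") %>%
  gather_keys() %>% # one row per subquestion label
  append_values_string("answer") %>% # scalar answers only; array answers come out NA
  data.frame %>%
  mutate(., key = recode(key, "HOWMANYANIMALSCANYOUSEE" = "HOWMANYANIMALSDOYOUSEE"))
```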

Let's spread the answers into individual columns with 1/0 indicators for whether or not each behavior was identified.
```{r}
# spread answers into separate columns: you have to drop behavior_index or else the rows won't combine!
with_answers_spread <- with_answers %>% data.frame %>%
  select(., -behavior_index) %>%
  mutate(., behavior_present = 1) %>%
  spread(., key = behavior, value = behavior_present, fill = 0)

with_answers_spread %>% data.frame %>% head
with_answers_spread %>% summarise(., n_distinct(subject_ids), n_distinct(classification_id))
```

You could also, in theory, create a column containing an actual list of the behaviors. Note that the values look similar to how tidyjson displays them, but they are actual lists rather than character strings that say "list(...)".
```{r}
# spread answers into a list-column
test <- with_answers %>% data.frame %>%
  select(., -behavior_index) %>% nest(behavior)

test %>% head
```

```{r}
# in theory, you want to tie all of these back together just in case there are missing values
add_choices <- left_join(survey, with_choices)
tot <- left_join(add_choices, with_answers_spread)
flat_data <- tot %>% select(., -task_index, -task_label, -value)

flat_data %>% data.frame %>% head
```

```{r}
# check that the number of distinct subject IDs and classification IDs is still the same
flat_data %>% summarise(., n_distinct(subject_ids), n_distinct(classification_id), n()) # flattened
jdata %>% summarise(., n_distinct(subject_ids), n_distinct(classification_id), n()) # original

# save your files for aggregation!
write.csv(flat_data, file = "../data/T3-flattened.csv")
write.csv(question, file = "../data/T2-flattened.csv")
```

notebooks/json-parsing-michigan.nb.html

Lines changed: 535 additions & 0 deletions
Large diffs are not rendered by default.
