---
title: Strategies for programmatic name cleaning
author: Scott Chamberlain
date: "2020-09-16"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Strategies for programmatic name cleaning}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
`taxize` offers interactive prompts when using `get_*()` functions (e.g., `get_tsn()`).
These prompts make it easy, in interactive use, to choose among results when more
than one match is found.
However, to make your code reproducible you don't want interactive prompts.
This vignette covers some options for programmatic name cleaning.
```r
library("taxize")
```
## get_* functions
When using `get_*()` functions programmatically, you have a few options.
### rows parameter
Normally, if you get more than one result, you get a prompt asking you
to select which taxon you want.
```r
get_tsn("Quercus b")
#> tsn target commonnames nameusage
#> 1 19298 Quercus beebiana not accepted
#> 2 507263 Quercus berberidifolia scrub oak accepted
#> 3 19300 Quercus bicolor swamp white oak accepted
#> 4 19303 Quercus borealis not accepted
#> 5 195131 Quercus borealis var. maxima not accepted
#> 6 195166 Quercus boyntonii Boynton's sand post oak accepted
#> 7 506533 Quercus brantii Brant's oak accepted
#> 8 195150 Quercus breviloba not accepted
#> 9 195099 Quercus breweri not accepted
#> 10 195168 Quercus buckleyi Texas oak accepted
#>
#> More than one TSN found for taxon 'Quercus b'!
#>
#> Enter rownumber of taxon (other inputs will return 'NA'):
#>
#> 1:
```
Instead, we can use the `rows` parameter to specify which records we want
by row number only (not by name). Here, we want the first 3 records:
```r
get_tsn('Quercus b', rows = 1:3)
#> tsn target commonnames nameusage
#> 1 19298 Quercus beebiana not accepted
#> 2 19300 Quercus bicolor swamp white oak accepted
#> 3 19303 Quercus borealis not accepted
#>
#> More than one TSN found for taxon 'Quercus b'!
#>
#> Enter rownumber of taxon (other inputs will return 'NA'):
#>
#> 1:
```
However, you still get a prompt as there is more than one result.
Thus, for full programmatic usage, you can specify a single row, if you happen
to know which one you want:
```r
get_tsn('Quercus b', rows = 3)
#> ══ 1 queries ═══════════════
#> ✔ Found: Quercus b
#> ══ Results ═════════════════
#>
#> ● Total: 1
#> ● Found: 1
#> ● Not Found: 0
#> [1] "19303"
#> attr(,"class")
#> [1] "tsn"
#> attr(,"match")
#> [1] "found"
#> attr(,"multiple_matches")
#> [1] TRUE
#> attr(,"pattern_match")
#> [1] FALSE
#> attr(,"uri")
#> [1] "https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=19303"
```
In reality it is unlikely you'll know which row you want, unless perhaps you
just want one result from each query, regardless of what it is.
### underscore methods
A better fit for programmatic use is the underscore methods. Each `get_*()` function
has a sister method with a trailing underscore, e.g., `get_tsn()` and `get_tsn_()`.
```r
get_tsn_("Quercus b")
#> $`Quercus b`
#> # A tibble: 5 x 4
#> tsn scientificName commonNames nameUsage
#> <chr> <chr> <chr> <chr>
#> 1 19300 Quercus bicolor swamp white oak,chêne bicolore accepted
#> 2 195166 Quercus boyntonii Boynton's sand post oak,Boynton's oak accepted
#> 3 195168 Quercus buckleyi Texas oak,Buckley's oak accepted
#> 4 506533 Quercus brantii Brant's oak accepted
#> 5 507263 Quercus berberidifolia scrub oak accepted
```
The result is a single data.frame for each taxon queried, which can be
processed downstream with whatever logic is required in your workflow.
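As one possible downstream step, the sketch below keeps only accepted names and takes the first remaining TSN for each query. It uses base R only and a hand-built list that mimics the shape of the `get_tsn_()` output shown above (values copied from that output), so it runs without a network connection. `pick_accepted()` is a hypothetical helper, not part of `taxize`.

```r
# Toy stand-in for the list returned by get_tsn_("Quercus b");
# values are copied from the output above.
res <- list(
  "Quercus b" = data.frame(
    tsn = c("19300", "195166", "195168", "506533", "507263"),
    scientificName = c("Quercus bicolor", "Quercus boyntonii",
                       "Quercus buckleyi", "Quercus brantii",
                       "Quercus berberidifolia"),
    nameUsage = rep("accepted", 5),
    stringsAsFactors = FALSE
  )
)

# Hypothetical helper: for each queried name, keep accepted records and
# return the first TSN, or NA when nothing qualifies.
pick_accepted <- function(x) {
  vapply(x, function(df) {
    if (is.null(df) || nrow(df) == 0) return(NA_character_)
    ok <- df[df$nameUsage == "accepted", , drop = FALSE]
    if (nrow(ok) == 0) NA_character_ else ok$tsn[1]
  }, character(1))
}

ids <- pick_accepted(res)
ids
```

The selection rule here (first accepted record) is arbitrary; swap in whatever logic your workflow needs.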
You can also combine the `rows` parameter with underscore functions, as a single
number or a range of numbers:
```r
get_tsn_("Quercus b", rows = 1)
#> $`Quercus b`
#> # A tibble: 1 x 4
#> tsn scientificName commonNames nameUsage
#> <chr> <chr> <chr> <chr>
#> 1 19300 Quercus bicolor swamp white oak,chêne bicolore accepted
```
```r
get_tsn_("Quercus b", rows = 1:2)
#> $`Quercus b`
#> # A tibble: 2 x 4
#> tsn scientificName commonNames nameUsage
#> <chr> <chr> <chr> <chr>
#> 1 19300 Quercus bicolor swamp white oak,chêne bicolore accepted
#> 2 195166 Quercus boyntonii Boynton's sand post oak,Boynton's oak accepted
```
## as.* methods
All `get_*()` functions have associated `as.*()` functions (e.g., `get_tsn()` and `as.tsn()`).
Many `taxize` functions use taxonomic identifier classes (S3 objects) that are the output
of `get_*()` functions. `as.*()` methods make it easy to make the required S3 taxonomic
identifier classes if you already know the identifier. For example:
If the input is already a `tsn` object, it is returned unchanged:
```r
as.tsn(get_tsn("Quercus douglasii"))
#> ══ 1 queries ═══════════════
#> ✔ Found: Quercus douglasii
#> ══ Results ═════════════════
#>
#> ● Total: 1
#> ● Found: 1
#> ● Not Found: 0
#> [1] "19322"
#> attr(,"class")
#> [1] "tsn"
#> attr(,"match")
#> [1] "found"
#> attr(,"multiple_matches")
#> [1] FALSE
#> attr(,"pattern_match")
#> [1] FALSE
#> attr(,"uri")
#> [1] "https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=19322"
```
Numeric input:
```r
as.tsn(c(19322, 129313, 506198))
#> [1] "19322" "129313" "506198"
#> attr(,"class")
#> [1] "tsn"
#> attr(,"match")
#> [1] "found" "found" "found"
#> attr(,"multiple_matches")
#> [1] FALSE FALSE FALSE
#> attr(,"pattern_match")
#> [1] FALSE FALSE FALSE
#> attr(,"uri")
#> [1] "https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=19322"
#> [2] "https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=129313"
#> [3] "https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=506198"
```
You can do the same for character or list inputs, depending on the data source.
The above `as.tsn()` examples have the parameter `check = TRUE`, meaning we ping the
data source web service to make sure the identifier exists. You can skip that check
if you like by setting `check = FALSE`, and the result is returned much faster:
```r
as.tsn(c("19322","129313","506198"), check = FALSE)
#> [1] "19322" "129313" "506198"
#> attr(,"class")
#> [1] "tsn"
#> attr(,"match")
#> [1] "found" "found" "found"
#> attr(,"multiple_matches")
#> [1] FALSE FALSE FALSE
#> attr(,"pattern_match")
#> [1] FALSE FALSE FALSE
#> attr(,"uri")
#> [1] "https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=19322"
#> [2] "https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=129313"
#> [3] "https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=506198"
```
With the output of `as.*()` methods, you can then proceed with other `taxize` functions.
## gnr_resolve
Some functions in `taxize` are meant specifically for name cleaning. One of those
is `gnr_resolve()`.
`gnr_resolve()` doesn't prompt the way `get_*()` functions do; instead it
returns a data.frame. So we don't face the same problem, and can use `gnr_resolve()`
in a programmatic workflow straight away.
```r
spp <- names_list(rank = "species", size = 10)
gnr_resolve(spp, preferred_data_sources = 11)
#> # A tibble: 13 x 5
#> user_supplied_na… submitted_name matched_name data_source_tit… score
#> * <chr> <chr> <chr> <chr> <dbl>
#> 1 Astragalus radka… Astragalus radk… Astragalus radkane… GBIF Backbone T… 0.988
#> 2 Montanoa gigas Montanoa gigas Montanoa gigas Rze… GBIF Backbone T… 0.988
#> 3 Serratula semise… Serratula semis… Serratula semiserr… GBIF Backbone T… 0.988
#> 4 Serratula semise… Serratula semis… Serratula semiserr… GBIF Backbone T… 0.988
#> 5 Delosperma pagea… Delosperma page… Delosperma pageanu… GBIF Backbone T… 0.988
#> 6 Delosperma pagea… Delosperma page… Delosperma pageanu… GBIF Backbone T… 0.988
#> 7 Zieria hydroscop… Zieria hydrosco… Zieria hydroscopic… GBIF Backbone T… 0.988
#> 8 Baccharis flabel… Baccharis flabe… Baccharis flabella… GBIF Backbone T… 0.988
#> 9 Piper gonocarpum Piper gonocarpum Piper gonocarpum T… GBIF Backbone T… 0.988
#> 10 Lathraea japonica Lathraea japoni… Lathraea japonica … GBIF Backbone T… 0.988
#> 11 Lathraea japonica Lathraea japoni… Lathraea japonica … GBIF Backbone T… 0.988
#> 12 Lathraea japonica Lathraea japoni… Lathraea japonica … GBIF Backbone T… 0.988
#> 13 Verbesina tachir… Verbesina tachi… Verbesina tachiren… GBIF Backbone T… 0.988
```
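`gnr_resolve()` can return several rows per input name, as with rows 10 to 12 above. Below is a minimal sketch of one way to reduce such a result to a single best-scoring match per input. It assumes the truncated first column is named `user_supplied_name`, alongside `matched_name` and `score`. The data frame is made-up toy data, and `best_match()` is a hypothetical helper, not a `taxize` function.

```r
# Made-up toy data in the shape of a gnr_resolve() result
# (names and scores are placeholders, not real matches).
resolved <- data.frame(
  user_supplied_name = c("Name one", "Name one", "Name two"),
  matched_name = c("Name one Auth.", "Name one (Auth.) Other", "Name two Auth."),
  score = c(0.75, 0.988, 0.988),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the highest-scoring row per input name
# (the earliest row wins on ties).
best_match <- function(df) {
  ord <- df[order(df$user_supplied_name, -df$score), , drop = FALSE]
  out <- ord[!duplicated(ord$user_supplied_name), , drop = FALSE]
  rownames(out) <- NULL
  out
}

best <- best_match(resolved)
best
```

Whether the top score is the right criterion depends on your data; you may prefer to flag inputs with multiple matches for manual review instead.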
## Other functions
Some other functions in `taxize` use `get_*()` functions internally (e.g., `classification()`),
but you can generally pass parameters through to those internal `get_*()` calls.
## Feedback?
Let us know if you have ideas for better ways to do programmatic name cleaning at
https://github.com/ropensci/taxize/issues or https://discuss.ropensci.org/ !