Parenthesis inside the string: plantR::prepName #132

Open
@ggrittz

plantR::prepName cannot deal with cases such as "Sobrinho, J. de P.L. (no. 1441)"

> plantR::prepName('Sobrinho, J. de P.L. (no. 1441)')
Error in gsub(x, "", y, perl = TRUE) : 
  invalid regular expression ')|Sobrinho'
In addition: Warning message:
In gsub(x, "", y, perl = TRUE) : PCRE pattern compilation error
	'unmatched closing parenthesis'
	at ')|Sobrinho'
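
The failure can be reproduced without plantR. This is a minimal sketch, assuming the pattern quoted in the error message (`)|Sobrinho`) is what reaches `gsub` internally. The variable names are illustrative only.

```r
# Pattern copied from the error message; the unmatched ")" is not valid PCRE
bad_pattern <- ")|Sobrinho"

# tryCatch returns the condition object instead of stopping the script
res <- suppressWarnings(tryCatch(
  gsub(bad_pattern, "", "Sobrinho, J. de P.L.", perl = TRUE),
  error = function(e) e
))
failed <- inherits(res, "error")
print(failed)

# Escaping the parenthesis lets the same pattern compile
ok_pattern <- "\\)|Sobrinho"
compiled <- !inherits(
  tryCatch(gsub(ok_pattern, "", "Sobrinho, J. de P.L.", perl = TRUE),
           error = function(e) e),
  "error"
)
print(compiled)
```

So the stray `)` left over from the name string appears to be used as a regular expression without being escaped or removed first.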

This happens because the function only handles parentheses (or brackets) when they enclose the whole string, i.e., one at the beginning and one at the end, as in "(João Silva)":

Below are lines 11 to 18 of prepName

if (any(bracks)) 
    x[bracks] <- gsub("^\\[|\\]$|^\\(|\\)$", "", x[bracks], 
                      perl = TRUE)
  parent <- grepl("^\\(", x, perl = TRUE) & grepl("\\)$", x, 
                                                  perl = TRUE)
  if (any(parent)) 
    x[parent] <- gsub("^\\[|\\]$|^\\(|\\)$", "", x[parent], 
                      perl = TRUE)
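
To illustrate what the excerpt does, here is a small sketch that applies the same `parent` test and `gsub` call to two example strings. The vector `x` and the copy `stripped` are illustrative and not taken from plantR.

```r
x <- c("(João Silva)", "Sobrinho, J. de P.L. (no. 1441)")

# Same test as in the excerpt: TRUE only when "(" opens AND ")" closes the string
parent <- grepl("^\\(", x, perl = TRUE) & grepl("\\)$", x, perl = TRUE)

# Strip the wrapping brackets/parentheses only where the test is TRUE
stripped <- x
stripped[parent] <- gsub("^\\[|\\]$|^\\(|\\)$", "", x[parent], perl = TRUE)

print(parent)
print(stripped)
```

Only the first string passes the test, so the second one keeps its "(no. 1441)" and moves on to the later steps with the parentheses still in place.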

Cases such as "Sobrinho, J. de P.L. (no. 1441)" are not accounted for, so an error is returned. I've been thinking about how to solve this. Since the brackets and parentheses are put back at the end of the function, doing the same for cases like this one becomes too complicated. So I looked at thousands of such cases, and nearly all of them fall into one of these categories:

  1. some location, "(Parc National de Port-Cros)"
  2. some institute name, "(INFLOVAR (Association))"
  3. another name, "Franklin, M.A. (Ben)"
  4. a (potential) collector number, "Luetzelburg, P. von (no. 23045)"

Since collector numbers are not extracted from these columns (prepName is used on $recordedBy and $identifiedBy), I think that in cases such as "Sobrinho, J. de P.L. (no. 1441)" the parentheses and everything inside them could be removed. The function only preps a name, and the content of within-string parentheses is not used for anything else. Also, if only the parentheses themselves are removed, giving "Sobrinho, J. de P.L. no. 1441", the output gets messy and "No." is treated as the surname.

To remove within-string parentheses together with their contents, one option is:

x <- trimws(ifelse(
  grepl("(?<!^)\\(", x, perl = TRUE) | grepl("\\)(?!$)", x, perl = TRUE),
  gsub("\\([^)]*\\)", "", x),
  x
))

It will still keep cases such as "(João Silva)" as is.

The lines above could be added just before this step (line 14) in the prepName function:
parent <- grepl("^\\(", x, perl = TRUE) & grepl("\\)$", x, perl = TRUE)
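
As a sketch of how this proposal behaves, the same logic can be wrapped in a small function and tried on the example strings from this issue. The name `drop_inner_parens` is hypothetical and not part of plantR.

```r
# Hypothetical helper (not part of plantR): drops parenthesised text found
# inside a string, leaving fully wrapped names such as "(João Silva)" alone.
drop_inner_parens <- function(x) {
  # TRUE when an opening parenthesis is not the first character ...
  open_inside <- grepl("(?<!^)\\(", x, perl = TRUE)
  # ... or a closing parenthesis is not the last character
  close_inside <- grepl("\\)(?!$)", x, perl = TRUE)
  inner <- open_inside | close_inside
  # Remove each "(...)" group, then trim the whitespace left behind
  x[inner] <- gsub("\\([^)]*\\)", "", x[inner], perl = TRUE)
  trimws(x)
}

examples <- c("Sobrinho, J. de P.L. (no. 1441)",
              "Franklin, M.A. (Ben)",
              "(João Silva)")
cleaned <- drop_inner_parens(examples)
print(cleaned)
```

The first two strings lose their trailing parenthesised part, while the fully wrapped name is returned unchanged for the existing `parent` step to handle. One caveat: nested parentheses such as "(INFLOVAR (Association))" are not handled cleanly by this pattern, because the match stops at the first ")" and leaves a stray ")" behind, so that case would need separate treatment.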
