Functions not handling accented chars properly #9
Comments
On the PC, the issue with function

Current:
...<snip code>...
vect <- strsplit(vect, "", fixed = TRUE)
vect <- cpp_get_char_ngrams(vect, numgram = numgram)
vect <- iconv(vect, to = "ASCII//TRANSLIT")

Updated:
...<snip code>...
vect <- iconv(vect, to = "ASCII//TRANSLIT")
vect <- strsplit(vect, "", fixed = TRUE)
vect <- cpp_get_char_ngrams(vect, numgram = numgram)
The purpose of using

vect <- c("César Moreira Nuñez", "cesar moreira nunez")
iconv(vect, to = "ASCII//TRANSLIT")
#> "C'esar Moreira Nu~nez" "cesar moreira nunez"

I could use

stringi::stri_trans_general(vect, "Latin-ASCII")
#> "Cesar Moreira Nunez" "cesar moreira nunez"
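The effect of transliterating before splitting into characters can be sketched roughly as follows. This is only an illustration under a few assumptions: it relies on stringi::stri_trans_general() rather than iconv(), the tolower() call is added here purely so the two spellings compare equal, and the package's own cpp_get_char_ngrams() step is left out.

```r
library(stringi)

# Unicode escapes keep the example independent of the script file's encoding.
vect <- c("C\u00e9sar Moreira Nu\u00f1ez", "cesar moreira nunez")

# Transliterate first, so accents are removed before any character splitting.
ascii <- stri_trans_general(vect, "Latin-ASCII")

# Assumption for this sketch: lowercase so the two spellings can be compared.
# Then split into single characters, as an n-gram step would consume them.
chars <- strsplit(tolower(ascii), "", fixed = TRUE)

# Both spellings should now yield the same character sequence.
same_chars <- identical(chars[[1]], chars[[2]])
print(same_chars)
```

Because the "Latin-ASCII" transform is applied by ICU rather than by the platform's iconv implementation, it should not depend on the locale differences between the Mac and the PC.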
Fixed in commit 3c0625b.
Testing this on a Mac and a PC and getting different results.
On the PC:
On the Mac:
The expected output for all four functions above is c("César Moreira Nuñez", "César Moreira Nuñez").
This issue is possibly related to issue #58 from the rOpenSci pkg tokenizers (and the reprex above was stolen from that issue).
Both the Mac and PC are running R v3.4.4, and here are the locale and encoding settings for each:
PC:
Mac: