Multibytes colnames in non-native encoding cause errors in complex j expression with keyby #3722

Open
shrektan opened this issue Jul 23, 2019 · 10 comments
Labels: bug, encoding (issues related to Encoding), platform-specific

Comments

@shrektan
Member

shrektan commented Jul 23, 2019

The simple example below illustrates the issue. I believe it is specific to Windows.

library(data.table)

tbl <- data.table(
  汉语 = 1,
  中文 = 2
)
Encoding(colnames(tbl))
# [1] "unknown" "unknown"
tbl[, .(a = sum(汉语)), keyby = 中文]
#     中文 a
#  1:    2 1
tbl[, .(a = sum(sort(汉语))), keyby = 中文]
#     中文 a
#  1:    2 1

setnames(tbl, colnames(tbl), enc2utf8(colnames(tbl)))
Encoding(colnames(tbl))
# [1] "UTF-8" "UTF-8"
tbl[, .(a = sum(汉语)), keyby = 中文]
#     中文 a
#  1:    2 1
tbl[, .(a = sum(sort(汉语))), keyby = 中文]
# Error in sort(汉语) : object '汉语' not found
Session info

> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] data.table_1.12.3 usethis_1.4.0

loaded via a namespace (and not attached):
[1] compiler_3.4.4 tools_3.4.4 fs_1.2.7 glue_1.3.1 yaml_2.2.0 Rcpp_1.0.1

@shrektan shrektan self-assigned this Jul 23, 2019
@shrektan shrektan changed the title [BUG] UTF-8 ColName causes error in complex j expression with keyby UTF-8 Colnames cause error in complex j expression with keyby Jul 23, 2019
@MichaelChirico
Member

Works on my Mac and Linux, I guess it's a Windows thing?

@shrektan
Member Author

shrektan commented Aug 4, 2019

An example that reproduces the error on Mac or Linux, using the latin-1 encoding:

library(data.table)
utf8 = c("\u00e7ile", "\u00de")
latin1 = iconv(utf8, from = "UTF-8", to = "latin1")
tbl <- as.data.table(setNames(list(1, 2), utf8))
tbl[]
#>    çile Þ
#> 1:    1 2
Encoding(colnames(tbl))
#> [1] "UTF-8" "UTF-8"
tbl[, .(a = sum(`çile`)), keyby = `Þ`]
#>    Þ a
#> 1: 2 1
tbl[, .(a = sum(sort(`çile`))), keyby = `Þ`]
#>    Þ a
#> 1: 2 1
setnames(tbl, colnames(tbl), latin1)
Encoding(colnames(tbl))
#> [1] "latin1" "latin1"
tbl[, .(a = sum(`çile`)), keyby = `Þ`]
#>    Þ a
#> 1: 2 1
tbl[, .(a = sum(sort(`çile`))), keyby = `Þ`]
#> Error in sort(çile): object 'çile' not found

Created on 2019-08-04 by the reprex package (v0.2.1)

@shrektan shrektan changed the title UTF-8 Colnames cause error in complex j expression with keyby Multibytes colnames in non-native encoding cause errors in complex j expression with keyby Aug 4, 2019
@shrektan
Member Author

shrektan commented Aug 4, 2019

The error is thrown from

PROTECT(jval = eval(jexp, env));

I don't know how to debug C code that involves R's language or environment types, and I failed to find them covered in R-internals or R-ext. If anybody can show me how to print out the information related to those objects, it would be very much appreciated.

@jangorecki
Member

@shrektan maybe just modify jexp that you are passing from R to include debug cat calls? then those will be printed during eval invoked from C.

@MichaelChirico
Member

Possibly related (though I think not): #1726

@MichaelChirico
Member

The difference between these two is GForce:

tbl[, .(a = sum(汉语)), keyby = 中文] # GForced
tbl[, .(a = sum(sort(汉语))), keyby = 中文] # not

Are you sure keyby is part of the issue? I would be surprised
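
One way to check the GForce hypothesis is to compare the same query with and without data.table's internal optimization. The sketch below is only an illustration: it uses ASCII column names so that it runs in any locale, and `with_optimize` is a hypothetical helper, not part of data.table. It relies on the `verbose` argument of `[.data.table` and on the `datatable.optimize` option.

```r
library(data.table)

# ASCII column names, so that only the optimization level varies here.
tbl <- data.table(x = c(3, 1, 2), g = c(1, 1, 2))

# verbose = TRUE makes data.table report which optimizations it applied to j.
tbl[, .(a = sum(x)), keyby = g, verbose = TRUE]
tbl[, .(a = sum(sort(x))), keyby = g, verbose = TRUE]

# Hypothetical helper: call f() under a given datatable.optimize level and
# restore the previous option value afterwards.
with_optimize <- function(level, f) {
  old <- options(datatable.optimize = level)
  on.exit(options(old), add = TRUE)
  f()
}

res_opt   <- with_optimize(Inf, function() tbl[, .(a = sum(x)), keyby = g])
res_noopt <- with_optimize(0L,  function() tbl[, .(a = sum(x)), keyby = g])
```

If GForce really is the differentiator, one would expect that repeating this with the non-ASCII column names makes even the plain sum() query fail once optimization is switched off. Replacing keyby with by in the same way would show whether keyby matters at all.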

@MichaelChirico
Member

The Mac/Linux example also doesn't reproduce the error for me (on Mac).

@shrektan along Jan's suggestion, try running this?

tbl[, .(a = {print(ls()); print(names(.SD)); print(.BY); sum(汉语)}), by = 中文]

@shrektan
Member Author

shrektan commented Sep 5, 2019

@MichaelChirico

So you are saying you can't reproduce the code below on macOS?

library(data.table)
utf8 = c("\u00e7ile", "\u00de")
latin1 = iconv(utf8, from = "UTF-8", to = "latin1")
tbl <- as.data.table(setNames(list(1, 2), latin1))
tbl[, .(a = sum(`çile`)), keyby = `Þ`]
#>    Þ a
#> 1: 2 1
tbl[, .(a = sum(sort(`çile`))), keyby = `Þ`]
#> Error in sort(çile): object 'çile' not found
tbl[, .(a = {print(ls()); print(names(.SD)); print(Encoding(names(.SD))); print(.BY);  sum(sort(`çile`))}), keyby = `Þ`]
#> [1] "\xe7ile"   "Cfastmean" "print"     "strptime"  "Þ"        
#> [1] "çile"
#> [1] "latin1"
#> $Þ
#> [1] 2
#> Error in sort(çile): object 'çile' not found

Created on 2019-09-06 by the reprex package (v0.3.0)

Session info
devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.6.1 (2019-07-05)
#>  os       macOS Mojave 10.14.5        
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       Asia/Shanghai               
#>  date     2019-09-06                  
#> 
#> ─ Packages ──────────────────────────────────────────────────────────────
#>  package     * version date       lib source        
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.6.0)
#>  backports     1.1.4   2019-04-10 [1] CRAN (R 3.6.0)
#>  callr         3.3.1   2019-07-18 [1] CRAN (R 3.6.0)
#>  cli           1.1.0   2019-03-19 [1] CRAN (R 3.6.0)
#>  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.6.0)
#>  data.table  * 1.12.3  2019-08-24 [1] local         
#>  desc          1.2.0   2018-05-01 [1] CRAN (R 3.6.0)
#>  devtools      2.1.0   2019-07-06 [1] CRAN (R 3.6.0)
#>  digest        0.6.20  2019-07-04 [1] CRAN (R 3.6.0)
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 3.6.0)
#>  fs            1.3.1   2019-05-06 [1] CRAN (R 3.6.0)
#>  glue          1.3.1   2019-03-12 [1] CRAN (R 3.6.0)
#>  highr         0.8     2019-03-20 [1] CRAN (R 3.6.0)
#>  htmltools     0.3.6   2017-04-28 [1] CRAN (R 3.6.0)
#>  knitr         1.24    2019-08-08 [1] CRAN (R 3.6.0)
#>  magrittr      1.5     2014-11-22 [1] CRAN (R 3.6.0)
#>  memoise       1.1.0   2017-04-21 [1] CRAN (R 3.6.0)
#>  pkgbuild      1.0.4   2019-08-05 [1] CRAN (R 3.6.0)
#>  pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.6.0)
#>  prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.6.0)
#>  processx      3.4.1   2019-07-18 [1] CRAN (R 3.6.0)
#>  ps            1.3.0   2018-12-21 [1] CRAN (R 3.6.0)
#>  R6            2.4.0   2019-02-14 [1] CRAN (R 3.6.0)
#>  Rcpp          1.0.2   2019-07-25 [1] CRAN (R 3.6.0)
#>  remotes       2.1.0   2019-06-24 [1] CRAN (R 3.6.0)
#>  rlang         0.4.0   2019-06-25 [1] CRAN (R 3.6.0)
#>  rmarkdown     1.15    2019-08-21 [1] CRAN (R 3.6.0)
#>  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.6.0)
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.6.0)
#>  stringi       1.4.3   2019-03-12 [1] CRAN (R 3.6.0)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 3.6.0)
#>  testthat      2.2.1   2019-07-25 [1] CRAN (R 3.6.0)
#>  usethis       1.5.1   2019-07-04 [1] CRAN (R 3.6.0)
#>  withr         2.1.2   2018-03-15 [1] CRAN (R 3.6.0)
#>  xfun          0.9     2019-08-21 [1] CRAN (R 3.6.0)
#>  yaml          2.2.0   2018-07-25 [1] CRAN (R 3.6.0)
#> 
#> [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library

@shrektan
Member Author

shrektan commented Sep 5, 2019

I see: the first ls() element, "\xe7ile", is the latin-1 representation of the UTF-8 string "çile" (iconv("\xe7ile", 'latin1', 'UTF-8') returns "çile").
So when the expression is evaluated in that environment, R complains that the latter can't be found.

I think R always parses the symbols of a quoted expression into the parse tree in the native encoding (see the code below). If a column name of the data.table is not in the native encoding and the expression is evaluated via Rf_eval(), the error occurs.

env <- new.env(parent = emptyenv())
utf8 = c("\u00e7ile", "\u00de")
latin1 = iconv(utf8, from = "UTF-8", to = "latin1")
assign(latin1[1], 1, pos = env)
ls(env)
#> [1] "çile"
Encoding(ls(env))
#> [1] "unknown"
assign(utf8[1], 1, pos = env)
ls(env)
#> [1] "çile"
Encoding(ls(env))
#> [1] "unknown"

Created on 2019-09-06 by the reprex package (v0.3.0)

In conclusion, the fix should be this: when preparing the env object, convert the names of the variables to native-encoded strings first...
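
The mismatch described above can be sketched in base R. The two spellings of the name compare equal as text, because R translates marked encodings before comparing, yet their bytes differ, which is presumably what matters when a symbol is looked up by name. `to_native_names` is a hypothetical helper written only for this illustration. It is not the actual fix and not part of data.table.

```r
utf8   <- c("\u00e7ile", "\u00de")
latin1 <- iconv(utf8, from = "UTF-8", to = "latin1")

# Same text: identical() treats strings in different marked encodings as
# equal when they agree after translation to UTF-8.
same_text <- identical(latin1, utf8)

# Different bytes: latin-1 stores the first letter in one byte, UTF-8 in two.
bytes_latin1 <- charToRaw(latin1[1])   # e7 69 6c 65
bytes_utf8   <- charToRaw(utf8[1])     # c3 a7 69 6c 65
same_bytes   <- identical(bytes_latin1, bytes_utf8)

# Hypothetical helper: re-encode the names of a list-like object to the
# session's native encoding, as the proposed fix would do for the env.
to_native_names <- function(x) {
  names(x) <- enc2native(names(x))
  x
}

cols <- to_native_names(setNames(list(1, 2), latin1))
Encoding(names(cols))   # result depends on the session locale
```

For a data.table, the same idea would follow the pattern already used in this thread, setnames(tbl, colnames(tbl), enc2native(colnames(tbl))). Going by the examples above, where natively encoded names work, this might serve as a user-side workaround, though that is an assumption and has not been verified here.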

@MichaelChirico
Member

Indeed I do get the error! Dunno what was happening last time.

@shrektan here is what I guess are the guilty lines with assign, can you take it from here?

for (ii in ansvars) assign(ii, SDenv$.SDall[[ii]], SDenv)
