Multibytes colnames in non-native encoding cause errors in complex j expression with keyby #3722

shrektan · 2019-07-23T03:40:52Z

The simple example below illustrates the issue. I believe it is specific to Windows.

library(data.table)

tbl <- data.table(
  汉语 = 1,
  中文 = 2
)
Encoding(colnames(tbl))
# [1] "unknown" "unknown"
tbl[, .(a = sum(汉语)), keyby = 中文]
#     中文 a
#  1:    2 1
tbl[, .(a = sum(sort(汉语))), keyby = 中文]
#     中文 a
#  1:    2 1

setnames(tbl, colnames(tbl), enc2utf8(colnames(tbl)))
Encoding(colnames(tbl))
# [1] "UTF-8" "UTF-8"
tbl[, .(a = sum(汉语)), keyby = 中文]
#     中文 a
#  1:    2 1
tbl[, .(a = sum(sort(汉语))), keyby = 中文]
# Error in sort(汉语) : object '汉语' not found

session info

> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] data.table_1.12.3 usethis_1.4.0

loaded via a namespace (and not attached):
[1] compiler_3.4.4 tools_3.4.4 fs_1.2.7 glue_1.3.1 yaml_2.2.0 Rcpp_1.0.1

MichaelChirico · 2019-07-24T02:11:47Z

Works on my Mac and Linux, I guess it's a Windows thing?

shrektan · 2019-08-04T06:55:01Z

An example that reproduces the issue on Mac or Linux, using the latin-1 encoding:

library(data.table)
utf8 = c("\u00e7ile", "\u00de")
latin1 = iconv(utf8, from = "UTF-8", to = "latin1")
tbl <- as.data.table(setNames(list(1, 2), utf8))
tbl[]
#>    çile Þ
#> 1:    1 2
Encoding(colnames(tbl))
#> [1] "UTF-8" "UTF-8"
tbl[, .(a = sum(`çile`)), keyby = `Þ`]
#>    Þ a
#> 1: 2 1
tbl[, .(a = sum(sort(`çile`))), keyby = `Þ`]
#>    Þ a
#> 1: 2 1
setnames(tbl, colnames(tbl), latin1)
Encoding(colnames(tbl))
#> [1] "latin1" "latin1"
tbl[, .(a = sum(`çile`)), keyby = `Þ`]
#>    Þ a
#> 1: 2 1
tbl[, .(a = sum(sort(`çile`))), keyby = `Þ`]
#> Error in sort(çile): object 'çile' not found

Created on 2019-08-04 by the reprex package (v0.2.1)

shrektan · 2019-08-04T09:51:01Z

The error is thrown from

data.table/src/dogroups.c

Line 258 in a8e0230

PROTECT(jval = eval(jexp, env));

I don't know how to debug C code that involves R's language or environment types. I failed to find them in R-internals or R-ext... If anybody can show me how to print out the info related to those objects, it would be much appreciated.

jangorecki · 2019-08-04T10:31:30Z

@shrektan maybe just modify the jexp that you are passing from R to include debug cat calls? Those will then be printed during the eval invoked from C.

MichaelChirico · 2019-08-17T02:44:31Z

Possibly related (though I think not): #1726

MichaelChirico · 2019-09-05T03:23:56Z

The difference between these two is GForce:

tbl[, .(a = sum(汉语)), keyby = 中文] # GForced
tbl[, .(a = sum(sort(汉语))), keyby = 中文] # not

Are you sure keyby is part of the issue? I would be surprised
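One way to check the GForce observation is to ask data.table to report its optimizations. The following is only a sketch on a toy table with ASCII column names (the names `dt`, `g` and `x` are made up for illustration): `verbose = TRUE` prints whether j was GForce-optimized, and lowering the `datatable.optimize` option to 1 should turn GForce off, so the simple query goes through the same by-group evaluation as the nested one.

```r
library(data.table)

# Toy table with ASCII names, so encoding plays no role here.
dt <- data.table(g = c(1, 1, 2), x = c(3, 1, 2))

# verbose = TRUE makes data.table print whether j was GForce-optimized.
res_simple <- dt[, .(a = sum(x)), keyby = g, verbose = TRUE]
res_nested <- dt[, .(a = sum(sort(x))), keyby = g, verbose = TRUE]

# Temporarily lower the optimization level so that GForce is not used,
# then restore the previous option value.
old <- options(datatable.optimize = 1L)
res_no_gforce <- dt[, .(a = sum(x)), keyby = g]
options(old)
```

If GForce really is the only difference, the simple `sum()` query would be expected to hit the same error as the `sum(sort())` one once GForce is switched off and the column names are in a non-native encoding; that has not been verified here.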

MichaelChirico · 2019-09-05T03:25:42Z

The Mac/Linux example also doesn't reproduce the error for me (Mac)

@shrektan following Jan's suggestion, could you try running this?

tbl[, .(a = {print(ls()); print(names(.SD)); print(.BY); sum(汉语)}), by = 中文]

shrektan · 2019-09-05T16:12:58Z

@MichaelChirico

So you are saying you can't reproduce the code below on macOS?

library(data.table)
utf8 = c("\u00e7ile", "\u00de")
latin1 = iconv(utf8, from = "UTF-8", to = "latin1")
tbl <- as.data.table(setNames(list(1, 2), latin1))
tbl[, .(a = sum(`çile`)), keyby = `Þ`]
#>    Þ a
#> 1: 2 1
tbl[, .(a = sum(sort(`çile`))), keyby = `Þ`]
#> Error in sort(çile): object 'çile' not found
tbl[, .(a = {print(ls()); print(names(.SD)); print(Encoding(names(.SD))); print(.BY);  sum(sort(`çile`))}), keyby = `Þ`]
#> [1] "\xe7ile"   "Cfastmean" "print"     "strptime"  "Þ"        
#> [1] "çile"
#> [1] "latin1"
#> $Þ
#> [1] 2
#> Error in sort(çile): object 'çile' not found

Created on 2019-09-06 by the reprex package (v0.3.0)

Session info

devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.6.1 (2019-07-05)
#>  os       macOS Mojave 10.14.5        
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       Asia/Shanghai               
#>  date     2019-09-06                  
#> 
#> ─ Packages ──────────────────────────────────────────────────────────────
#>  package     * version date       lib source        
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.6.0)
#>  backports     1.1.4   2019-04-10 [1] CRAN (R 3.6.0)
#>  callr         3.3.1   2019-07-18 [1] CRAN (R 3.6.0)
#>  cli           1.1.0   2019-03-19 [1] CRAN (R 3.6.0)
#>  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.6.0)
#>  data.table  * 1.12.3  2019-08-24 [1] local         
#>  desc          1.2.0   2018-05-01 [1] CRAN (R 3.6.0)
#>  devtools      2.1.0   2019-07-06 [1] CRAN (R 3.6.0)
#>  digest        0.6.20  2019-07-04 [1] CRAN (R 3.6.0)
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 3.6.0)
#>  fs            1.3.1   2019-05-06 [1] CRAN (R 3.6.0)
#>  glue          1.3.1   2019-03-12 [1] CRAN (R 3.6.0)
#>  highr         0.8     2019-03-20 [1] CRAN (R 3.6.0)
#>  htmltools     0.3.6   2017-04-28 [1] CRAN (R 3.6.0)
#>  knitr         1.24    2019-08-08 [1] CRAN (R 3.6.0)
#>  magrittr      1.5     2014-11-22 [1] CRAN (R 3.6.0)
#>  memoise       1.1.0   2017-04-21 [1] CRAN (R 3.6.0)
#>  pkgbuild      1.0.4   2019-08-05 [1] CRAN (R 3.6.0)
#>  pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.6.0)
#>  prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.6.0)
#>  processx      3.4.1   2019-07-18 [1] CRAN (R 3.6.0)
#>  ps            1.3.0   2018-12-21 [1] CRAN (R 3.6.0)
#>  R6            2.4.0   2019-02-14 [1] CRAN (R 3.6.0)
#>  Rcpp          1.0.2   2019-07-25 [1] CRAN (R 3.6.0)
#>  remotes       2.1.0   2019-06-24 [1] CRAN (R 3.6.0)
#>  rlang         0.4.0   2019-06-25 [1] CRAN (R 3.6.0)
#>  rmarkdown     1.15    2019-08-21 [1] CRAN (R 3.6.0)
#>  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.6.0)
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.6.0)
#>  stringi       1.4.3   2019-03-12 [1] CRAN (R 3.6.0)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 3.6.0)
#>  testthat      2.2.1   2019-07-25 [1] CRAN (R 3.6.0)
#>  usethis       1.5.1   2019-07-04 [1] CRAN (R 3.6.0)
#>  withr         2.1.2   2018-03-15 [1] CRAN (R 3.6.0)
#>  xfun          0.9     2019-08-21 [1] CRAN (R 3.6.0)
#>  yaml          2.2.0   2018-07-25 [1] CRAN (R 3.6.0)
#> 
#> [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library

shrektan · 2019-09-05T16:37:09Z

I see: the first ls() element, "\xe7ile", is the latin-1 representation of the UTF-8 string "çile" (iconv("\xe7ile", 'latin1', 'UTF-8') returns "çile").
So when the expression is evaluated in that environment, R complains that the latter can't be found.
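The byte-level difference between the two spellings can be inspected with base R alone. This is a small sketch (the variable names `utf8_name` and `latin1_name` are only for illustration); it shows that the two strings compare as identical at the R level even though their underlying bytes differ, which is presumably why a lookup that goes by raw bytes can miss.

```r
# Force a UTF-8 representation, then derive the latin-1 one from it.
utf8_name   <- enc2utf8("\u00e7ile")
latin1_name <- iconv(utf8_name, from = "UTF-8", to = "latin1")

# The declared encoding of the converted string.
Encoding(latin1_name)
#> [1] "latin1"

# The underlying bytes differ: "ç" is two bytes in UTF-8, one in latin-1.
charToRaw(utf8_name)
#> [1] c3 a7 69 6c 65
charToRaw(latin1_name)
#> [1] e7 69 6c 65

# identical() treats strings in different marked encodings as equal
# when they agree after translation to UTF-8.
identical(utf8_name, latin1_name)
#> [1] TRUE
```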

I think R always parses the quoted expression into a parse tree using the native encoding (see the code below). If a data.table column name is not in the native encoding and the expression is evaluated via Rf_eval(), the error occurs.

env <- new.env(parent = emptyenv())
utf8 = c("\u00e7ile", "\u00de")
latin1 = iconv(utf8, from = "UTF-8", to = "latin1")
assign(latin1[1], 1, pos = env)
ls(env)
#> [1] "çile"
Encoding(ls(env))
#> [1] "unknown"
assign(utf8[1], 1, pos = env)
ls(env)
#> [1] "çile"
Encoding(ls(env))
#> [1] "unknown"

Created on 2019-09-06 by the reprex package (v0.3.0)

In conclusion, the fix should be to convert the variable names to natively encoded strings first when preparing the env object...
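Until such a fix exists, a user-level workaround along the same lines might be to re-encode the names to the native encoding before grouping. The sketch below uses base R only; `to_native_names()` is a hypothetical helper, not part of data.table, and the conversion is only lossless when the characters are representable in the session's native encoding.

```r
# Hypothetical helper (not a data.table function): re-encode a vector of
# names into the native encoding of the current session.
to_native_names <- function(nms) {
  stopifnot(is.character(nms))
  enc2native(nms)
}

# One ASCII name and one latin-1 encoded name, as in the examples above.
nms <- c("id", iconv(enc2utf8("\u00e7ile"), from = "UTF-8", to = "latin1"))
Encoding(nms)
#> [1] "unknown" "latin1"

# The resulting encoding marks depend on the session locale.
native_nms <- to_native_names(nms)
Encoding(native_nms)
```

With a data.table this could be applied as setnames(tbl, colnames(tbl), enc2native(colnames(tbl))), mirroring the enc2utf8 call in the first example; whether that avoids the error in every locale is an assumption that has not been tested here.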

MichaelChirico · 2019-09-05T22:54:25Z

Indeed I do get the error! Dunno what was happening last time.

@shrektan here is what I guess are the guilty lines with assign, can you take it from here?

data.table/R/data.table.R

Line 1234 in 27f7516

for (ii in ansvars) assign(ii, SDenv$.SDall[[ii]], SDenv)

shrektan added bug platform-specific labels Jul 23, 2019

shrektan self-assigned this Jul 23, 2019

shrektan changed the title ~~[BUG] UTF-8 ColName causes error in complex j expression with keyby~~ UTF-8 Colnames cause error in complex j expression with keyby Jul 23, 2019

shrektan changed the title ~~UTF-8 Colnames cause error in complex j expression with keyby~~ Multibytes colnames in non-native encoding cause errors in complex j expression with keyby Aug 4, 2019

shrektan added the encoding issues related to Encoding label Sep 9, 2019

shrektan mentioned this issue Apr 6, 2020

non-ascii tests #4351

Draft

shrektan mentioned this issue Dec 21, 2020

UTF encoding in .SDcols causes error when using 'by' argument if there are any accents. #4856

Open
