merge() result incorrect (NAs) when `by` column contains non-ASCII characters #2072

madlogos · 2017-03-22T04:30:10Z

When the `by` column contains non-ASCII characters (e.g., Chinese), merge() unexpectedly returns NA for a key that does have a match.

library(data.table)
d1=data.table(a=c("你", "我", "a", "他"), b=1:4)
d2=data.table(a=c("我", "他", "a"), c=3:5)
merge(d1, d2, by="a", all.x=TRUE, sort=FALSE)

should return

    a b  c
1: 你 1 NA
2: 我 2  3
3:  a 3  5
4: 他 4  4

right?

but actually turns out to be:

    a b  c
1: 你 1 NA
2: 我 2  3
3:  a 3  5
4: 他 4 NA

Could you look into this problem?

> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252       LC_CTYPE=Chinese (Simplified)_People's Republic of China.936
[3] LC_MONETARY=English_United States.1252   LC_NUMERIC=C                                                
[5] LC_TIME=English_United States.1252                          

attached base packages:
[1] compiler  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] data.table_1.10.4


madlogos · 2017-03-27T08:57:40Z

The function works correctly under Linux.
If I use merge(as.data.frame(d1), as.data.frame(d2), by="a", all.x=TRUE, sort=FALSE), I also get the expected result.
I guess it is caused by an encoding problem on the Windows platform.
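
One way to probe the encoding guess is to look at how R has marked the key strings and to try the merge after converting both key columns to UTF-8. This is only a sketch: the `enc2utf8()` conversion is an assumed workaround that nobody in this thread has confirmed on the affected Windows setup, and `res` is just an illustrative name.

```r
library(data.table)

d1 <- data.table(a = c("你", "我", "a", "他"), b = 1:4)
d2 <- data.table(a = c("我", "他", "a"), c = 3:5)

# How R has marked each key string: "UTF-8", "latin1", "bytes" or "unknown".
# "unknown" on a non-ASCII string means it is held in the native encoding.
print(Encoding(d1$a))
print(Encoding(d2$a))

# Assumed workaround (not confirmed in this thread): convert both key
# columns to UTF-8 so the join compares strings in a single encoding.
d1[, a := enc2utf8(a)]
d2[, a := enc2utf8(a)]

res <- merge(d1, d2, by = "a", all.x = TRUE, sort = FALSE)
print(res)
```

If the two `Encoding()` calls report different marks for the same character, that would support the encoding explanation.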

arunsrinivasan · 2017-03-30T22:21:11Z

Confirmed on Windows 10 under R-GUI, DT v1.10.4.

jangorecki · 2020-05-23T11:53:12Z

I would say the best way to go is to use ASCII column names, and if non-ASCII names are needed, use labels instead. That will be covered by #623.

shrektan · 2020-05-23T14:48:31Z

@jangorecki No, this issue has nothing to do with non-ascii colnames. It's about the non-ascii values and has been fixed already.

I just confirmed that on Windows 10 with the dev version, the result is as expected.

        a     b     c
   <char> <int> <int>
1:     你     1    NA
2:     我     2     3
3:      a     3     5
4:     他     4     4

We should close this issue.

jangorecki · 2020-05-23T15:10:46Z

Uh, OK sorry then. Thanks for checking.
We should try to close it by providing a unit test. AFAIK it is possible to have non-ASCII characters in tests, but they have to be encoded properly, which requires mapping the Chinese characters to Unicode escapes of the form \uXXXX.
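
A minimal sketch of what such a test could look like, assuming plain `stopifnot()` rather than the package's own test harness. The data is the same as in the original report, with the Chinese characters written as `\u` escapes (你 = U+4F60, 我 = U+6211, 他 = U+4ED6). Note that escaped strings are marked as UTF-8, so this may not exercise the same code path as the natively encoded strings in the report.

```r
library(data.table)

# Same data as the original report, written with \u escapes so the
# test file itself stays pure ASCII.
d1 <- data.table(a = c("\u4f60", "\u6211", "a", "\u4ed6"), b = 1:4)
d2 <- data.table(a = c("\u6211", "\u4ed6", "a"), c = 3:5)

res <- merge(d1, d2, by = "a", all.x = TRUE, sort = FALSE)

# Only the first key (U+4F60) has no match in d2, so only it should be NA.
stopifnot(identical(res$b, 1:4))
stopifnot(identical(res$c, c(NA, 3L, 5L, 4L)))
```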

arunsrinivasan added the bug label Mar 30, 2017

jangorecki added the encoding and joins labels Apr 6, 2020

jangorecki mentioned this issue Apr 6, 2020

Add is.valid(DT) function #2334

Open

jangorecki added the platform-specific label May 23, 2020
