merge() result incorrect (NAs) when by column contains non-ASCII characters #2072

Open
madlogos opened this issue Mar 22, 2017 · 5 comments
@madlogos

When the by column contains non-ASCII characters (e.g., Chinese), the merge() function returns NA unexpectedly.

library(data.table)
d1=data.table(a=c("你", "我", "a", "他"), b=1:4)
d2=data.table(a=c("我", "他", "a"), c=3:5)
merge(d1, d2, by="a", all.x=TRUE, sort=FALSE)

should return

   a b  c
1 你 1 NA
2 我 2  3
3  a 3  5
4 他 4  4

right?

but actually turns out to be:

    a b  c
1: 你 1 NA
2: 我 2  3
3:  a 3  5
4: 他 4 NA

Could you look into this problem?

> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252       LC_CTYPE=Chinese (Simplified)_People's Republic of China.936
[3] LC_MONETARY=English_United States.1252   LC_NUMERIC=C                                                
[5] LC_TIME=English_United States.1252                          

attached base packages:
[1] compiler  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] data.table_1.10.4  
@madlogos
Author

The function works well under Linux.
If I use merge(as.data.frame(d1), as.data.frame(d2), by="a", all.x=TRUE, sort=FALSE), then I can also get the expected result.
I guess it is caused by an encoding problem on the Windows platform.
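One way to test that guess is to look at the declared encoding of the key values and normalize both key columns to UTF-8 before merging. This is only a diagnostic sketch: it assumes the mismatch comes from how the strings are marked in the session, which the thread does not confirm, and the `\u` escapes stand in for the Chinese characters so the script does not depend on the file encoding.

```r
library(data.table)

# "\u4f60" = 你, "\u6211" = 我, "\u4ed6" = 他
d1 <- data.table(a = c("\u4f60", "\u6211", "a", "\u4ed6"), b = 1:4)
d2 <- data.table(a = c("\u6211", "\u4ed6", "a"), c = 3:5)

# Show how R has marked each string ("UTF-8", "latin1", "unknown", ...)
print(Encoding(d1$a))
print(Encoding(d2$a))

# Assumption: re-encoding both key columns to UTF-8 removes any mismatch
d1[, a := enc2utf8(a)]
d2[, a := enc2utf8(a)]

res <- merge(d1, d2, by = "a", all.x = TRUE, sort = FALSE)
print(res)
```

If the two `Encoding()` outputs differ on the affected machine, that would support the encoding explanation; if they match, the cause is likely elsewhere.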

@arunsrinivasan
Member

Confirmed on Windows 10 under R-GUI, DT v1.10.4.

@jangorecki
Member

I would say the best way to go is to use ASCII column names, and if non-ASCII names are needed, use labels instead. That will be covered by #623.

@shrektan
Member

@jangorecki No, this issue has nothing to do with non-ASCII column names. It's about non-ASCII values, and it has already been fixed.

I just confirmed that on Windows 10 with the dev version, the result is as expected.

        a     b     c
   <char> <int> <int>
1:     你     1    NA
2:     我     2     3
3:      a     3     5
4:     他     4     4

We should close this issue.

@jangorecki
Member

Uh, OK, sorry then. Thanks for checking.
We should try to close it by providing a unit test. AFAIK it is possible to have non-ASCII in tests, but it has to be encoded properly, which requires mapping the Chinese characters to Unicode escapes like \u000.
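A minimal sketch of what such a test could look like, using `\u` escapes so the test file itself stays ASCII. It uses plain `stopifnot()` rather than the package's own test harness, and the expected values are taken from the result reported at the top of this issue.

```r
library(data.table)

# "\u4f60" = 你, "\u6211" = 我, "\u4ed6" = 他
d1 <- data.table(a = c("\u4f60", "\u6211", "a", "\u4ed6"), b = 1:4)
d2 <- data.table(a = c("\u6211", "\u4ed6", "a"), c = 3:5)

res <- merge(d1, d2, by = "a", all.x = TRUE, sort = FALSE)

# Row order follows d1 because sort = FALSE; only the first key has no match in d2
stopifnot(identical(res$b, 1:4))
stopifnot(identical(res$c, c(NA, 3L, 5L, 4L)))
```

On a build that still has the bug, the second check should fail because the last row comes back as NA.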
