Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected missing matches with non-equi join with grouping by .EACHI #4911

Closed
adamaltmejd opened this issue Feb 19, 2021 · 5 comments · Fixed by #4917
Closed

Unexpected missing matches with non-equi join with grouping by .EACHI #4911

adamaltmejd opened this issue Feb 19, 2021 · 5 comments · Fixed by #4917
Assignees
Labels
Milestone

Comments

@adamaltmejd
Copy link

Working with proprietary data so was a bit tricky creating a reproducible example but think this works.

X <- setDT(structure(list(id = c(6456372L, 6456372L, 6456372L, 6456372L, 
6456372L, 6456372L, 6456372L, 6456372L, 6456372L, 6456372L, 6456372L, 
6456372L, 6456372L, 6456372L), id_round = c(197801L, 199405L, 
199501L, 197901L, 197905L, 198001L, 198005L, 198101L, 198105L, 
198201L, 198205L, 198301L, 198305L, 198401L), field = c(NA, NA, 
NA, "medicine", "medicine", "medicine", "medicine", "medicine", 
"medicine", "medicine", "medicine", "medicine", "medicine", "medicine"
)), class = c("data.table", "data.frame"
), sorted = "id"))

Y <- setDT(structure(list(id = c(6456372L, 6456345L, 6456356L), id_round = c(197705L, 
197905L, 201705L), field = c("medicine", "teaching", "health"
), prio = c(6L, 1L, 10L)), class = c("data.table", 
"data.frame"), sorted = c("id_round", 
"id", "prio", "field")))

X[Y, on = .(id, id_round > id_round, field), .(x.id_round[1], i.id_round[1]), by = .EACHI]
id id_round    field     V1     V2
1: 6456372   197705 medicine 197901 197705
2: 6456345   197905 teaching     NA 197905
3: 6456356   201705   health     NA 201705

So everything seems to work fine, but these results are supposed to be merged back into the main data set Y and here is where I run in to trouble. It does not merge and moreover I cannot subset by id anymore:

> X[Y, on = .(id, id_round > id_round, field), .(x.id_round[1], i.id_round[1]), by = .EACHI][id == 6456372]              
Empty data.table (0 rows and 5 cols): id,id_round,field,V1,V2

Expecting to find a match here of course. The strange thing is that it works if I drop by=.EACHI or if I drop the last key column "prio":

> X[Y, on = .(id, id_round > id_round, field), .(id, field, x.id_round[1], i.id_round[1])][id == 6456372]                
         id    field     V3     V4
 1: 6456372 medicine 197901 197705
 2: 6456372 medicine 197901 197705
 3: 6456372 medicine 197901 197705
 4: 6456372 medicine 197901 197705
 5: 6456372 medicine 197901 197705
 6: 6456372 medicine 197901 197705
 7: 6456372 medicine 197901 197705
 8: 6456372 medicine 197901 197705
 9: 6456372 medicine 197901 197705
10: 6456372 medicine 197901 197705
11: 6456372 medicine 197901 197705
> X[Y[, .(id, id_round, field)], on = .(id, id_round > id_round, field), .(x.id_round[1], i.id_round[1]), by = .EACHI][id == 6456372]                                                                                                             
        id id_round    field     V1     V2
1: 6456372   197705 medicine 197901 197705

Y is keyed by "prio" but it is not included in the join. It seems to be related to the id number's relation to the other numbers, cause if I change the number to 6456344 or anything lower I get the expected results.

Running latest dev:

> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8      
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.13.7 colorout_1.2-2   

loaded via a namespace (and not attached):
[1] compiler_4.0.4 jsonlite_1.7.2 rlang_0.4.10  
@myoung3
Copy link
Contributor

myoung3 commented Feb 19, 2021

reproduced on dev data.table in R 4.0.3 (windows x86)

library(data.table)
X <- as.data.table(as.data.frame(structure(list(id = c(6456372L, 6456372L, 6456372L, 6456372L, 
                                 6456372L, 6456372L, 6456372L, 6456372L, 6456372L, 6456372L, 6456372L, 
                                 6456372L, 6456372L, 6456372L), id_round = c(197801L, 199405L, 
                                                                             199501L, 197901L, 197905L, 198001L, 198005L, 198101L, 198105L, 
                                                                             198201L, 198205L, 198301L, 198305L, 198401L), field = c(NA, NA, 
                                                                                                                                     NA, "medicine", "medicine", "medicine", "medicine", "medicine", 
                                                                                                                                     "medicine", "medicine", "medicine", "medicine", "medicine", "medicine"
                                                                             )), class = c("data.table", "data.frame"
                                                                             ), sorted = "id")))

Y <- as.data.table(as.data.frame(structure(list(id = c(6456372L, 6456345L, 6456356L), id_round = c(197705L, 
                                                                             197905L, 201705L), field = c("medicine", "teaching", "health"
                                                                             ), prio = c(6L, 1L, 10L)), class = c("data.table", 
                                                                                                                  "data.frame"), sorted = c("id_round", 
                                                                                                                                            "id", "prio", "field"))))

X[Y, on = .(id, id_round > id_round, field), .(x.id_round[1], i.id_round[1]), by = .EACHI]
#>         id id_round    field     V1     V2
#> 1: 6456372   197705 medicine 197901 197705
#> 2: 6456345   197905 teaching     NA 197905
#> 3: 6456356   201705   health     NA 201705

temp <- X[Y, on = .(id, id_round > id_round, field), .(x.id_round[1], i.id_round[1]), by = .EACHI]
temp[id == 6456372]  
#>         id id_round    field     V1     V2
#> 1: 6456372   197705 medicine 197901 197705
as.data.table(as.data.frame(temp))[id == 6456372]
#>         id id_round    field     V1     V2
#> 1: 6456372   197705 medicine 197901 197705

Created on 2021-02-19 by the reprex package (v0.3.0)

@myoung3
Copy link
Contributor

myoung3 commented Feb 19, 2021

coercing X and Y to data.frame then back to data.table fixes this, so this is almost certainly an issue introduced in setDT

library(data.table)
X <- as.data.table(as.data.frame(structure(list(id = c(6456372L, 6456372L, 6456372L, 6456372L, 
                                 6456372L, 6456372L, 6456372L, 6456372L, 6456372L, 6456372L, 6456372L, 
                                 6456372L, 6456372L, 6456372L), id_round = c(197801L, 199405L, 
                                                                             199501L, 197901L, 197905L, 198001L, 198005L, 198101L, 198105L, 
                                                                             198201L, 198205L, 198301L, 198305L, 198401L), field = c(NA, NA, 
                                                                                                                                     NA, "medicine", "medicine", "medicine", "medicine", "medicine", 
                                                                                                                                     "medicine", "medicine", "medicine", "medicine", "medicine", "medicine"
                                                                             )), class = c("data.table", "data.frame"
                                                                             ), sorted = "id")))

Y <- as.data.table(as.data.frame(structure(list(id = c(6456372L, 6456345L, 6456356L), id_round = c(197705L, 
                                                                             197905L, 201705L), field = c("medicine", "teaching", "health"
                                                                             ), prio = c(6L, 1L, 10L)), class = c("data.table", 
                                                                                                                  "data.frame"), sorted = c("id_round", 
                                                                                                                                            "id", "prio", "field"))))

X[Y, on = .(id, id_round > id_round, field), .(x.id_round[1], i.id_round[1]), by = .EACHI]
#>         id id_round    field     V1     V2
#> 1: 6456372   197705 medicine 197901 197705
#> 2: 6456345   197905 teaching     NA 197905
#> 3: 6456356   201705   health     NA 201705

temp <- X[Y, on = .(id, id_round > id_round, field), .(x.id_round[1], i.id_round[1]), by = .EACHI]
temp[id == 6456372]  
#>         id id_round    field     V1     V2
#> 1: 6456372   197705 medicine 197901 197705
as.data.table(as.data.frame(temp))[id == 6456372]
#>         id id_round    field     V1     V2
#> 1: 6456372   197705 medicine 197901 197705

Created on 2021-02-19 by the reprex package (v0.3.0)

@ColeMiller1
Copy link
Contributor

This is a duplicate of #4603 . Let's keep it open and thanks for the report.

The problem is that data.table incorrectly assigns a key attribute to the resulting join. The problem is that the result isn't sorted:

X[Y, on = .(id, id_round > id_round, field), .(x.id_round[1], i.id_round[1]), by = .EACHI][, key(.SD)]
## [1] "id"       "id_round" "field"   

## let's verify id is sorted... it's not!
X[Y, on = .(id, id_round > id_round, field), .(x.id_round[1], i.id_round[1]), by = .EACHI]
##         id id_round    field     V1     V2
 ##     <int>    <int>   <char>  <int>  <int>
## 1: 6456372   197705 medicine 197901 197705
## 2: 6456345   197905 teaching     NA 197905
## 3: 6456356   201705   health     NA 201705

## data.table uses binary merge because this subset is optimized. Because it thinks id is sorted, there are incorrect results.
X[Y, on = .(id, id_round > id_round, field), .(x.id_round[1], i.id_round[1]), by = .EACHI][id == 6456372L, verbose = TRUE] 

## Optimized subsetting with key 'id'
## forder.c received 1 rows and 1 columns
## forder took 0 sec
## x is already ordered by these columns, no need to call reorder
## i.id has same type (integer) as x.id. No coercion needed.
## on= matches existing key, using key
## Starting bmerge ...
## bmerge done in 0.000s elapsed (0.000s cpu) 
## Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu) 

## Empty data.table (0 rows and 5 cols): id,id_round,field,V1,V2

I do believe I have something on my computer. Let me dust it off and submit something this weekend.

@ColeMiller1 ColeMiller1 self-assigned this Feb 20, 2021
@myoung3
Copy link
Contributor

myoung3 commented Feb 20, 2021

Well that explains why this seemed familiar.

@adamaltmejd
Copy link
Author

Thanks!!

@mattdowle mattdowle added this to the 1.14.1 milestone Mar 17, 2021
@jangorecki jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants