Aliasing issue with := affecting a different column #5400

@aquasync

The below triggers the bug for me (note the assignment to col2 changing the value of col1!):

library(data.table)

coalesce = function(x, ...) {
  for (y in list(...)) {
    idx = is.na(x)
    x[idx] = if (length(y) != 1) y[idx] else y
  }
  x
}

dt = data.table(id=1:64, col1=0, col2=0)
print(dt[1, .(col1, col2)])
#    col1 col2
# 1:    0    0
dt[, col1 := coalesce(col2, 111)]
dt[, col2 := 999]
print(dt[1, .(col1, col2)])
#    col1 col2
# 1:  999  999

And my sessionInfo() output:

R version 4.0.5 (2021-03-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252    LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
[5] LC_TIME=English_Australia.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.2

loaded via a namespace (and not attached):
[1] compiler_4.0.5 tools_4.0.5   

It looks like col1 and col2 end up pointing at the same vector, so := modifies them both. My guess is that they are shared but the reference counts are off, so := thinks it is safe to modify in place. It's not clear to me whether the underlying bug is in base R or data.table.

When trying to put together a minimal repro, I noticed a few different changes that make this bug disappear:

  • Simply printing the data table between the col1 and col2 assignments makes the issue go away.

  • It only manifests when the table has at least 64 rows. Perhaps 64 is a threshold at which some sort of copy-on-write optimization kicks in somewhere?

  • The problem also seems to be related to the coalesce function used here, even though it has no effect in this example (there are no NAs to fill). E.g. replacing it with coalesce = function(x, ...) x avoids the issue. It seems as though base R is doing something odd with [<- when given an all-FALSE logical subset; maybe the result is the same object but no longer marked as shared? Note that assigning to col1 after coalesce does not affect col2, only vice versa. Alternatively, returning x[] in coalesce bypasses the erroneous sharing by forcing a copy or bumping the ref count.
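
One way to probe the suspected sharing is to compare the columns' memory addresses with data.table's address(). This is only a rough sketch, not a confirmed diagnosis. Matching addresses would show that both columns are the same R object; differing addresses do not by themselves rule out shared underlying storage. The second half uses copy() as a possible workaround. That is an assumption rather than a confirmed fix, in the same spirit as returning x[] from coalesce. The names shared and dt2 are introduced here for illustration only.

```r
library(data.table)

# Same helper as in the report above.
coalesce = function(x, ...) {
  for (y in list(...)) {
    idx = is.na(x)
    x[idx] = if (length(y) != 1) y[idx] else y
  }
  x
}

dt = data.table(id=1:64, col1=0, col2=0)
dt[, col1 := coalesce(col2, 111)]

# TRUE would mean col1 and col2 are literally the same R object.
shared = address(dt$col1) == address(dt$col2)
print(shared)

# Possible workaround (assumption): force an explicit deep copy of the
# value before assigning it, so col1 cannot share storage with col2.
dt2 = data.table(id=1:64, col1=0, col2=0)
dt2[, col1 := copy(coalesce(col2, 111))]
dt2[, col2 := 999]
print(dt2[1, .(col1, col2)])
```

With the explicit copy in place, col1 should keep its value of 0 after the assignment to col2.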
