Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R-Forge #2590] During grouping, last group overwrites memory used by previous groups #507

Open
arunsrinivasan opened this issue Jun 8, 2014 · 5 comments
Labels

Comments

@arunsrinivasan
Copy link
Member

Submitted by: Matt Dowle; Assigned to: Nobody; R-Forge link

As raised here on SO. I had a look and made progress but couldn't for the life of me work out why the last group is reused.

@jangorecki jangorecki changed the title [R-Forge #2590] Odd behaviour with update() in j [R-Forge #2590] During grouping, last group overwrites memory used by previous groups Apr 5, 2020
@jangorecki jangorecki added the High label Apr 5, 2020
@jangorecki
Copy link
Member

jangorecki commented Apr 6, 2020

This is a broad issue caused by a memory location being re-used between groups. This is normally not a problem, but in some cases it can manifest. For example in case of plot function where graphics devices access the memory which has been already re-used by further groups, or when objects are being modified by reference, like in stats::update.

Linked SO example simplified:

DT = data.table(id1 = c(1L,1L,1L,2L,2L), v1 = c(15,13,14,18,20), v2 = c(8,7,7,13,15))
fit = DT[, list(list(lm(v1 ~ v2))), by = id1]
sapply(fit$V1, nobs)
#[1] 3 2
fit2 = fit[, list(list(update(V1[[1L]], ~ . -v1))), by = id1]
sapply(fit2$V1, nobs)
#[1] 3 3 ## should be: 3 2

For now the workaround is to copy objects that might be modifed (like update), or used by reference (plot).

Related issues:

@jan-glx
Copy link
Contributor

jan-glx commented May 16, 2023

With reference counting on by default now, shouldn't it now be possible to fix? (Was just again badly bitten by this)

@ben-schwen
Copy link
Member

AFAIK reference counting is only the default from 4.0.0 onwards, right? Since we still support 3.1.0

@jangorecki
Copy link
Member

We are not limited by R version. Already in many places we provide features conditionally, so we could use reference counting if R supports it, otherwise just leave as it is. Myself I would be more in favour of avoiding such a big feature to be used conditionally, as maintenance of it can be extra burden. On the other hand it can be good for a transition period, in case of a bug in new mechanism, user can easily fall back to previous implementation.

@jan-glx
Copy link
Contributor

jan-glx commented May 19, 2023

Trying to avoid version specific behavior is very reasonable but perhaps we can at least show a warning if this issue is detected (only in R≥4) (without otherwise changing behavior)? From outside of the package this would be possible with something like:

library(data.table)

with_seatbelts <- function(expr){
  vars <- ls(envir = parent.frame())
  n_refs_before <- sapply(vars, pryr:::named2, parent.frame())
  res <- eval(expr, envir = parent.frame())
 
  n_refs_after <- sapply(vars, pryr:::named2, parent.frame())
  vars_with_refs_increased <- vars[n_refs_after>n_refs_before]
  #print(data.table(vars,n_refs_before,n_refs_after))
  if(as.integer(R.version$major) >= 4 && evalq(.NGRP>.GRP, envir = parent.frame()) && length(vars_with_refs_increased)>0) warning(paste0(
    "Variable '", vars_with_refs_increased, "' used in i such at least one reference persists, ",
    "data.data should make a copy of it now but instead is changing it by reference. ",
    "This might cause some issues for you. Avoid them by using 'copy(", vars_with_refs_increased ,")' instead. ",
    "If R was compiled with reference counting turned off (the default for R<4.0) you might not see this warning). ",
    "See https://github.com/Rdatatable/data.table/issues/507 for details.\n", collapse= "\n"
    ))
  res
}


x <- list()
dt <- data.table(ii=1:3, letter=letters[1:3])

dt[, with_seatbelts({x[[ii]] <<- letter[]; NULL}), by=letter]
#> Empty data.table (0 rows and 1 cols): letter
x
#> [[1]]
#> [1] "a"
#> 
#> [[2]]
#> [1] "b"
#> 
#> [[3]]
#> [1] "c"

dt[, with_seatbelts({x[[ii]] <<- letter; NULL}), by=letter]
#> Warning in with_seatbelts({: Variable 'letter' used in i such at least one reference persists, data.data should make a copy of it now but instead is changing it by reference. This might cause some issues for you. Avoid them by using 'copy(letter)' instead. If R was compiled with reference counting turned off (the default for R<4.0) you might not see this warning). See https://github.com/Rdatatable/data.table/issues/507 for details.

#> Warning in with_seatbelts({: Variable 'letter' used in i such at least one reference persists, data.data should make a copy of it now but instead is changing it by reference. This might cause some issues for you. Avoid them by using 'copy(letter)' instead. If R was compiled with reference counting turned off (the default for R<4.0) you might not see this warning). See https://github.com/Rdatatable/data.table/issues/507 for details.
#> Empty data.table (0 rows and 1 cols): letter
x
#> [[1]]
#> [1] "c"
#> 
#> [[2]]
#> [1] "c"
#> 
#> [[3]]
#> [1] "c"

(This does does a check for every group, doing it once after all groups should be enough)

In R<4.4, checking if NAMED=2 would probably lead to too many false positives ...
This would avoid such bugs for the vast majority of users (package develops would fix them in their code for R<4.0 users as well). Then you can implement automatic coping in data.table at some point for everyone's convince after you drop support for R<4.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants