diff --git a/NEWS.md b/NEWS.md index 62b0415c0..308d42725 100644 --- a/NEWS.md +++ b/NEWS.md @@ -12,6 +12,8 @@ 2. A regression in v1.13.0 resulted in installation on Mac often failing with `shared object 'datatable.so' not found`, and FreeBSD always failing with `expr: illegal option -- l`, [#4652](https://github.com/Rdatatable/data.table/issues/4652) [#4640](https://github.com/Rdatatable/data.table/issues/4640) [#4650](https://github.com/Rdatatable/data.table/issues/4650). Thanks to many for assistance including Simon Urbanek, Brian Ripley, Wes Morgan, and @ale07alvarez. There were no installation problems on Windows or Linux. +3. Operating on columns of type `list`, e.g. `dt[, listCol[[1]], by=id]`, suffered a performance regression in v1.13.0, [#4646](https://github.com/Rdatatable/data.table/issues/4646) [#4658](https://github.com/Rdatatable/data.table/issues/4658). Thanks to @fabiocs8 and @sandoronodi for the detailed reports, and to Cole Miller for substantial debugging, investigation and proposals at C level which enabled the root cause to be fixed. + ## NOTES 1. `bit64` v4.0.2 and `bit` v4.0.3, both released on 30th July, broke `data.table`'s tests. It seems that reverse dependency testing of `bit64` (i.e. testing of the packages which use `bit64`) did not include `data.table` because `data.table` suggests `bit64` but does not depend on it. Like other packages on our `Suggest` list, we test `data.table` works with `bit64` in our tests. In testing of our own reverse dependencies (packages which use `data.table`) we do include packages which suggest `data.table`, although it appears it is not CRAN policy to do so. We have requested that CRAN policy be improved to include suggests in reverse dependency testing. diff --git a/inst/tests/tests.Rraw b/inst/tests/tests.Rraw index c966a5efc..058c7db3b 100644 --- a/inst/tests/tests.Rraw +++ b/inst/tests/tests.Rraw @@ -17127,3 +17127,14 @@ test(2151, dcast(DT, 1 ~ a, value.var='survmean'), data.table('.'='.', s=1L, x=2 y = person(given='Joel', family='Mossong') test(2152, copy(y), y) +# .N and .GRP special statics copied correctly when placed as a vector in a list column; part of PR#4655 +# see comments in anySpecialStatic() at the top of dogroups.c +# .SD, .I and .BY are covered by previous tests +DT = data.table(x=c(1L,2L,2L), y=1:3) +test(2153.1, DT[, .(list(.N)), by=x], data.table(x=1:2, V1=as.list(1:2))) +test(2153.2, DT[, .(list(.GRP)), by=x], data.table(x=1:2, V1=as.list(1:2))) +test(2153.3, ans<-DT[, .(list(.NGRP)), by=x], data.table(x=1:2, V1=list(2L,2L))) +test(2153.4, address(ans$V1[[1L]]), address(ans$V1[[2L]])) # .NGRP doesn't change group to group so the same object can be referenced many times unlike .N and .GRP +test(2153.5, DT[, .(list(c(0L,.N,0L))), by=x], # c() here will create new object so this is ok anyway; i.e. address(.N) is not present in j's result + data.table(x=1:2, V1=list(c(0L,1L,0L), c(0L,2L,0L)))) + diff --git a/src/assign.c b/src/assign.c index 88fc26065..579af7d3e 100644 --- a/src/assign.c +++ b/src/assign.c @@ -670,13 +670,6 @@ SEXP assign(SEXP dt, SEXP rows, SEXP cols, SEXP newcolnames, SEXP values) return(dt); // needed for `*tmp*` mechanism (when := isn't used), and to return the new object after a := for compound syntax. } -static bool anyNamed(SEXP x) { - if (MAYBE_REFERENCED(x)) return true; - if (isNewList(x)) for (int i=0; i #include +static bool anySpecialStatic(SEXP x) { + // Special refers to special symbols .BY, .I, .N, and .GRP; see special-symbols.Rd + // Static because these are like C static arrays which are the same memory for each group; e.g., dogroups + // creates .SD for the largest group once up front, overwriting the contents for each group. Their + // value changes across group but not their memory address. (.NGRP is also special static but its value + // is constant across groups so that's excluded here.) + // This works well, other than a relatively rare case when two conditions are both true : + // 1) the j expression returns a group column as-is without doing any aggregation + // 2) that result is placed in a list column result + // The list column result can then incorrectly contain the result for the last group repeated for all + // groups because the list column ends up holding a pointer to these special static vectors. + // See test 2153, and to illustrate here, consider a simplified test 1341 + // > DT + // x y + // + // 1: 1 1 + // 2: 2 2 + // 3: 1 3 + // 4: 2 4 + // > DT[, .(list(y)), by=x] + // x V1 + // + // 1: 1 2,4 # should be 1,3 + // 2: 2 2,4 + // + // This has been fixed for a decade but the solution has changed over time. + // + // We don't wish to inspect the j expression for these cases because there are so many; e.g. user defined functions. + // A special symbol does not need to appear in j for the problem to occur. Using a member of .SD is enough as the example above illustrates. + // Using R's own reference counting could invoke too many unnecessary copies because these specials are routinely referenced. + // Hence we mark these specials (SD, BY, I) here in dogroups and if j's value is being assigned to a list column, we check to + // see if any specials are present and copy them if so. + // This keeps the special logic in one place in one file here. Previously this copy was done by memrecycle in assign.c but then + // with PR#4164 started to copy input list columns too much. Hence PR#4655 in v1.13.2 moved that copy here just where it is needed. + // Currently the marker is negative truelength. These specials are protected by us here and before we release them + // we restore the true truelength for when R starts to use vector truelength. + if (isVectorAtomic(x)) + return TRUELENGTH(x)<0; + if (isNewList(x)) for (int i=0; i maxGrpSize) maxGrpSize = ilens[i]; } defineVar(install(".I"), I = PROTECT(allocVector(INTSXP, maxGrpSize)), env); nprotect++; + SET_TRUELENGTH(I, -maxGrpSize); // marker for anySpecialStatic(); see its comments R_LockBinding(install(".I"), env); SEXP dtnames = PROTECT(getAttrib(dt, R_NamesSymbol)); nprotect++; // added here to fix #91 - `:=` did not issue recycling warning during "by" @@ -69,23 +118,25 @@ SEXP dogroups(SEXP dt, SEXP dtcols, SEXP groups, SEXP grpcols, SEXP jiscols, SEX SEXP names = PROTECT(getAttrib(SDall, R_NamesSymbol)); nprotect++; if (length(names) != length(SDall)) error(_("length(names)!=length(SD)")); SEXP *nameSyms = (SEXP *)R_alloc(length(names), sizeof(SEXP)); + for(int i=0; i1 && thislen!=maxn && grpn>0) { // grpn>0 for grouping empty tables; test 1986 error(_("Supplied %d items for column %d of group %d which has %d rows. The RHS length must either be 1 (single values are ok) or match the LHS length exactly. If you wish to 'recycle' the RHS please use rep() explicitly to make this intent clear to readers of your code."), thislen, j+1, i+1, maxn); } + bool copied = false; + if (isNewList(target) && anySpecialStatic(source)) { // see comments in anySpecialStatic() + source = PROTECT(duplicate(source)); + copied = true; + } memrecycle(target, R_NilValue, thisansloc, maxn, source, 0, -1, 0, ""); + if (copied) UNPROTECT(1); } } ansloc += maxn; @@ -358,8 +422,20 @@ SEXP dogroups(SEXP dt, SEXP dtcols, SEXP groups, SEXP grpcols, SEXP jiscols, SEX } } else ans = R_NilValue; // Now reset length of .SD columns and .I to length of largest group, otherwise leak if the last group is smaller (often is). - for (int j=0; j0;