Support renaming columns in .SD by supplying named character vector to .SDcols #3146

renkun-ken · 2018-11-15T00:08:57Z

It would be nice to support functionality like

library(data.table)
x <- as.data.table(mtcars)
x[, .(sum = .SD$a + .SD$b + .SD$disp), by = vs, .SDcols = c(a = "mpg", b = "cyl", "disp")]

so that the columns that a and b point to can easily be determined dynamically, while keeping j readable and fast. This is in contrast with get(), which forces .SD to include all columns. The performance impact can be significant when the operation is done by group.


jangorecki · 2018-11-15T05:06:55Z

OK, now I get it. I hadn't noticed that .SDcols was a named vector.

MichaelChirico · 2018-11-15T05:45:45Z

This looks nice but I don't think this example is representative.

In this case, simply setnames(x, c('mpg', 'cyl'), c('a', 'b')) is, I think, canonical and sufficient... could you flesh out a better use case?

renkun-ken · 2018-11-15T07:37:35Z

Sorry, the code was too trivial to be a valid use case; it was only meant as a demonstration. Here's my actual use case:

Suppose we have a big data.table of grouped data with many indexed columns like x1, x2, ...

library(data.table)

dt <- data.table(group = rep(1:1000, each = 10000))
dt[, paste0("x", 1:10) := lapply(1:10, function(i) rnorm(.N))]
dt[, paste0("y", 1:10) := lapply(1:10, function(i) rnorm(.N))]

dt[, paste0("p", 1:10) := lapply(1:10, function(i) rnorm(.N))]
dt[, paste0("q", 1:10) := lapply(1:10, function(i) rnorm(.N))]

We need to pick out a number of corresponding columns for each index to calculate new columns, e.g.,

xy(i) = group size * (x(i) - y(i)) / (x(i) + y(i))

therefore we iterate over indices:

system.time({
  for (i in 1:10) {
    dt[, paste0("xy", i) := .N * (get(paste0("x", i)) - get(paste0("y", i))) / (get(paste0("x", i)) + get(paste0("y", i))), by = group]
  }
})

which looks a bit redundant and costs us

   user  system elapsed 
  7.937   0.630   5.343

The overhead probably comes from calling get() and from subsetting every other column of dt for each group, which happens because get() appears in j. To simplify, rewrite it as:

system.time({
  for (i in 1:10) {
    dt[, paste0("xy", i) := {
      xi <- get(paste0("x", i))
      yi <- get(paste0("y", i))
      .N * (xi - yi) / (xi + yi)
    }, by = group]
  }
})

   user  system elapsed 
  8.394   0.347   5.380

Using .SDcols and .SD can reduce unnecessary subsetting:

system.time({
  for (i in 1:10) {
    dt[, paste0("xy", i) := .N * (.SD[[1L]] - .SD[[2L]]) / (.SD[[1L]] + .SD[[2L]]), by = group, .SDcols = c(paste0("x", i), paste0("y", i))]
  }
})

   user  system elapsed 
  3.587   0.012   1.704

But the code becomes unreadable, especially when more columns are involved. Consider (.SD[[1]] + .SD[[2]]) * (.SD[[2]] - .SD[[3]] * .SD[[1]]) * (.SD[[1]] - .SD[[3]])
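One way to recover readable names with the current API is to re-label .SD inside j. The following is only a minimal sketch on a small made-up table (the data values and the local name `s` are assumptions for illustration, not part of the benchmark above), and the per-group setNames() call adds some overhead of its own:

```r
library(data.table)

# Small illustrative table; column names mirror the x1/y1 pattern above.
dt <- data.table(group = rep(1:2, each = 3),
                 x1 = c(1, 2, 3, 4, 5, 6),
                 y1 = c(6, 5, 4, 3, 2, 1))

i <- 1
dt[, paste0("xy", i) := {
  # Re-label the selected columns inside j only; dt's own names are untouched.
  s <- setNames(as.list(.SD), c("xi", "yi"))
  .N * (s$xi - s$yi) / (s$xi + s$yi)
}, by = group, .SDcols = c(paste0("x", i), paste0("y", i))]
```

Because .SDcols still restricts .SD to the two needed columns, this keeps the subsetting benefit, but it is more verbose than the named-.SDcols form requested in this issue.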

Using setnames in each iteration makes the code more readable:

system.time({
  for (i in 1:10) {
    old_names <- c(paste0("x", i), paste0("y", i))
    new_names <- c("xi", "yi")
    setnames(dt, old_names, new_names)
    dt[, paste0("xy", i) := .N * (xi - yi) / (xi + yi), by = group]
    setnames(dt, new_names, old_names)
  }
})

and its performance is good too:

   user  system elapsed 
  3.412   0.294   1.505

But the drawback is that when an error occurs in j, the names of dt are left in the renamed (corrupted) state.
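The names could be protected by restoring them in a tryCatch(..., finally = ...) clause. This is a sketch under assumptions (a small made-up table and a deliberately failing j), not something proposed in the thread:

```r
library(data.table)

dt <- data.table(group = rep(1:2, each = 3),
                 x1 = c(1, 2, 3, 4, 5, 6),
                 y1 = c(6, 5, 4, 3, 2, 1))

i <- 1
old_names <- c(paste0("x", i), paste0("y", i))
new_names <- c("xi", "yi")

setnames(dt, old_names, new_names)
result <- tryCatch(
  # Simulated failure in j, standing in for a real bug in the expression.
  dt[, paste0("xy", i) := stop("simulated failure in j"), by = group],
  error = function(e) conditionMessage(e),
  # finally runs on success and on error, so the original names come back.
  finally = setnames(dt, new_names, old_names)
)
```

This avoids leaving dt renamed after a failure, at the cost of extra boilerplate in every iteration.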

Therefore I suggest the following usage:

for (i in 1:10) {
  dt[, paste0("xy", i) := .N * (.SD$xi - .SD$yi) / (.SD$xi + .SD$yi), by = group,
    .SDcols = c(xi = paste0("x", i), yi = paste0("y", i))]
}

which is quite similar to the idea suggested in #2884 but looks easier to implement.

franknarf1 · 2018-11-15T09:02:16Z

Related:
#1803 (comment)

MichaelChirico · 2024-08-22T06:15:32Z

I am marking this as a duplicate of #5020 -- I think it's the same FR, and there's a lot more discussion there. I also think the new env= at least partially obviates this.

I haven't read super carefully, so please feel free to re-open if you think there's some separation to be had between the requests.
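For readers who have not used env=, the loop from the use case above might be written roughly as follows. This is a sketch that assumes a data.table version providing the env argument (1.15.0 or later, as far as I know) and a small made-up table; the placeholder names out, xi and yi are arbitrary:

```r
library(data.table)  # assumes a version that supports the env= argument

dt <- data.table(group = rep(1:2, each = 3),
                 x1 = c(1, 2, 3, 4, 5, 6), y1 = c(6, 5, 4, 3, 2, 1),
                 x2 = c(2, 4, 6, 8, 10, 12), y2 = c(1, 1, 1, 1, 1, 1))

for (i in 1:2) {
  # Character values in env are substituted into j as column names.
  dt[, out := .N * (xi - yi) / (xi + yi), by = group,
     env = list(out = paste0("xy", i),
                xi  = paste0("x", i),
                yi  = paste0("y", i))]
}
```

Since the substitution happens before j is evaluated, j refers to the real columns directly, with no get() and no renaming of dt.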

jangorecki added the programming parameterizing queries: get, mget, eval, env label Apr 5, 2020

MichaelChirico mentioned this issue Apr 30, 2020

[Request] referring to columns by new names in .SDcols #1803

Closed

MichaelChirico closed this as completed Aug 22, 2024
