Support renaming columns in .SD by supplying named character vector to .SDcols #3146

Closed
renkun-ken opened this issue Nov 15, 2018 · 5 comments
Labels
programming parameterizing queries: get, mget, eval, env

Comments

@renkun-ken
Member

It would be nice to support functionality like

library(data.table)
x <- as.data.table(mtcars)
x[, .(sum = .SD$a + .SD$b + .SD$disp), by = vs, .SDcols = c(a = "mpg", b = "cyl", "disp")]

so that the columns that a and b point to can be determined dynamically while keeping j readable and fast. This is in contrast to using get(), which forces .SD to include all columns. The performance impact can be significant when the operation is done by group.

@jangorecki
Member

jangorecki commented Nov 15, 2018

OK, now I get it. I hadn't noticed that .SDcols was a named vector.

@MichaelChirico
Member

This looks nice but I don't think this example is representative.

In this case, simply setnames(x, c('mpg', 'cyl'), c('a', 'b')) is, I think, canonical and sufficient... could you flesh out a better use case?

@renkun-ken
Member Author

renkun-ken commented Nov 15, 2018

Sorry, the code above is too trivial to be a valid use case; it was only meant as a demonstration. Here's my actual use case:

Suppose we have a big data.table of grouped data with many indexed columns like x1, x2, ...

library(data.table)

dt <- data.table(group = rep(1:1000, each = 10000))
dt[, paste0("x", 1:10) := lapply(1:10, function(i) rnorm(.N))]
dt[, paste0("y", 1:10) := lapply(1:10, function(i) rnorm(.N))]

dt[, paste0("p", 1:10) := lapply(1:10, function(i) rnorm(.N))]
dt[, paste0("q", 1:10) := lapply(1:10, function(i) rnorm(.N))]

We need to pick out a number of corresponding columns for each index to calculate new columns, e.g.,

xy(i) = group size * (x(i) - y(i)) / (x(i) + y(i))

therefore we iterate over indices:

system.time({
  for (i in 1:10) {
    dt[, paste0("xy", i) := .N * (get(paste0("x", i)) - get(paste0("y", i))) / (get(paste0("x", i)) + get(paste0("y", i))), by = group]
  }
})

which looks a bit redundant and costs us

   user  system elapsed 
  7.937   0.630   5.343 

The overhead probably comes from calling get() and from subsetting all the other columns of dt for each group, which happens whenever get() appears in j. To simplify, rewrite it as:

system.time({
  for (i in 1:10) {
    dt[, paste0("xy", i) := {
      xi <- get(paste0("x", i))
      yi <- get(paste0("y", i))
      .N * (xi - yi) / (xi + yi)
    }, by = group]
  }
})
   user  system elapsed 
  8.394   0.347   5.380 

Using .SDcols and .SD can reduce unnecessary subsetting:

system.time({
  for (i in 1:10) {
    dt[, paste0("xy", i) := .N * (.SD[[1L]] - .SD[[2L]]) / (.SD[[1L]] + .SD[[2L]]), by = group, .SDcols = c(paste0("x", i), paste0("y", i))]
  }
})
   user  system elapsed 
  3.587   0.012   1.704 

But the code becomes unreadable, especially when more columns are involved. Think about (.SD[[1]] + .SD[[2]]) * (.SD[[2]] - .SD[[3]] * .SD[[1]]) * (.SD[[1]] - .SD[[3]])
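
A possible workaround with existing features, offered only as a sketch and not benchmarked here: relabel a plain-list copy of .SD inside j with base R's setNames() and evaluate the formula with with(). The small table and the labels xi and yi are illustrative assumptions.

```r
library(data.table)

# Small illustrative table; sizes and values are arbitrary.
dt <- data.table(group = rep(1:3, each = 4),
                 x1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
                 y1 = c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9, 12, 11))

i <- 1
dt[, paste0("xy", i) := {
  # as.list() yields a plain list, so the relabelling below touches only
  # this local list and never the column names of dt itself.
  cols <- setNames(as.list(.SD), c("xi", "yi"))
  # The formula is now written with readable names instead of .SD[[k]].
  .N * with(cols, (xi - yi) / (xi + yi))
}, by = group, .SDcols = c(paste0("x", i), paste0("y", i))]
```

The labels follow the order of .SDcols, so the two vectors have to be kept in sync by hand, which is the bookkeeping a named .SDcols would remove.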

Using setnames in each iteration makes the code more readable:

system.time({
  for (i in 1:10) {
    old_names <- c(paste0("x", i), paste0("y", i))
    new_names <- c("xi", "yi")
    setnames(dt, old_names, new_names)
    dt[, paste0("xy", i) := .N * (xi - yi) / (xi + yi), by = group]
    setnames(dt, new_names, old_names)
  }
})

and its performance is good too:

   user  system elapsed 
  3.412   0.294   1.505 

But the drawback is that when an error occurs in j, the names of dt are left corrupted.
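
The corrupted-names problem can be reduced with base R's on.exit(), which runs even when j fails. The following is a minimal sketch; with_temp_names is a hypothetical helper written for this illustration, not part of data.table.

```r
library(data.table)

# Hypothetical helper: temporarily rename columns by reference, evaluate
# `code`, and always restore the original names, even if `code` errors.
with_temp_names <- function(dt, old, new, code) {
  setnames(dt, old, new)
  on.exit(setnames(dt, new, old), add = TRUE)
  # `code` is a lazily evaluated argument, so it runs only here, after the
  # rename has taken effect.
  invisible(force(code))
}

dt <- data.table(group = rep(1:2, each = 3),
                 x1 = as.numeric(1:6), y1 = as.numeric(6:1))

# Normal use: j refers to the temporary names xi and yi.
with_temp_names(dt, c("x1", "y1"), c("xi", "yi"),
  dt[, xy1 := .N * (xi - yi) / (xi + yi), by = group])

# An error in j: the on.exit() handler still restores x1 and y1.
try(with_temp_names(dt, c("x1", "y1"), c("xi", "yi"),
  dt[, bad := stop("failure in j"), by = group]), silent = TRUE)

names(dt)
```

This still renames the shared table by reference while j runs, so it only addresses the error case, not the readability of passing the mapping in the call itself.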

Therefore I suggest the following usage:

for (i in 1:10) {
  dt[, paste0("xy", i) := .N * (.SD$xi - .SD$yi) / (.SD$xi + .SD$yi), by = group,
    .SDcols = c(xi = paste0("x", i), yi = paste0("y", i))]
}

which is quite similar to the idea suggested in #2884 but looks easier to implement.

@franknarf1
Contributor

Related:
#1803 (comment)

@jangorecki jangorecki added the programming parameterizing queries: get, mget, eval, env label Apr 5, 2020
@MichaelChirico
Member

I am marking this as a duplicate of #5020 -- I think it's the same FR, and there's a lot more discussion there. I also think the new env= at least partially obviates this.

I haven't read super carefully, so please feel free to re-open if you think there's some separation to be had between the requests.
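
For comparison, here is a minimal sketch of how the env= argument mentioned above might express the loop from this issue. It assumes data.table 1.15.0 or later, where env is available; the placeholder names out, xi and yi are illustrative, and character values in env are substituted into j as column names.

```r
library(data.table)

# Small illustrative table with two indexed column pairs.
dt <- data.table(group = rep(1:2, each = 3),
                 x1 = as.numeric(2:7), y1 = as.numeric(6:1),
                 x2 = as.numeric(3:8), y2 = as.numeric(6:1))

for (i in 1:2) {
  # j is written once with readable placeholders; env maps each
  # placeholder to the real column name for this iteration.
  dt[, out := .N * (xi - yi) / (xi + yi), by = group,
     env = list(out = paste0("xy", i),
                xi  = paste0("x", i),
                yi  = paste0("y", i))]
}
```

Because the substitution happens before j is evaluated, neither get() nor .SD appears in the query.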
