Support renaming columns in .SD by supplying named character vector to .SDcols #3146
Comments
OK, now I got it. Haven't seen |
This looks nice but I don't think this example is representative. In this case, simply |
Sorry, the code was too trivial to be a valid use case; it was only meant as a demonstration. Here's my use case: suppose we have a big data.table of grouped data with many indexed columns, like

library(data.table)
dt <- data.table(group = rep(1:1000, each = 10000))
dt[, paste0("x", 1:10) := lapply(1:10, function(i) rnorm(.N))]
dt[, paste0("y", 1:10) := lapply(1:10, function(i) rnorm(.N))]
dt[, paste0("p", 1:10) := lapply(1:10, function(i) rnorm(.N))]
dt[, paste0("q", 1:10) := lapply(1:10, function(i) rnorm(.N))]

We need to pick out a number of corresponding columns for each index to calculate new columns, e.g.,

xy(i) = group size * (x(i) - y(i)) / (x(i) + y(i))

Therefore we iterate over indices:

system.time({
  for (i in 1:10) {
    dt[, paste0("xy", i) := .N * (get(paste0("x", i)) - get(paste0("y", i))) / (get(paste0("x", i)) + get(paste0("y", i))), by = group]
  }
})

which looks a bit redundant and costs us

The overhead can be calling

system.time({
  for (i in 1:10) {
    dt[, paste0("xy", i) := {
      xi <- get(paste0("x", i))
      yi <- get(paste0("y", i))
      .N * (xi - yi) / (xi + yi)
    }, by = group]
  }
})

Using

system.time({
  for (i in 1:10) {
    dt[, paste0("xy", i) := .N * (.SD[[1L]] - .SD[[2L]]) / (.SD[[1L]] + .SD[[2L]]), by = group, .SDcols = c(paste0("x", i), paste0("y", i))]
  }
})

But the code becomes unreadable, especially when more columns are involved. Think about

Using

system.time({
  for (i in 1:10) {
    old_names <- c(paste0("x", i), paste0("y", i))
    new_names <- c("xi", "yi")
    setnames(dt, old_names, new_names)
    dt[, paste0("xy", i) := .N * (xi - yi) / (xi + yi), by = group]
    setnames(dt, new_names, old_names)
  }
})

and its performance is good too:

But the drawback is that when an error occurs in

Therefore I suggest the following usage:

for (i in 1:10) {
  dt[, paste0("xy", i) := .N * (.SD$xi - .SD$yi) / (.SD$xi + .SD$yi), by = group,
     .SDcols = c(xi = paste0("x", i), yi = paste0("y", i))]
}

which is quite similar to the idea suggested in #2884 but looks easier to implement. |
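For comparison, here is a minimal sketch of how the proposed named-vector usage can be approximated without renaming anything in dt: restrict .SD with .SDcols, then attach the aliases to a per-group list inside j. The alias map `cols` and the local list `v` are hypothetical names used for illustration only, and the table is a small stand-in for the large one above. This is not the behaviour requested in this issue, just one possible workaround.

```r
library(data.table)

set.seed(42)
# Small stand-in for the large grouped table in the comment above
dt <- data.table(group = rep(1:3, each = 4))
dt[, paste0("x", 1:2) := lapply(1:2, function(i) rnorm(.N))]
dt[, paste0("y", 1:2) := lapply(1:2, function(i) rnorm(.N))]

for (i in 1:2) {
  # Hypothetical alias map: names are the aliases, values are the real columns
  cols <- c(xi = paste0("x", i), yi = paste0("y", i))
  dt[, paste0("xy", i) := {
    # Attach the aliases to the per-group columns; dt's own names are untouched
    v <- setNames(as.list(.SD), names(cols))
    .N * (v$xi - v$yi) / (v$xi + v$yi)
  }, by = group, .SDcols = unname(cols)]
}
```

Because the aliases exist only on the per-group list, an error raised inside j cannot leave dt with renamed columns, and .SD still contains only the two columns named in .SDcols.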
Related: |
I am marking this as a duplicate of #5020 -- I think it's the same FR, and there's a lot more discussion there. I also think the new I haven't read super carefully, so please feel free to re-open if you think there's some separation to be had between the requests. |
It would be nice to support functionality like
so that the columns a and b point to can be easily determined dynamically while retaining the readability and performance in j, in contrast with using get(), which forces .SD to include all columns. The performance impact can be significant when the operation is done by group.