Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dt_unnest() is slow with data.table 1.13.0 update #34

Closed
markfairbanks opened this issue Aug 18, 2020 · 3 comments · Fixed by #35
Closed

dt_unnest() is slow with data.table 1.13.0 update #34

markfairbanks opened this issue Aug 18, 2020 · 3 comments · Fixed by #35

Comments

@markfairbanks
Copy link
Collaborator

There's a bug in data.table v1.13.0 that causes unnesting to be slow if you use the [[1]] method. The issue is still open here, along with the current workaround (which I used below).

Want me to open a PR using the workaround?

library(data.table)

n <- 500L
n_nested <- 40L
test_df <- data.table(
  id = seq_len(n),
  value = replicate(n, data.table(val1 = sample(n_nested)), simplify = FALSE))

updated_unnest <- function(dt_, col){
  if (isFALSE(is.data.table(dt_)))
    dt_ <- as.data.table(dt_)

  # col to unnest
  col <- substitute(col)
  if (length(col) > 1)
    stop("dt_unnest() currently can only unnest a single column at a time", call. = FALSE)

  # Get the others variables in there
  names <- colnames(dt_)
  if(!paste(col) %in% names)
    stop("Could not find `cols` in data.table", call. = FALSE)
  others <- names[-match(paste(col), names)]
  others_class = sapply(others, function(x) class(dt_[[x]])[1L])
  others = others[!others_class %in% c("list", "data.table", "data.frame", "tbl_df")]

  # Join them all together
  dt_[seq_len(.N), eval(col)[[1L]], by = others][dt_, on = others]
}

bench::mark(current = tidyfast::dt_unnest(test_df, value),
            updated = updated_unnest(test_df, value),
            check = FALSE, iterations = 30)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 current     137.7ms  157.9ms      5.87   59.64MB     7.82
#> 2 updated      3.15ms   3.59ms    273.      1.54MB     0
@TysonStanley
Copy link
Owner

Thank you! I thought I had seen something about that. It wasn't the speed up I was hoping for so I'm glad there is a temporary fix on that. As always, you are a fantastic collaborator.

@TysonStanley
Copy link
Owner

And a PR would be great, if you have a moment to do that!

@TysonStanley
Copy link
Owner

Hey @markfairbanks I just added this one since I was already doing some code cleaning so don't worry about a PR here. But if you have any thoughts on #25 let me know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants