GH-43627: [R] Fix summarize() performance regression (pushdown) (#43649)

### Rationale for this change

See #43627 (comment)

### What changes are included in this PR?

An extra `dplyr::select()`

### Are these changes tested?

Conbench should show that the performance is much better

### Are there any user-facing changes?

Not slow

* GitHub Issue: #43627
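
As a rough illustration of the column-pruning idea behind the extra `dplyr::select()`, the sketch below reproduces the variable-collection step in base R only. The expressions, grouping variable, and column names are hypothetical stand-ins, not taken from the arrow code base; the real change operates on the query object and uses `dplyr::group_vars()` and `dplyr::select()` as shown in the diff.

```r
# Hypothetical summarize() expressions, e.g. summarize(total = sum(x * y), avg = mean(z))
exprs <- list(
  total = quote(sum(x * y)),
  avg = quote(mean(z))
)

# Hypothetical grouping variable and source columns (assumptions for this sketch)
group_vars <- c("g")
col_names <- c("g", "x", "y", "z", "unused1", "unused2")

# Collect every symbol referenced by the expressions, plus the grouping vars
vars_to_keep <- unique(c(
  unlist(lapply(exprs, all.vars)), # vars referenced in the expressions
  group_vars # vars needed for grouping
))

# Keep only names that are real columns; function names such as sum() and
# mean() are not returned by all.vars(), and unknown symbols are dropped here
keep <- intersect(vars_to_keep, col_names)
print(keep)
```

In this sketch `unused1` and `unused2` never make it into `keep`, which is the property that lets a projection placed before the aggregation skip columns the query does not need.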
apache · Aug 14, 2024 · ab432b1
1 parent 7c8909a
Showing 1 changed file with 9 additions and 0 deletions.
diff --git a/r/R/dplyr-summarize.R b/r/R/dplyr-summarize.R
@@ -43,6 +43,15 @@ do_arrow_summarize <- function(.data, ..., .groups = NULL) {
     hash = length(.data$group_by_vars) > 0
   )
 
+  # Do a projection here to keep only the columns we need in summarize().
+  # If possible, this will push down the column selection into the SourceNode,
+  # saving lots of wasted processing for columns we don't need. (GH-43627)
+  vars_to_keep <- unique(c(
+    unlist(lapply(exprs, all.vars)), # vars referenced in summarize
+    dplyr::group_vars(.data) # vars needed for grouping
+  ))
+  .data <- dplyr::select(.data, intersect(vars_to_keep, names(.data)))
+
   # nolint start
   # summarize() is complicated because you can do a mixture of scalar operations
   # and aggregations, but that's not how Acero works. For example, for us to do