GH-43627: [R] Fix summarize() performance regression (pushdown) (#43649)
### Rationale for this change

See #43627 (comment)

### What changes are included in this PR?

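An extra `dplyr::select()` in `do_arrow_summarize()` that keeps only the columns referenced in the `summarize()` expressions plus the grouping variables, so the column selection can be pushed down into the source node where possible.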

### Are these changes tested?

Conbench should show that the performance is much better

### Are there any user-facing changes?

Yes: `summarize()` should no longer be slow.
* GitHub Issue: #43627
nealrichardson authored Aug 14, 2024
1 parent 7c8909a commit ab432b1
Showing 1 changed file with 9 additions and 0 deletions.
9 changes: 9 additions & 0 deletions r/R/dplyr-summarize.R
@@ -43,6 +43,15 @@ do_arrow_summarize <- function(.data, ..., .groups = NULL) {
hash = length(.data$group_by_vars) > 0
)

# Do a projection here to keep only the columns we need in summarize().
# If possible, this will push down the column selection into the SourceNode,
# saving lots of wasted processing for columns we don't need. (GH-43627)
vars_to_keep <- unique(c(
  unlist(lapply(exprs, all.vars)), # vars referenced in summarize
  dplyr::group_vars(.data) # vars needed for grouping
))
.data <- dplyr::select(.data, intersect(vars_to_keep, names(.data)))

# nolint start
# summarize() is complicated because you can do a mixture of scalar operations
# and aggregations, but that's not how Acero works. For example, for us to do
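
The column-collection step added in this hunk can be tried in isolation with base R. The sketch below is illustrative only and is not part of the commit: `exprs`, `group_vars`, and `all_cols` are made-up stand-ins for the captured `summarize()` expressions, `dplyr::group_vars(.data)`, and `names(.data)`, and `scale_factor` is a hypothetical local variable that is not a column.

```r
# Illustrative only: plain base R, no arrow or dplyr required.
# Hypothetical stand-in for the expressions passed to summarize().
exprs <- list(
  mean_x = quote(mean(x, na.rm = TRUE)),
  total = quote(sum(y * scale_factor)) # scale_factor is not a column
)
group_vars <- c("g") # hypothetical grouping column
all_cols <- c("g", "x", "y", "z", "unused_1", "unused_2")

# all.vars() returns the variable names referenced in an expression;
# function names such as mean and sum are excluded by default.
vars_to_keep <- unique(c(
  unlist(lapply(exprs, all.vars)),
  group_vars
))

# Intersecting with the actual column names drops symbols that are not
# columns (here scale_factor), so the projection only names real columns.
cols_to_select <- intersect(vars_to_keep, all_cols)
print(cols_to_select)
```

In this toy case only `x`, `y`, and `g` survive, so `z` and the two unused columns would never need to be read. The `intersect()` call matters because `all.vars()` cannot tell a column reference from a variable in the calling environment.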