fix(functions-aggregate): drain CORR state vectors for streaming aggregation by geoffreyclaude · Pull Request #19669 · apache/datafusion

geoffreyclaude · 2026-01-06T16:08:03Z

Which issue does this PR close?

N/A

Rationale for this change

This change addresses a failure in the CORR aggregate function when running in streaming mode. The CorrelationGroupsAccumulator (introduced in PR #13581) was failing to drain its state vectors during EmitTo::First calls, causing internal state to persist across emissions. This led to memory leaks, incorrect results for subsequent groups, and "length mismatch" errors because the internal vector sizes diverged from the number of emitted groups.
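To make the failure mode concrete, here is a minimal, self-contained sketch. It is a simplified model, not the DataFusion code: `EmitTo` is a local stand-in for the real enum, and `Counts`, `emit_without_drain`, and `emit_with_drain` are hypothetical names used only for illustration. It assumes the buggy path read the first `n` groups without removing them from the state vector.

```rust
/// Simplified stand-in for DataFusion's `EmitTo` (illustration only).
#[derive(Clone, Copy)]
enum EmitTo {
    All,
    First(usize),
}

/// Hypothetical per-group state: one slot per group index.
struct Counts {
    values: Vec<u64>,
}

impl Counts {
    /// Modeled bug: copies the requested groups but leaves the state untouched,
    /// so already-emitted groups are still present on the next call.
    fn emit_without_drain(&self, emit_to: EmitTo) -> Vec<u64> {
        match emit_to {
            EmitTo::All => self.values.clone(),
            EmitTo::First(n) => self.values[..n.min(self.values.len())].to_vec(),
        }
    }

    /// Modeled fix: removes the emitted groups so that the remaining entries
    /// line up with the groups the caller still tracks.
    fn emit_with_drain(&mut self, emit_to: EmitTo) -> Vec<u64> {
        match emit_to {
            EmitTo::All => std::mem::take(&mut self.values),
            EmitTo::First(n) => {
                let n = n.min(self.values.len());
                self.values.drain(..n).collect()
            }
        }
    }
}
```

In this model, emitting `First(2)` out of three groups and then `All` yields three values for the single remaining group on the non-draining path, which is the kind of length divergence described above. The draining path yields exactly one.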

Reproducer

-- Set up data
CREATE TABLE stream_test (
    g INT,
    x DOUBLE,
    y DOUBLE
) AS VALUES
(1, 1.0, 1.0), (1, 2.0, 2.0),
(2, 1.0, 5.0), (2, 2.0, 5.0),
(3, 1.0, 1.0), (3, 2.0, 2.0);

-- Trigger streaming aggregation via sorted subquery
SELECT
  g,
  CORR(x, y)
FROM (SELECT * FROM stream_test ORDER BY g LIMIT 10000)
GROUP BY g
ORDER BY g;

Before: DataFusion error: Arrow error: Invalid argument error: all columns in a record batch must have the same length

After:

1 1
2 NULL
3 1
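The expected values can be checked by hand with the textbook Pearson formula. The sketch below is only an illustration under stated assumptions: `pearson` is a hypothetical helper, not the accumulator's code, and it models the NULL for group 2 by returning `None` when either input has zero variance (there, `y` is constant at 5.0).

```rust
/// Textbook Pearson correlation computed from running sums.
/// Returns `None` when the inputs differ in length or either variance is zero.
fn pearson(xs: &[f64], ys: &[f64]) -> Option<f64> {
    if xs.len() != ys.len() {
        return None;
    }
    let n = xs.len() as f64;
    let (mut sx, mut sy, mut sxy, mut sxx, mut syy) =
        (0.0_f64, 0.0_f64, 0.0_f64, 0.0_f64, 0.0_f64);
    for (&x, &y) in xs.iter().zip(ys) {
        sx += x;
        sy += y;
        sxy += x * y;
        sxx += x * x;
        syy += y * y;
    }
    // Scaled variances: n * sum(x^2) - (sum x)^2, and likewise for y.
    let var_x = n * sxx - sx * sx;
    let var_y = n * syy - sy * sy;
    if var_x <= 0.0 || var_y <= 0.0 {
        return None; // a constant (or empty) input has no defined correlation
    }
    Some((n * sxy - sx * sy) / (var_x * var_y).sqrt())
}
```

For groups 1 and 3, with x = (1, 2) and y = (1, 2), the numerator is 2·5 − 3·3 = 1 and the denominator is √(1·1) = 1, giving 1. For group 2, the y term is 2·50 − 10² = 0, so no correlation is defined.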

What changes are included in this PR?

This PR is structured into two commits: the first adds a failing test case to demonstrate the issue, and the second implements the fix.

The accumulator now uses emit_to.take_needed() in both evaluate and state to properly consume the emitted portions of the state vectors. Additionally, the size() implementation has been updated to use vector capacity for more accurate memory accounting.
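As a rough sketch of the shape of this change, the following uses a local stand-in for `EmitTo` with a `take_needed` helper, and a hypothetical `CorrState` struct with only three of the state vectors. The field names and the `state`/`size` signatures are simplified assumptions for illustration and do not match the real `GroupsAccumulator` trait.

```rust
use std::mem::size_of;

/// Simplified stand-in for DataFusion's `EmitTo` (illustration only).
#[derive(Clone, Copy)]
enum EmitTo {
    All,
    First(usize),
}

impl EmitTo {
    /// Removes the emitted prefix from `v` and returns it; the tail stays in `v`.
    fn take_needed<T>(&self, v: &mut Vec<T>) -> Vec<T> {
        match *self {
            EmitTo::All => std::mem::take(v),
            EmitTo::First(n) => {
                let n = n.min(v.len());
                let rest = v.split_off(n); // `v` now holds the first `n` items
                std::mem::replace(v, rest) // keep the tail, hand back the head
            }
        }
    }
}

/// Hypothetical, reduced accumulator state: one entry per group in each vector.
struct CorrState {
    count: Vec<u64>,
    sum_x: Vec<f64>,
    sum_y: Vec<f64>,
}

impl CorrState {
    /// Drains the same prefix from every vector so they stay the same length.
    fn state(&mut self, emit_to: EmitTo) -> (Vec<u64>, Vec<f64>, Vec<f64>) {
        (
            emit_to.take_needed(&mut self.count),
            emit_to.take_needed(&mut self.sum_x),
            emit_to.take_needed(&mut self.sum_y),
        )
    }

    /// Memory accounting based on allocated capacity rather than length.
    fn size(&self) -> usize {
        self.count.capacity() * size_of::<u64>()
            + self.sum_x.capacity() * size_of::<f64>()
            + self.sum_y.capacity() * size_of::<f64>()
    }
}
```

Counting capacity instead of length means that a vector that has been drained but still holds a large allocation is reported at its real footprint.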

Are these changes tested?

Yes, a new test case in aggregate.slt triggers streaming aggregation via an ordered subquery. This test previously crashed with an Arrow length mismatch error and now produces correct results.

Are there any user-facing changes?

Yes, SQL queries that trigger streaming aggregation using CORR (typically those with specific ordering requirements) will now succeed instead of failing with a length mismatch error.

martin-g · 2026-01-07T09:21:09Z

datafusion/sqllogictest/test_files/aggregate.slt

+2 2 NULL
+2 3 NULL
+2 4 NULL
+


It would be good to add a companion EXPLAIN query to verify that it uses the streaming path.

geoffreyclaude · 2026-01-07

I had it at first, but removed it because I found it too verbose. The same goes for a dedicated unit test in correlation.rs, which seemed out of place and only served as a "demo" of the bug.

Adding just the EXPLAIN for CORR seems too specific to me here. However, I think it would make a lot of sense to actually have a dedicated .slt that runs EXPLAIN and the actual query for all aggregates.

@martin-g WDYT?

EDIT: pushed new comprehensive tests in the commit "test: add comprehensive aggregate tests for streaming aggregation".

martin-g · 2026-01-07

Either way is fine as long as there is a way to assert that it behaves the way it is supposed to.

martin-g

LGTM

Jefffrey

I think just need to resolve conflict then should be good to go @geoffreyclaude

Jefffrey · 2026-01-11T07:36:59Z

Thanks @geoffreyclaude & @martin-g

github-actions bot added the "sqllogictest" (SQL Logic Tests (.slt)) and "functions" (Changes to functions implementation) labels on Jan 6, 2026

martin-g reviewed Jan 7, 2026

martin-g approved these changes Jan 7, 2026

Jefffrey approved these changes Jan 10, 2026

Jefffrey approved these changes Jan 11, 2026

geoffreyclaude added 3 commits January 11, 2026 08:04

test: demonstrate failure in CORR streaming aggregation

6e16f18

fix: drain CORR state vectors on EmitTo::First in streaming aggregation

23d35cf

test: add comprehensive aggregate tests for streaming aggregation

d2e9888

geoffreyclaude force-pushed the fix/corr branch from d2bbcb0 to d2e9888 on January 11, 2026 07:18

Jefffrey added this pull request to the merge queue Jan 11, 2026

Merged via the queue into apache:main with commit 0c5c97b Jan 11, 2026
28 checks passed

Jefffrey merged 3 commits into apache:main from geoffreyclaude:fix/corr
