test: Add reservoir sample e2e tests by allenshen13 · Pull Request #27858 · prestodb/presto

allenshen13 · 2026-05-22T14:56:20Z

Description

Add e2e aggregation tests verifying reservoir_sample output size, processed_count, null handling, grouping behavior, and deterministic behavior on constant inputs.

Motivation and Context

Impact

Test Plan

Contributor checklist

Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
Documented new properties (with its default value), SQL syntax, functions, or other functionality.
If release notes are required, they follow the release notes guidelines.
Adequate tests were added if applicable.
CI passed.
If adding new dependencies, verified they have an OpenSSF Scorecard score of 5.0 or higher (or obtained explicit TSC approval for lower scores).

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== NO RELEASE NOTE ==

Summary by Sourcery

Tests:

Add e2e aggregation tests verifying reservoir_sample output size, processed_count, null handling, grouping behavior, and deterministic behavior on constant inputs.

sourcery-ai · 2026-05-22T14:56:27Z

Reviewer's Guide

Adds end-to-end tests for the reservoir_sample aggregation to validate basic deterministic properties of its output over the orders table, including processed_count behavior, sample size limits, value domain, handling of constant and null inputs, and group-by semantics.

File-Level Changes

Change	Details	Files
Add end-to-end tests for the reservoir_sample aggregation function in the aggregation test suite.	Introduce testReservoirSample JUnit test method in AbstractTestAggregations Assert processed_count equals number of processed rows with and without null inputs Verify sample cardinality when input size is greater than or less than desired sample size Check that all sampled values lie within the expected positive domain Confirm deterministic behavior when sampling from a constant input Validate processed_count semantics under GROUP BY by comparing with count(*) per group	`presto-tests/src/main/java/com/facebook/presto/tests/AbstractTestAggregations.java`

Tips and commands

Interacting with Sourcery

Trigger a new review: Comment @sourcery-ai review on the pull request.
Continue discussions: Reply directly to Sourcery's review comments.
Generate a GitHub issue from a review comment: Ask Sourcery to create an
issue from a review comment by replying to it. You can also reply to a
review comment with @sourcery-ai issue to create an issue from it.
Generate a pull request title: Write @sourcery-ai anywhere in the pull
request title to generate a title at any time. You can also comment
@sourcery-ai title on the pull request to (re-)generate the title at any time.
Generate a pull request summary: Write @sourcery-ai summary anywhere in
the pull request body to generate a PR summary at any time exactly where you
want it. You can also comment @sourcery-ai summary on the pull request to
(re-)generate the summary at any time.
Generate reviewer's guide: Comment @sourcery-ai guide on the pull
request to (re-)generate the reviewer's guide at any time.
Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
pull request to resolve all Sourcery comments. Useful if you've already
addressed all the comments and don't want to see them anymore.
Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
request to dismiss all existing Sourcery reviews. Especially useful if you
want to start fresh with a new review - don't forget to comment
@sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

Enable or disable review features such as the Sourcery-generated pull request
summary, the reviewer's guide, and others.
Change the review language.
Add, remove or edit custom review instructions.
Adjust other review settings.

Getting Help

Contact our support team for questions or feedback.
Visit our documentation for detailed guides and information.
Keep in touch with the Sourcery team by following us on X/Twitter, LinkedIn or GitHub.

sourcery-ai

Hey - I've found 2 issues, and left some high level feedback:

The tests hardcode assumptions about the scale and row count of the orders table (e.g., SELECT 15000), which could make them brittle if the test dataset changes; consider expressing these expectations in terms of count(*) queries instead of fixed constants.
Several assertions scan the full orders table independently, which may slow down the e2e suite; consider combining related checks into fewer queries or using WITH subqueries / CTEs to reuse a single aggregation where possible.

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- The tests hardcode assumptions about the scale and row count of the `orders` table (e.g., `SELECT 15000`), which could make them brittle if the test dataset changes; consider expressing these expectations in terms of `count(*)` queries instead of fixed constants.
- Several assertions scan the full `orders` table independently, which may slow down the e2e suite; consider combining related checks into fewer queries or using `WITH` subqueries / CTEs to reuse a single aggregation where possible.

## Individual Comments

### Comment 1
<location path="presto-tests/src/main/java/com/facebook/presto/tests/AbstractTestAggregations.java" line_range="1388" />
<code_context>
+     * specific elements. Comprehensive correctness lives in TestReservoirSampleAggregation.
+     */
+    @Test
+    public void testReservoirSample()
+    {
+        // processed_count equals the number of input rows when no initial sample is supplied.
</code_context>
<issue_to_address>
**suggestion (testing):** Add coverage for non-null initial sample and non-zero initial processed_count arguments

Current tests only call the 4‑arg `reservoir_sample` with a `NULL` initial sample and `0` `processed_count`. Please add at least one end‑to‑end test that seeds the aggregator with a non‑empty initial sample and non‑zero `processed_count`, then verifies that the final `processed_count` and sample size account for both the seed and the new rows (i.e., merge/resume behavior).

Suggested implementation:

```java
        // The sample is capped at the desired size when the input is larger.
        assertQuery(
                "SELECT cardinality(rs.sample) FROM (SELECT reservoir_sample(NULL, 0, orderkey, 10) AS rs FROM orders) t",
                "SELECT 10");

        // A non-empty initial sample and non-zero initial processed_count are incorporated into the final state.
        // Seed the aggregation with an initial sample of two elements and a processed_count of 2, then process three
        // more rows. The final processed_count should be 5, and the sample should contain five elements (the seed
        // plus the new rows, with no eviction because total < desired_sample_size).
        assertQuery(
                "SELECT rs.processed_count, cardinality(rs.sample) " +
                        "FROM (" +
                        "SELECT reservoir_sample(ARRAY[1, 2], 2, x, 5) AS rs " +
                        "FROM (VALUES 10, 20, 30) t(x)" +
                        ") u",
                "VALUES (5, 5)");

        // When the input is smaller than the desired size the entire input is kept.

```

If there are existing style or helper methods in `AbstractTestAggregations` for building VALUES queries or for asserting single-row results, you may want to adapt the new assertion to use those helpers for consistency. Also, if `reservoir_sample` has separate overloads for different types, adjust the `ARRAY[1, 2]` and `VALUES 10, 20, 30` literals to match the preferred test type (e.g., BIGINT versus INTEGER) used elsewhere in this test class.
</issue_to_address>

### Comment 2
<location path="presto-tests/src/main/java/com/facebook/presto/tests/AbstractTestAggregations.java" line_range="1405-1413" />
<code_context>
+                "SELECT cardinality(filter(rs.sample, x -> x > 0)) = cardinality(rs.sample) FROM (SELECT reservoir_sample(NULL, 0, orderkey, 20) AS rs FROM orders) t",
+                "SELECT true");
+
+        // A constant input yields a deterministic sample of that constant.
+        assertQuery(
+                "SELECT rs.sample = ARRAY[1.0E0, 1.0E0, 1.0E0] FROM (SELECT reservoir_sample(NULL, 0, 1.0E0, 3) AS rs FROM orders) t",
+                "SELECT true");
+
</code_context>
<issue_to_address>
**suggestion (testing):** Clarify the determinism assumption and/or relax the assertion on the exact array shape

This assertion hard-codes the exact sampled array `[1.0E0, 1.0E0, 1.0E0]` for a constant input, which makes the test fragile if the implementation later changes (e.g., returns fewer elements under an optimization or encodes the constant differently) while still respecting the intended contract. Consider instead asserting properties like `cardinality(rs.sample) = 3` and that all elements equal `1.0E0`. If you truly require full-array determinism for constant inputs, please add a short comment stating that this is an explicit, stable guarantee of the function.

```suggestion
        // Sampled elements are drawn from the input: every sampled orderkey is positive.
        assertQuery(
                "SELECT cardinality(filter(rs.sample, x -> x > 0)) = cardinality(rs.sample) FROM (SELECT reservoir_sample(NULL, 0, orderkey, 20) AS rs FROM orders) t",
                "SELECT true");

        // A constant input yields a sample of the constant value; verify size and contents, not exact array shape.
        assertQuery(
                "SELECT cardinality(rs.sample) = 3 AND cardinality(filter(rs.sample, x -> x = 1.0E0)) = 3 FROM (SELECT reservoir_sample(NULL, 0, 1.0E0, 3) AS rs FROM orders) t",
                "SELECT true");
```
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

czentgr · 2026-05-28T14:42:07Z

@allenshen13 Looks like sourcery had some good suggestion on expanding the testing? Can we take a look at that?
Also please mention that while these are added to presto-tests these tests also get executed for native from presto-native-tests through AbstractTestAggregationsNative which extends the modified class.

This link does not necessarily exist all the time.

Add testReservoirSample to the shared AbstractTestAggregations so both Presto Java and the native/Velox engine exercise the reservoir_sample aggregate. Sampled elements are chosen at random, so the test asserts deterministic properties rather than specific elements: processed_count, sample cardinality (both capped and smaller-than-desired), value domain, a constant-input sample, null counting, and per-group processed_count. Comprehensive correctness lives in TestReservoirSampleAggregation and the Velox unit tests.

allenshen13 requested review from a team, elharo, feilong-liu and jaystarshot as code owners May 22, 2026 14:56

prestodb-ci added the from:IBM PR from IBM label May 22, 2026

prestodb-ci requested a review from a team May 22, 2026 14:56

prestodb-ci requested review from NivinCS and shuangli999 and removed request for a team May 22, 2026 14:56

sourcery-ai Bot reviewed May 22, 2026

View reviewed changes

Comment thread presto-tests/src/main/java/com/facebook/presto/tests/AbstractTestAggregations.java

Comment thread presto-tests/src/main/java/com/facebook/presto/tests/AbstractTestAggregations.java Outdated

allenshen13 changed the title ~~Add reservoir sample e2e tests~~ tests: Add reservoir sample e2e tests May 22, 2026

allenshen13 changed the title ~~tests: Add reservoir sample e2e tests~~ test: Add reservoir sample e2e tests May 22, 2026

allenshen13 force-pushed the add-reservoir-sample-e2e-tests branch from 3481b37 to da40401 Compare May 22, 2026 15:25

allenshen13 requested a review from jkhaliqi May 28, 2026 14:28

allenshen13 force-pushed the add-reservoir-sample-e2e-tests branch from da40401 to 4c60ed0 Compare June 4, 2026 15:49

allenshen13 force-pushed the add-reservoir-sample-e2e-tests branch from 4c60ed0 to eb3d29c Compare June 8, 2026 06:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: Add reservoir sample e2e tests#27858

test: Add reservoir sample e2e tests#27858
allenshen13 wants to merge 1 commit into
prestodb:masterfrom
allenshen13:add-reservoir-sample-e2e-tests

allenshen13 commented May 22, 2026 •

edited

Loading

Uh oh!

sourcery-ai Bot commented May 22, 2026 •

edited

Loading

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

sourcery-ai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

czentgr commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

allenshen13 commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Motivation and Context

Impact

Test Plan

Contributor checklist

Release Notes

Summary by Sourcery

Uh oh!

sourcery-ai Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviewer's Guide

File-Level Changes

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

sourcery-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

czentgr commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

allenshen13 commented May 22, 2026 •

edited

Loading

sourcery-ai Bot commented May 22, 2026 •

edited

Loading