Skip to content

test: Add reservoir sample e2e tests#27858

Open
allenshen13 wants to merge 1 commit into
prestodb:masterfrom
allenshen13:add-reservoir-sample-e2e-tests
Open

test: Add reservoir sample e2e tests#27858
allenshen13 wants to merge 1 commit into
prestodb:masterfrom
allenshen13:add-reservoir-sample-e2e-tests

Conversation

@allenshen13
Copy link
Copy Markdown
Member

@allenshen13 allenshen13 commented May 22, 2026

Description

Add e2e aggregation tests verifying reservoir_sample output size, processed_count, null handling, grouping behavior, and deterministic behavior on constant inputs.

Motivation and Context

Impact

Test Plan

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.
  • If adding new dependencies, verified they have an OpenSSF Scorecard score of 5.0 or higher (or obtained explicit TSC approval for lower scores).

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== NO RELEASE NOTE ==

Summary by Sourcery

Tests:

  • Add e2e aggregation tests verifying reservoir_sample output size, processed_count, null handling, grouping behavior, and deterministic behavior on constant inputs.

@prestodb-ci prestodb-ci added the from:IBM PR from IBM label May 22, 2026
@prestodb-ci prestodb-ci requested a review from a team May 22, 2026 14:56
@sourcery-ai
Copy link
Copy Markdown
Contributor

sourcery-ai Bot commented May 22, 2026

Reviewer's Guide

Adds end-to-end tests for the reservoir_sample aggregation to validate basic deterministic properties of its output over the orders table, including processed_count behavior, sample size limits, value domain, handling of constant and null inputs, and group-by semantics.

File-Level Changes

Change Details Files
Add end-to-end tests for the reservoir_sample aggregation function in the aggregation test suite.
  • Introduce testReservoirSample JUnit test method in AbstractTestAggregations
  • Assert processed_count equals number of processed rows with and without null inputs
  • Verify sample cardinality when input size is greater than or less than desired sample size
  • Check that all sampled values lie within the expected positive domain
  • Confirm deterministic behavior when sampling from a constant input
  • Validate processed_count semantics under GROUP BY by comparing with count(*) per group
presto-tests/src/main/java/com/facebook/presto/tests/AbstractTestAggregations.java

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@prestodb-ci prestodb-ci requested review from NivinCS and shuangli999 and removed request for a team May 22, 2026 14:56
Copy link
Copy Markdown
Contributor

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 2 issues, and left some high level feedback:

  • The tests hardcode assumptions about the scale and row count of the orders table (e.g., SELECT 15000), which could make them brittle if the test dataset changes; consider expressing these expectations in terms of count(*) queries instead of fixed constants.
  • Several assertions scan the full orders table independently, which may slow down the e2e suite; consider combining related checks into fewer queries or using WITH subqueries / CTEs to reuse a single aggregation where possible.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The tests hardcode assumptions about the scale and row count of the `orders` table (e.g., `SELECT 15000`), which could make them brittle if the test dataset changes; consider expressing these expectations in terms of `count(*)` queries instead of fixed constants.
- Several assertions scan the full `orders` table independently, which may slow down the e2e suite; consider combining related checks into fewer queries or using `WITH` subqueries / CTEs to reuse a single aggregation where possible.

## Individual Comments

### Comment 1
<location path="presto-tests/src/main/java/com/facebook/presto/tests/AbstractTestAggregations.java" line_range="1388" />
<code_context>
+     * specific elements. Comprehensive correctness lives in TestReservoirSampleAggregation.
+     */
+    @Test
+    public void testReservoirSample()
+    {
+        // processed_count equals the number of input rows when no initial sample is supplied.
</code_context>
<issue_to_address>
**suggestion (testing):** Add coverage for non-null initial sample and non-zero initial processed_count arguments

Current tests only call the 4‑arg `reservoir_sample` with a `NULL` initial sample and `0` `processed_count`. Please add at least one end‑to‑end test that seeds the aggregator with a non‑empty initial sample and non‑zero `processed_count`, then verifies that the final `processed_count` and sample size account for both the seed and the new rows (i.e., merge/resume behavior).

Suggested implementation:

```java
        // The sample is capped at the desired size when the input is larger.
        assertQuery(
                "SELECT cardinality(rs.sample) FROM (SELECT reservoir_sample(NULL, 0, orderkey, 10) AS rs FROM orders) t",
                "SELECT 10");

        // A non-empty initial sample and non-zero initial processed_count are incorporated into the final state.
        // Seed the aggregation with an initial sample of two elements and a processed_count of 2, then process three
        // more rows. The final processed_count should be 5, and the sample should contain five elements (the seed
        // plus the new rows, with no eviction because total < desired_sample_size).
        assertQuery(
                "SELECT rs.processed_count, cardinality(rs.sample) " +
                        "FROM (" +
                        "SELECT reservoir_sample(ARRAY[1, 2], 2, x, 5) AS rs " +
                        "FROM (VALUES 10, 20, 30) t(x)" +
                        ") u",
                "VALUES (5, 5)");

        // When the input is smaller than the desired size the entire input is kept.

```

If there are existing style or helper methods in `AbstractTestAggregations` for building VALUES queries or for asserting single-row results, you may want to adapt the new assertion to use those helpers for consistency. Also, if `reservoir_sample` has separate overloads for different types, adjust the `ARRAY[1, 2]` and `VALUES 10, 20, 30` literals to match the preferred test type (e.g., BIGINT versus INTEGER) used elsewhere in this test class.
</issue_to_address>

### Comment 2
<location path="presto-tests/src/main/java/com/facebook/presto/tests/AbstractTestAggregations.java" line_range="1405-1413" />
<code_context>
+                "SELECT cardinality(filter(rs.sample, x -> x > 0)) = cardinality(rs.sample) FROM (SELECT reservoir_sample(NULL, 0, orderkey, 20) AS rs FROM orders) t",
+                "SELECT true");
+
+        // A constant input yields a deterministic sample of that constant.
+        assertQuery(
+                "SELECT rs.sample = ARRAY[1.0E0, 1.0E0, 1.0E0] FROM (SELECT reservoir_sample(NULL, 0, 1.0E0, 3) AS rs FROM orders) t",
+                "SELECT true");
+
</code_context>
<issue_to_address>
**suggestion (testing):** Clarify the determinism assumption and/or relax the assertion on the exact array shape

This assertion hard-codes the exact sampled array `[1.0E0, 1.0E0, 1.0E0]` for a constant input, which makes the test fragile if the implementation later changes (e.g., returns fewer elements under an optimization or encodes the constant differently) while still respecting the intended contract. Consider instead asserting properties like `cardinality(rs.sample) = 3` and that all elements equal `1.0E0`. If you truly require full-array determinism for constant inputs, please add a short comment stating that this is an explicit, stable guarantee of the function.

```suggestion
        // Sampled elements are drawn from the input: every sampled orderkey is positive.
        assertQuery(
                "SELECT cardinality(filter(rs.sample, x -> x > 0)) = cardinality(rs.sample) FROM (SELECT reservoir_sample(NULL, 0, orderkey, 20) AS rs FROM orders) t",
                "SELECT true");

        // A constant input yields a sample of the constant value; verify size and contents, not exact array shape.
        assertQuery(
                "SELECT cardinality(rs.sample) = 3 AND cardinality(filter(rs.sample, x -> x = 1.0E0)) = 3 FROM (SELECT reservoir_sample(NULL, 0, 1.0E0, 3) AS rs FROM orders) t",
                "SELECT true");
```
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@allenshen13 allenshen13 changed the title Add reservoir sample e2e tests tests: Add reservoir sample e2e tests May 22, 2026
@allenshen13 allenshen13 changed the title tests: Add reservoir sample e2e tests test: Add reservoir sample e2e tests May 22, 2026
@allenshen13 allenshen13 force-pushed the add-reservoir-sample-e2e-tests branch from 3481b37 to da40401 Compare May 22, 2026 15:25
@allenshen13 allenshen13 requested a review from jkhaliqi May 28, 2026 14:28
@czentgr
Copy link
Copy Markdown
Contributor

czentgr commented May 28, 2026

@allenshen13 Looks like sourcery had some good suggestion on expanding the testing? Can we take a look at that?
Also please mention that while these are added to presto-tests these tests also get executed for native from presto-native-tests through AbstractTestAggregationsNative which extends the modified class.

This link does not necessarily exist all the time.

@allenshen13 allenshen13 force-pushed the add-reservoir-sample-e2e-tests branch from da40401 to 4c60ed0 Compare June 4, 2026 15:49
Add testReservoirSample to the shared AbstractTestAggregations so both
Presto Java and the native/Velox engine exercise the reservoir_sample
aggregate. Sampled elements are chosen at random, so the test asserts
deterministic properties rather than specific elements: processed_count,
sample cardinality (both capped and smaller-than-desired), value domain,
a constant-input sample, null counting, and per-group processed_count.
Comprehensive correctness lives in TestReservoirSampleAggregation and the
Velox unit tests.
@allenshen13 allenshen13 force-pushed the add-reservoir-sample-e2e-tests branch from 4c60ed0 to eb3d29c Compare June 8, 2026 06:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

from:IBM PR from IBM

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants