ESQL: Add a random sample command #123879

Closed
bpintea wants to merge 9 commits from feat/random_sample

Conversation

bpintea
Contributor

@bpintea bpintea commented Mar 3, 2025

This adds a new command, RANDOM_SAMPLE <probability> <seed>. The command produces as output a random selection of the rows it receives as input, maintaining their input order.

It receives a probability as its first, mandatory parameter, a double in the (0.0, 1.0) interval, and an optional seed, an integer value. The former determines the selectivity of the sampling, the latter its reproducibility. If no seed is provided, a random one will be used.

The command will be "pushed down" to Lucene whenever possible, using the existing RandomSamplingQuery (used by the Random sampler aggregation).
This is as usually the most efficient way to execute the command since this controls which documents will be considered for loading (or further filtering) and it will also run the selection over an entire Lucene segment.

When no "push down" is possible, an ESQL operator caches a number of input pages up to a (minimum) number of rows
(currently equal to the configured maximum result set size) and then samples this batch. Meaning that (1) all the cached pages are first collected in memory (i.e. high memory pressure) and (2), the sampling runs over a significantly reduced document ID space, compared to that of a segment's (i.e. poor sampling with low probabilities).
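
For illustration, a minimal sketch of this fallback behavior, with hypothetical names rather than the actual RandomSampleOperator API: rows are buffered up to a limit, and each buffered row is then kept with the given probability, preserving input order.

import java.util.ArrayList;
import java.util.List;
import java.util.SplittableRandom;

// Sketch only: approximates the fallback described above, not the real operator.
class BufferedSamplerSketch {
    private final double probability;       // must be strictly between 0.0 and 1.0
    private final SplittableRandom random;  // seeded for reproducibility
    private final List<Object[]> buffer = new ArrayList<>();

    BufferedSamplerSketch(double probability, long seed) {
        this.probability = probability;
        this.random = new SplittableRandom(seed);
    }

    void collect(Object[] row) {
        buffer.add(row);                     // (1) everything is buffered first: high memory pressure
    }

    List<Object[]> sample() {
        List<Object[]> out = new ArrayList<>();
        for (Object[] row : buffer) {        // (2) sampling only sees the buffered rows
            if (random.nextDouble() < probability) {
                out.add(row);
            }
        }
        return out;
    }
}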

This adds the RANDOM_SAMPLE <probability> <seed>? command.
@elasticsearchmachine
Collaborator

Hi @bpintea, I've created a changelog YAML for you.

Member

@nik9000 nik9000 left a comment

If it can't be pushed to Lucene it looks like this will randomly sample the output rows from the previous steps. That's what I'd imagine for stuff like

FROM foo
| STATS a BY b, c, d
| RANDOM SAMPLE 0.01

But that's different than sampling the underlying docs. I think. Are we linking the behavior to how the optimization works?

}
}
return new Page(filteredBlocks);
}
Member

👍

private final int seed;

private boolean collecting = true;
private boolean isFinished = false;
Member

I wonder if these should be an enum with three values. Dunno.

Contributor

@jan-elastic jan-elastic left a comment

I gave this a quick first pass. I think it looks great, and contains all the necessary pieces that I had in mind. Nice job!

I've left a few comments and will revisit it in more detail later this week.

@@ -0,0 +1,220 @@
// Tests focused on the RANDOM_SAMPLE command
Contributor

The following test fails

FROM data | SORT @timestamp | RANDOM_SAMPLE 0.1 | LIMIT 1000

with

Unbounded sort not supported yet [SORT @timestamp] please add a limit

Contributor Author

Yes, this unbounded sort is a current limitation of ESQL, nothing specific to this command (the same happens if you swap RANDOM_SAMPLE with STATS, or with any other command that LIMIT cannot be pushed down past).

var seed = combinedSeed(context, randomSample, rsChild);
plan = new RandomSample(randomSample.source(), probability, seed, rsChild.child());
} else if (child instanceof Enrich
|| child instanceof Eval
Contributor

What about SORT?

I think you can also swap SORT and RANDOM_SAMPLE, and sorting less data is more efficient.

Contributor

That probably also solves the failing

FROM data | SORT @timestamp | RANDOM_SAMPLE 0.1 | LIMIT 1000

If you don't push down the RANDOM_SAMPLE, the SORT needs to become a TOP_N with N = limit / random_sample_p or so (e.g. with LIMIT 1000 and probability 0.1, roughly N = 10,000).

var seed = combinedSeed(context, randomSample, rsChild);
plan = new RandomSample(randomSample.source(), probability, seed, rsChild.child());
} else if (child instanceof Enrich
|| child instanceof Eval
Contributor

What about WHERE?

I think you can swap WHERE and RANDOM_SAMPLE, and I think it's beneficial to move RANDOM_SAMPLE to the ES query if possible. E.g.

FROM data | WHERE SUBSTRING(message, 0, 4) == "test" | RANDOM_SAMPLE 0.1

Contributor

Both WHERE and SORT reordering would change the query output, given that RANDOM_SAMPLE is deterministic, and assuming multi-node queries and Lucene were all deterministic or forced to be through a SORT/STATS.

Not sure if that's the idea here though, but it feels like a relevant decision. That said, I'm not sure the push down to Lucene will have the same results. So talking about determinism here is tricky...

Contributor

If we ignore the determinism / reproducibility with fixed seeds, you should be able to swap them.

Regarding WHERE: regardless of the ordering, you end up with a random 10% of the rows matching the WHERE.

Regarding SORT: regardless of the ordering, you get a random 10% of the rows in sorted order.

Making sure everything is reproducible when you specify a fixed seed sounds like a (possibly hard) implementation detail, but not something that dictates whether you can push down or not.

Thinking out loud: if that's really hard, we can also consider not implementing the fixed seed at all for now. (Have to discuss that with product of course.)

Contributor Author

A general guideline for command reordering / optimisation is to only do it if it doesn't alter the outcome (maybe with the exception of chaining the sampling, as you raise in the point below). While in theory randomly choosing and then sorting will produce a conceptually similar result, it wouldn't be the same one.
A similar guideline is to produce the same results irrespective of whether a set of operations is pushed down or not (i.e. "disabled optimisations").
This means that WHERE and SORT would not slide past sampling.

Regarding WHERE: regardless of the ordering, you end up with a random 10% of the rows matching the WHERE.

Yes, but it can be a totally different set that's sampled -- think of | WHERE foo IS NOT NULL.

Regarding SORT: regardless of the ordering, you get a random 10% of the rows in sorted order.

Again, | SORT ... NULLS LAST.

Nothing specific to the NULLs, just how the pool of rows to sample forms.

if (parentSeed != null) {
if (childSeed != null) {
var seedValue = (int) Foldables.valueOf(context.foldCtx(), parentSeed);
seedValue ^= (int) Foldables.valueOf(context.foldCtx(), childSeed);
Contributor

Does this give identical results to executing the two individual RANDOM_SAMPLEs?

If not, is that a problem?

Contributor Author

Does this give identical results to executing the two individual RANDOM_SAMPLEs?

Likely not: currently, in both implementations of the sampling - pushed down to RandomSamplingQuery or "kept" in the exec engine, in RandomSampleOperator - the sampling occurs over a limited number of documents, irrespective of the total set of available documents. In the former case, the limit is the total docs available in a segment (up to int's max - 1); in the latter, it's a config option (10K). Intuitively, I think the lower the given probability, the better the sampling will be if run over a large number of docs (if sampling with 50% over multiple batches, be those segments or rows in pages, you will end up with more than 50% of all batches/rows).
And compounding multiple sampling operations will lower the probability. However, ...

If not, is that a problem?

...I would think that in practice it's not, if sampling over a large document set, simply due to the statistical nature of the operation. However, I realise this is debatable.
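
To illustrate the statistical point (a standalone sketch under assumed semantics, not the actual ESQL code): chaining two Bernoulli samples keeps each row with probability p1 * p2 in expectation, just like a single merged sample with an XOR-combined seed would, but the concrete set of surviving rows differs between the two.

import java.util.SplittableRandom;

class CombinedSamplingSketch {
    public static void main(String[] args) {
        int rows = 1_000_000;
        double p1 = 0.5, p2 = 0.4;
        int seed1 = 42, seed2 = 7;

        SplittableRandom first = new SplittableRandom(seed1);
        SplittableRandom second = new SplittableRandom(seed2);
        SplittableRandom merged = new SplittableRandom(seed1 ^ seed2);

        int chained = 0, combined = 0;
        for (int i = 0; i < rows; i++) {
            // chained: the second sample only sees rows surviving the first
            if (first.nextDouble() < p1 && second.nextDouble() < p2) {
                chained++;
            }
            // merged: one sample with the product probability and a combined seed
            if (merged.nextDouble() < p1 * p2) {
                combined++;
            }
        }
        // Both counts land near rows * p1 * p2, but the selected rows are not the same.
        System.out.println("chained=" + chained + ", combined=" + combined);
    }
}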

Literal seed;
if (ctx.seed != null) {
seed = visitIntegerValue(ctx.seed);
if (seed.dataType() != DataType.INTEGER) {
Contributor

Doesn't visitIntegerValue do this check? If not, why not?

Contributor Author

The "integer value" is non-floating point value, which includes UNSIGNED_LONG.

@@ -2891,6 +2892,18 @@ public void testBasicForkError() {
assertThat(pe.getMessage(), containsString("mismatched input 'me' expecting INTEGER_LITERAL"));
}

public void testRandomSampleProbability() {
var e = expectThrows(VerificationException.class, () -> analyze("FROM test | RANDOM_SAMPLE 1."));
assertThat(e.getMessage(), containsString("RandomSampling probability must be strictly between 0.0 and 1.0, was [1.0]"));
Contributor

RandomSampling -> RANDOM_SAMPLE

Contributor Author

This error originates in the RandomSamplingQuery. I guess we could change it; I think we don't offer BWC over error messages (and there's no preexisting test for that query and this message).

@Override
public void addInput(Page page) {
final var addStart = System.nanoTime();
collect(page);
Contributor

@jan-elastic jan-elastic Mar 10, 2025

Why do you collect unsampled pages?

When a page arrives, you can immediately sample that page, and emit a sampled page. No need to keep anything in memory. (Or is it inefficient to output small pages?)

This should lead to much less memory pressure.

Contributor

@jan-elastic jan-elastic Mar 10, 2025

If you want output page size comparable to input page size, I would sample pages right away, collect floor(1 / prob) sampled pages, and then emit the merged page.
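
For illustration, a rough sketch of this per-page alternative, with hypothetical names rather than the actual operator API: each incoming page is sampled immediately and only the survivors are retained, until roughly floor(1 / probability) input pages have been consumed, at which point they are emitted as one merged page.

import java.util.ArrayList;
import java.util.List;
import java.util.SplittableRandom;

class StreamingSamplerSketch {
    private final double probability;
    private final SplittableRandom random;
    private final int pagesPerOutput;
    private final List<Object[]> pending = new ArrayList<>();
    private int pagesSeen = 0;

    StreamingSamplerSketch(double probability, long seed) {
        this.probability = probability;
        this.random = new SplittableRandom(seed);
        this.pagesPerOutput = Math.max(1, (int) Math.floor(1.0 / probability));
    }

    // Returns a merged output page once enough input pages were sampled, else null.
    List<Object[]> addInput(List<Object[]> page) {
        for (Object[] row : page) {
            if (random.nextDouble() < probability) {
                pending.add(row);            // only sampled rows are kept in memory
            }
        }
        if (++pagesSeen < pagesPerOutput) {
            return null;
        }
        List<Object[]> out = new ArrayList<>(pending);
        pending.clear();
        pagesSeen = 0;
        return out;
    }
}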

Contributor Author

If you want output page size comparable to input page size

Not really, I think this is not a concern. It'd be slightly more efficient to compact rows into fewer pages, but I'm not sure this would make a measurable difference.

When a page arrives, you can immediately sample that page, and emit a sampled page

Sampling over a small batch of rows, which is what a single page constitutes, can produce very skewed output: as in the example provided above, sampling 50% of every page will add up to more than 50% of all rows across those pages.

Separately, this would become more problematic, I think, if one doesn't have a probability to work with but a total number of output rows, since the operator works over a stream of incoming pages and might not know how to sample to achieve the desired distribution. In that case caching the output might help, but it's still collecting results in memory (not the input, but the output).

@@ -19,6 +19,7 @@ STATS : 'stats' -> pushMode(EXPRESSION_MODE);
WHERE : 'where' -> pushMode(EXPRESSION_MODE);

DEV_INLINESTATS : {this.isDevVersion()}? 'inlinestats' -> pushMode(EXPRESSION_MODE);
DEV_RANDOM_SAMPLE : {this.isDevVersion()}? 'random_sample' -> pushMode(EXPRESSION_MODE);
Contributor

I would call it just SAMPLE instead of RANDOM_SAMPLE. Curious what others think...

@jan-elastic
Contributor

If it can't be pushed to Lucene it looks like this will randomly sample the output rows from the previous steps. That's what I'd imagine for stuff like

FROM foo
| STATS a BY b, c, d
| RANDOM SAMPLE 0.01

But that's different than sampling the underlying docs. I think. Are we linking the behavior to how the optimization works?

I think that's the correct behavior: this should randomly sample the output rows.

So:

FROM index | STATS a BY b | SAMPLE 0.1

is very different from

FROM index | SAMPLE 0.1 | STATS a BY b


import static org.elasticsearch.xpack.esql.common.Failure.fail;

public class RandomSample extends UnaryPlan implements TelemetryAware, PostAnalysisVerificationAware {
Contributor

@jan-elastic jan-elastic Mar 10, 2025

The existing Random sampler aggregation has a notion of which aggs it's running the sampling together with.

E.g.

GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "sum_agg": {
      "sum": {
        "field": "taxful_total_price"
      }
    },
    "avg_agg": {
      "avg": {
        "field": "taxful_total_price"
      }
    }
  }
}

gives

  "aggregations": {
    "sum_agg": {
      "value": 350884.12890625
    },
    "avg_agg": {
      "value": 75.05542864304813
    }
  }

and the sampled agg:

GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "sampling": {
      "random_sampler": {
        "probability": 0.1
      },
      "aggs": {
        "sum_agg": {
          "sum": {
            "field": "taxful_total_price"
          }
        },
        "avg_agg": {
          "avg": {
            "field": "taxful_total_price"
          }
        }
      }
    }
  }
}

gives

  "aggregations": {
    "sampling": {
      "seed": 202761359,
      "probability": 0.1,
      "doc_count": 434,
      "avg_agg": {
        "value": 76.85233654953917
      },
      "sum_agg": {
        "value": 333539.140625
      }
    }
  }

In words: the sampled values are approximations of the correct values.

Instead, in ES|QL

FROM kibana_sample_data_ecommerce | SAMPLE 0.1 | STATS SUM(taxful_total_price)

gives an approximation of 0.1 times the sum of the prices. (And you have to scale up manually afterwards.)
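
For illustration only: as the numbers above suggest, the random_sampler aggregation scales additive metrics such as SUM back up (its 333539 approximates the unsampled 350884), whereas here the division by the probability would have to be applied manually. A small sketch of that manual scale-up, with hypothetical numbers:

class SampleScalingSketch {
    public static void main(String[] args) {
        double probability = 0.1;
        double sumOverSampledRows = 33_539.0;                     // raw sum over a ~10% sample
        double scaledEstimate = sumOverSampledRows / probability; // estimate of the full sum
        System.out.println("raw=" + sumOverSampledRows + ", scaled=" + scaledEstimate);
    }
}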

Is this a problem? If so, how do we want to resolve that?

Contributor Author

Indeed, the implementation is agnostic of anything that might follow the sampling. We'd need subsequent work for the aggs to adjust their output (potentially a different command, function or parameter as well?).

@felixbarny
Member

As a user, it's often difficult to choose a proper sampling rate that works well with different data sets. Have you thought about Elasticsearch being able to automatically choose the sample rate depending on how many docs match the query, to get a result with an acceptable accuracy? The input could be the desired level of accuracy or the maximum number of docs that the query should evaluate.

That would be similar to what we're doing in profiling, where there's a multi-stage process to the query. At first, the query is run with a very low sample rate. Based on the results, the sample rate needed to examine at least 20k docs (which is considered to be statistically significant) is estimated. Then, the query is re-run with the estimated sample rate.
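
For illustration, a minimal sketch of that two-stage idea (a hypothetical helper, not profiling's actual code): probe with a very low rate, estimate how many docs match overall, then derive the rate needed to examine roughly the target number of docs.

class AdaptiveRateSketch {
    // Estimate the sample rate for the second pass from a low-rate probe.
    static double estimateRate(long probeSampledDocs, double probeRate, long targetDocs) {
        double estimatedTotal = probeSampledDocs / probeRate;  // docs matching the query overall
        double rate = targetDocs / estimatedTotal;             // solve rate * total ≈ targetDocs
        return Math.min(1.0, Math.max(rate, 1e-9));
    }

    public static void main(String[] args) {
        // Example: a 0.001 probe that sampled 5_000 docs implies ~5M matching docs,
        // so ~0.004 would be needed to examine ~20_000 docs in the second pass.
        System.out.println(estimateRate(5_000, 0.001, 20_000));
    }
}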

@jan-elastic
Contributor

Hey @felixbarny, thanks for your input as a user. That's very valuable.

If I understand correctly, it would really help if there was a variant like this:

FROM data | SAMPLE 20000 | STATS ...

which samples up to 20000 documents, dropping the remainder?

Specifying an accuracy might be a lot harder and probably depends heavily on the stats that are accumulated.

@felixbarny
Member

FROM data | SAMPLE 20000 | STATS ...

Yeah, something like that would be really useful, I think.

Contributor

What is this file?

Contributor Author

It gets generated, so I've added it.

public RandomSample(Source source, Expression probability, @Nullable Expression seed, LogicalPlan child) {
super(source, child);
this.probability = probability;
this.seed = seed;
Contributor

I wonder if this seed should be fixed at planning time, instead of doing it in the LocalExecutionPlanner or in the query builder.
Otherwise, it may potentially be different on each node (if this ends up inside the fragment, not sure if that's possible), while if chosen by the user, it will be the same.

Contributor

I think this is the first randomness feature we have in ESQL? Should we maybe have a seed in the Configuration? Just thinking out loud here.

Contributor Author

Otherwise, it may potentially be different on each node

That shouldn't be an issue if that occurs predictably, right? Moreover, there is the non-deterministic factor of how the results coming from different nodes are sink'ed together (the issue we have with just a FROM foo, actually).

But I too have wondered if this shouldn't be fixed lower in the planning. My alternative, however, was to put it straight into the query assembly already (in the LogicalPlanBuilder).

Should we maybe have a seed in the Configuration?

That seems a bit too generic for this specific need, IMO.
We might want a | SORT RANDOM() function that might not feed on the same generator.

@@ -187,4 +189,13 @@ public static void loadEnrichPolicyResolution(EnrichResolution enrich, String po
public static IndexResolution tsdbIndexResolution() {
return loadMapping("tsdb-mapping.json", "test");
}

public static <E> E randomValueOtherThanTest(Predicate<E> exclude, Supplier<E> supplier) {
Contributor

Can we use randomValueOtherThanMany() from ESTestCase instead?

Contributor Author

Unfortunately not, since the predicate tests against a floating point range, not discrete values, as randomValueOtherThanMany() requires.

@nik9000
Member

nik9000 commented Mar 11, 2025

I think that's the correct behavior: this should randomly sample the output rows.

So:

FROM index | STATS a BY b | SAMPLE 0.1

is very different from

FROM index | SAMPLE 0.1 | STATS a BY b

I wonder if:

FROM index | WHERE a == 1 | SAMPLE 0.1

is different from

FROM index | SAMPLE 0.1 | WHERE a == 1

Do we want that too? I'd have to play with it to figure out what's happening in this version. Like, it's normal for us to shift WHERE in the query, but shifting it to the right of SAMPLE could change the meaning. When we're doing _search I think we're doing the bottom one, right?

@jan-elastic
Contributor

I wonder if:

FROM index | WHERE a == 1 | SAMPLE 0.1

is different from

FROM index | SAMPLE 0.1 | WHERE a == 1

See discussion here

@rockdaboot
Contributor

Hey @felixbarny, thanks for your input as a user. That's very valuable.

If I understand correctly, it would really help if there was a variant like this:

FROM data | SAMPLE 20000 | STATS ...

which samples up to 20000 documents, dropping the remainder?

Ideally not dropping the remainder, just calculating the probability automatically, so that the result set contains ~20000. For the use case that @felixbarny likely referred to (profiling), the exact number isn't relevant.

@felixbarny
Member

I think the use case is relevant beyond profiling, but we should take inspiration from the approach profiling takes. One example would be applying log categorization to a sample of logs. But instead of having to specify the sampling percentage, which depends on how many logs match the condition and is not known upfront, I'd like to just be able to specify the number of docs that I think would give me a representative result in a reasonable time (assuming the number of docs correlates with both accuracy and latency).

@bpintea
Contributor Author

bpintea commented Apr 23, 2025

Superseded by #125570.

@bpintea bpintea closed this Apr 23, 2025
@bpintea bpintea deleted the feat/random_sample branch April 23, 2025 15:56