ESQL: Add a random sample command #123879

Closed
bpintea wants to merge 9 commits from feat/random_sample

Conversation

bpintea
Contributor

@bpintea bpintea commented Mar 3, 2025

This adds a new command, RANDOM_SAMPLE <probability> <seed>. The command produces as output a random selection of the rows it receives as input, maintaining their input order.

It receives a probability as its first, mandatory parameter, a double in the (0.0, 1.0) interval, and an optional seed, an integer value. The former determines the selectivity of the sampling, the latter its reproducibility. If no seed is provided, a random one will be used.

The command will be "pushed down" to Lucene whenever possible, using the existing RandomSamplingQuery (used by the Random sampler aggregation).
This is as usually the most efficient way to execute the command since this controls which documents will be considered for loading (or further filtering) and it will also run the selection over an entire Lucene segment.

When no "push down" is possible, an ESQL operator caches a number of input pages up to a (minimum) number of rows
(currently equal to the configured maximum result set size) and then samples this batch. Meaning that (1) all the cached pages are first collected in memory (i.e. high memory pressure) and (2), the sampling runs over a significantly reduced document ID space, compared to that of a segment's (i.e. poor sampling with low probabilities).
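
For illustration, a minimal sketch of this fallback behavior, with hypothetical names rather than the actual RandomSampleOperator API: rows are buffered up to a limit, and each buffered row is then kept with the given probability, preserving input order.

import java.util.ArrayList;
import java.util.List;
import java.util.SplittableRandom;

// Sketch only: approximates the fallback described above, not the real operator.
class BufferedSamplerSketch {
    private final double probability;       // must be strictly between 0.0 and 1.0
    private final SplittableRandom random;  // seeded for reproducibility
    private final List<Object[]> buffer = new ArrayList<>();

    BufferedSamplerSketch(double probability, long seed) {
        this.probability = probability;
        this.random = new SplittableRandom(seed);
    }

    void collect(Object[] row) {
        buffer.add(row);                     // (1) everything is buffered first: high memory pressure
    }

    List<Object[]> sample() {
        List<Object[]> out = new ArrayList<>();
        for (Object[] row : buffer) {        // (2) sampling only sees the buffered rows
            if (random.nextDouble() < probability) {
                out.add(row);
            }
        }
        return out;
    }
}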

This adds the RANDOM_SAMPLE <probability> <seed>? command.
@elasticsearchmachine
Collaborator

Hi @bpintea, I've created a changelog YAML for you.

Member

@nik9000 nik9000 left a comment

If it can't be pushed to Lucene it looks like this will randomly sample the output rows from the previous steps. That's what I'd imagine for stuff like

FROM foo
| STATS a BY b, c, d
| RANDOM SAMPLE 0.01

But that's different than sampling the underlying docs. I think. Are we linking the behavior to how the optimization works?

}
}
return new Page(filteredBlocks);
}
Member

👍

private final int seed;

private boolean collecting = true;
private boolean isFinished = false;
Member

I wonder if these should be an enum with three values. Dunno.

Contributor

@jan-elastic jan-elastic left a comment

I gave this a quick first pass. I think it looks great, and contains all the necessary pieces that I had in mind. Nice job!

I've left a few comments and will revisit it in more detail later this week.

@@ -0,0 +1,220 @@
// Tests focused on the RANDOM_SAMPLE command
Contributor

The following test fails

FROM data | SORT @timestamp | RANDOM_SAMPLE 0.1 | LIMIT 1000

with

Unbounded sort not supported yet [SORT @timestamp] please add a limit

Contributor Author

Yes, this unbounded sort is a current limitation of ESQL, nothing specific to this command (the same happens if you swap RANDOM_SAMPLE with STATS, or with any other command that LIMIT cannot be pushed down past).

var seed = combinedSeed(context, randomSample, rsChild);
plan = new RandomSample(randomSample.source(), probability, seed, rsChild.child());
} else if (child instanceof Enrich
|| child instanceof Eval
Contributor

What about SORT?

I think you can also swap SORT and RANDOM_SAMPLE, and sorting less data is more efficient.

Contributor

That probably also solves the failing

FROM data | SORT @timestamp | RANDOM_SAMPLE 0.1 | LIMIT 1000

If you don't push down the RANDOM_SAMPLE, the SORT needs to become a TOP_N with N = limit / random_sample_p or so (e.g. with LIMIT 1000 and probability 0.1, roughly N = 10,000).

var seed = combinedSeed(context, randomSample, rsChild);
plan = new RandomSample(randomSample.source(), probability, seed, rsChild.child());
} else if (child instanceof Enrich
|| child instanceof Eval
Contributor

What about WHERE?

I think you can swap WHERE and RANDOM_SAMPLE, and I think it's beneficial to move RANDOM_SAMPLE to the ES query if possible. E.g.

FROM data | WHERE SUBSTRING(message, 0, 4) == "test" | RANDOM_SAMPLE 0.1

Contributor

Both WHERE and SORT reordering would change the query output, given that RANDOM_SAMPLE is deterministic, and assuming multi-node queries and Lucene were all deterministic or forced to be through a SORT/STATS.

Not sure if that's the idea here though, but it feels like a relevant decision. That said, I'm not sure the push down to Lucene will have the same results. So talking about determinism here is tricky...

Contributor

If we ignore the determinism / reproducibility with fixed seeds, you should be able to swap them.

Regarding WHERE: regardless of the ordering, you end up with a random 10% of the rows matching the WHERE.

Regarding SORT: regardless of the ordering, you get a random 10% of the rows in sorted order.

Making sure everything is reproducible when you specify a fixed seed sounds like a (possibly hard) implementation detail, but not something that dictates whether you can push down or not.

Thinking out loud: if that's really hard, we can also consider not implementing the fixed seed at all for now. (Have to discuss that with product of course.)

Contributor Author

A general guideline for command reordering / optimisation is to only do it if it doesn't alter the outcome (maybe with the exception of chaining the sampling, as you raise in the point below). While in theory randomly choosing and then sorting will produce a conceptually similar result, it wouldn't be the same one.
A similar guideline is to produce the same results irrespective of whether a set of operations is pushed down or not (i.e. "disabled optimisations").
This means that WHERE and SORT would not slide past sampling.

Regarding WHERE: regardless of the ordering, you end up with a random 10% of the rows matching the WHERE.

Yes, but it can be a totally different set that's sampled -- think of | WHERE foo IS NOT NULL.

Regarding SORT: regardless of the ordering, you get a random 10% of the rows in sorted order.

Again, | SORT ... NULLS LAST.

Nothing specific to the NULLs, just how the pool of rows to sample forms.

if (parentSeed != null) {
if (childSeed != null) {
var seedValue = (int) Foldables.valueOf(context.foldCtx(), parentSeed);
seedValue ^= (int) Foldables.valueOf(context.foldCtx(), childSeed);
Contributor

Does this give identical results to executing the two individual RANDOM_SAMPLEs?

If not, is that a problem?

Contributor Author

Does this give identical results to executing the two individual RANDOM_SAMPLEs?

Likely not: currently, in both implementations of the sampling - pushed down to RandomSamplingQuery or "kept" in the exec engine, in RandomSampleOperator - the sampling occurs over a limited number of documents, irrespective of the total set of available documents. In the former case, the limit is the total docs available in a segment (up to int's max - 1); in the latter, it's a config option (10K). Intuitively, I think the lower the given probability, the better the sampling will be if run over a large number of docs (if sampling with 50% over multiple batches, be those segments or rows in pages, you will end up with more than 50% of all batches/rows).
And compounding multiple sampling operations will lower the probability. However, ...

If not, is that a problem?

...I would think that in practice it's not, if sampling over a large document set, simply due to the statistical nature of the operation. However, I realise this is debatable.
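
To illustrate the statistical point (a standalone sketch under assumed semantics, not the actual ESQL code): chaining two Bernoulli samples keeps each row with probability p1 * p2 in expectation, just like a single merged sample with an XOR-combined seed would, but the concrete set of surviving rows differs between the two.

import java.util.SplittableRandom;

class CombinedSamplingSketch {
    public static void main(String[] args) {
        int rows = 1_000_000;
        double p1 = 0.5, p2 = 0.4;
        int seed1 = 42, seed2 = 7;

        SplittableRandom first = new SplittableRandom(seed1);
        SplittableRandom second = new SplittableRandom(seed2);
        SplittableRandom merged = new SplittableRandom(seed1 ^ seed2);

        int chained = 0, combined = 0;
        for (int i = 0; i < rows; i++) {
            // chained: the second sample only sees rows surviving the first
            if (first.nextDouble() < p1 && second.nextDouble() < p2) {
                chained++;
            }
            // merged: one sample with the product probability and a combined seed
            if (merged.nextDouble() < p1 * p2) {
                combined++;
            }
        }
        // Both counts land near rows * p1 * p2, but the selected rows are not the same.
        System.out.println("chained=" + chained + ", combined=" + combined);
    }
}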

Literal seed;
if (ctx.seed != null) {
seed = visitIntegerValue(ctx.seed);
if (seed.dataType() != DataType.INTEGER) {
Contributor

Doesn't visitIntegerValue do this check? If not, why not?

Contributor Author

The "integer value" is non-floating point value, which includes UNSIGNED_LONG.

@@ -2891,6 +2892,18 @@ public void testBasicForkError() {
assertThat(pe.getMessage(), containsString("mismatched input 'me' expecting INTEGER_LITERAL"));
}

public void testRandomSampleProbability() {
var e = expectThrows(VerificationException.class, () -> analyze("FROM test | RANDOM_SAMPLE 1."));
assertThat(e.getMessage(), containsString("RandomSampling probability must be strictly between 0.0 and 1.0, was [1.0]"));
Contributor

RandomSampling -> RANDOM_SAMPLE

Contributor Author

This error originates in the RandomSamplingQuery. I guess we could change it; I think we don't offer BWC over error messages (and there's no preexisting test for that query and this message).

@Override
public void addInput(Page page) {
final var addStart = System.nanoTime();
collect(page);
Contributor

@jan-elastic jan-elastic Mar 10, 2025

Why do you collect unsampled pages?

When a page arrives, you can immediately sample that page, and emit a sampled page. No need to keep anything in memory. (Or is it inefficient to output small pages?)

This should lead to much less memory pressure.

Contributor

@jan-elastic jan-elastic Mar 10, 2025

If you want output page size comparable to input page size, I would sample pages right away, collect floor(1 / prob) sampled pages, and then emit the merged page.
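
For illustration, a rough sketch of this per-page alternative, with hypothetical names rather than the actual operator API: each incoming page is sampled immediately and only the survivors are retained, until roughly floor(1 / probability) input pages have been consumed, at which point they are emitted as one merged page.

import java.util.ArrayList;
import java.util.List;
import java.util.SplittableRandom;

class StreamingSamplerSketch {
    private final double probability;
    private final SplittableRandom random;
    private final int pagesPerOutput;
    private final List<Object[]> pending = new ArrayList<>();
    private int pagesSeen = 0;

    StreamingSamplerSketch(double probability, long seed) {
        this.probability = probability;
        this.random = new SplittableRandom(seed);
        this.pagesPerOutput = Math.max(1, (int) Math.floor(1.0 / probability));
    }

    // Returns a merged output page once enough input pages were sampled, else null.
    List<Object[]> addInput(List<Object[]> page) {
        for (Object[] row : page) {
            if (random.nextDouble() < probability) {
                pending.add(row);            // only sampled rows are kept in memory
            }
        }
        if (++pagesSeen < pagesPerOutput) {
            return null;
        }
        List<Object[]> out = new ArrayList<>(pending);
        pending.clear();
        pagesSeen = 0;
        return out;
    }
}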

Contributor Author

If you want output page size comparable to input page size

Not really, I think this is not a concern. It'd be slightly more efficient to compact rows into fewer pages, but I'm not sure this would make a measurable difference.

When a page arrives, you can immediately sample that page, and emit a sampled page

Sampling over a small batch of rows, which is what a single page constitutes, can produce very skewed output: as in the example provided above, sampling 50% of every page will add up to more than 50% of all rows across those pages.

Separately, this would become more problematic, I think, if one doesn't have a probability to work with but a total number of output rows, since the operator works over a stream of incoming pages and might not know how to sample to achieve the desired distribution. In that case caching the output might help, but it's still collecting results in memory (not the input, but the output).

@@ -19,6 +19,7 @@ STATS : 'stats' -> pushMode(EXPRESSION_MODE);
WHERE : 'where' -> pushMode(EXPRESSION_MODE);

DEV_INLINESTATS : {this.isDevVersion()}? 'inlinestats' -> pushMode(EXPRESSION_MODE);
DEV_RANDOM_SAMPLE : {this.isDevVersion()}? 'random_sample' -> pushMode(EXPRESSION_MODE);
Contributor

I would call it just SAMPLE instead of RANDOM_SAMPLE. Curious what others think...

@jan-elastic
Contributor

If it can't be pushed to Lucene it looks like this will randomly sample the output rows from the previous steps. That's what I'd imagine for stuff like

FROM foo
| STATS a BY b, c, d
| RANDOM SAMPLE 0.01

But that's different than sampling the underlying docs. I think. Are we linking the behavior to how the optimization works?

I think that's the correct behavior: this should randomly sample the output rows.

So:

FROM index | STATS a BY b | SAMPLE 0.1

is very different from

FROM index | SAMPLE 0.1 | STATS a BY b


import static org.elasticsearch.xpack.esql.common.Failure.fail;

public class RandomSample extends UnaryPlan implements TelemetryAware, PostAnalysisVerificationAware {
Contributor

@jan-elastic jan-elastic Mar 10, 2025

The existing Random sampler aggregation has a notion of which aggs it's running the sampling together with.

E.g.

GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "sum_agg": {
      "sum": {
        "field": "taxful_total_price"
      }
    },
    "avg_agg": {
      "avg": {
        "field": "taxful_total_price"
      }
    }
  }
}

gives

  "aggregations": {
    "sum_agg": {
      "value": 350884.12890625
    },
    "avg_agg": {
      "value": 75.05542864304813
    }
  }

and the sampled agg:

GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "sampling": {
      "random_sampler": {
        "probability": 0.1
      },
      "aggs": {
        "sum_agg": {
          "sum": {
            "field": "taxful_total_price"
          }
        },
        "avg_agg": {
          "avg": {
            "field": "taxful_total_price"
          }
        }
      }
    }
  }
}

gives

  "aggregations": {
    "sampling": {
      "seed": 202761359,
      "probability": 0.1,
      "doc_count": 434,
      "avg_agg": {
        "value": 76.85233654953917
      },
      "sum_agg": {
        "value": 333539.140625
      }
    }
  }

In words: the sampled values are approximations of the correct values.

Instead, in ES|QL

FROM kibana_sample_data_ecommerce | SAMPLE 0.1 | STATS SUM(taxful_total_price)

gives an approximation of 0.1 times the sum of the prices. (And you have to scale up manually afterwards.)
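
For illustration only: as the numbers above suggest, the random_sampler aggregation scales additive metrics such as SUM back up (its 333539 approximates the unsampled 350884), whereas here the division by the probability would have to be applied manually. A small sketch of that manual scale-up, with hypothetical numbers:

class SampleScalingSketch {
    public static void main(String[] args) {
        double probability = 0.1;
        double sumOverSampledRows = 33_539.0;                     // raw sum over a ~10% sample
        double scaledEstimate = sumOverSampledRows / probability; // estimate of the full sum
        System.out.println("raw=" + sumOverSampledRows + ", scaled=" + scaledEstimate);
    }
}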

Is this a problem? If so, how do we want to resolve that?

Contributor Author

Indeed, the implementation is agnostic of anything that might follow the sampling. We'd need subsequent work for the aggs to adjust their output (potentially a different command, function or parameter as well?).

@felixbarny
Member

As a user, it's often difficult to choose a proper sampling rate that works well with different data sets. Have you thought about Elasticsearch being able to automatically choose the sample rate depending on how many docs match the query, to get a result with an acceptable accuracy? The input could be the desired level of accuracy or the maximum number of docs that the query should evaluate.

That would be similar to what we're doing in profiling, where there's a multi-stage process to the query. At first, the query is run with a very low sample rate. Based on the results, the sample rate needed to examine at least 20k docs (which is considered to be statistically significant) is estimated. Then, the query is re-run with the estimated sample rate.
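
For illustration, a minimal sketch of that two-stage idea (a hypothetical helper, not profiling's actual code): probe with a very low rate, estimate how many docs match overall, then derive the rate needed to examine roughly the target number of docs.

class AdaptiveRateSketch {
    // Estimate the sample rate for the second pass from a low-rate probe.
    static double estimateRate(long probeSampledDocs, double probeRate, long targetDocs) {
        double estimatedTotal = probeSampledDocs / probeRate;  // docs matching the query overall
        double rate = targetDocs / estimatedTotal;             // solve rate * total ≈ targetDocs
        return Math.min(1.0, Math.max(rate, 1e-9));
    }

    public static void main(String[] args) {
        // Example: a 0.001 probe that sampled 5_000 docs implies ~5M matching docs,
        // so ~0.004 would be needed to examine ~20_000 docs in the second pass.
        System.out.println(estimateRate(5_000, 0.001, 20_000));
    }
}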

@jan-elastic
Contributor

Hey @felixbarny, thanks for your input as a user. That's very valuable.

If I understand correctly, it would really help if there was a variant like this:

FROM data | SAMPLE 20000 | STATS ...

which samples up to 20000 documents, dropping the remainder?

Specifying an accuracy might be a lot harder and probably depends heavily on the stats that are accumulated.

@felixbarny
Member

FROM data | SAMPLE 20000 | STATS ...

Yeah, something like that would be really useful, I think.

Contributor

What is this file?

Contributor Author

It gets generated, so I've added it.

public RandomSample(Source source, Expression probability, @Nullable Expression seed, LogicalPlan child) {
super(source, child);
this.probability = probability;
this.seed = seed;
Contributor

I wonder if this seed should be fixed at planning time, instead of doing it in the LocalExecutionPlanner or in the query builder.
Otherwise, it may potentially be different on each node (if this ends up inside the fragment, not sure if that's possible), while if chosen by the user, it will be the same.

Contributor

I think this is the first randomness feature we have in ESQL? Should we maybe have a seed in the Configuration? Just thinking out loud here.

Contributor Author

Otherwise, it may potentially be different on each node

That shouldn't be an issue if that occurs predictably, right? Moreover, there is the non-deterministic factor of how the results coming from different nodes are sink'ed together (the issue we have with just a FROM foo, actually).

But I too have wondered if this shouldn't be fixed lower in the planning. My alternative, however, was to put it straight into the query assembly already (in the LogicalPlanBuilder).

Should we maybe have a seed in the Configuration?

That seems a bit too generic for this specific need, IMO.
We might want a | SORT RANDOM() function that might not feed on the same generator.

@@ -187,4 +189,13 @@ public static void loadEnrichPolicyResolution(EnrichResolution enrich, String po
public static IndexResolution tsdbIndexResolution() {
return loadMapping("tsdb-mapping.json", "test");
}

public static <E> E randomValueOtherThanTest(Predicate<E> exclude, Supplier<E> supplier) {
Contributor

Can we use randomValueOtherThanMany() from ESTestCase instead?

Contributor Author

Unfortunately not, since the predicate tests against a floating point range, not discrete values, as randomValueOtherThanMany() requires.

@nik9000
Member

nik9000 commented Mar 11, 2025

I think that's the correct behavior: this should randomly sample the output rows.

So:

FROM index | STATS a BY b | SAMPLE 0.1

is very different from

FROM index | SAMPLE 0.1 | STATS a BY b

I wonder if:

FROM index | WHERE a == 1 | SAMPLE 0.1

is different from

FROM index | SAMPLE 0.1 | WHERE a == 1

Do we want that too? I'd have to play with it to figure out what's happening in this version. Like, it's normal for us to shift WHERE in the query, but shifting it to the right of SAMPLE could change the meaning. When we're doing _search I think we're doing the bottom one, right?

@jan-elastic
Contributor

I wonder if:

FROM index | WHERE a == 1 | SAMPLE 0.1

is different from

FROM index | SAMPLE 0.1 | WHERE a == 1

See discussion here

@rockdaboot
Contributor

Hey @felixbarny, thanks for your input as a user. That's very valuable.

If I understand correctly, it would really help if there was a variant like this:

FROM data | SAMPLE 20000 | STATS ...

which samples up to 20000 documents, dropping the remainder?

Ideally not dropping the remainder, just calculating the probability automatically, so that the result set contains ~20000. For the use case that @felixbarny likely referred to (profiling), the exact number isn't relevant.

@felixbarny
Member

I think the use case is relevant beyond profiling, but we should take inspiration from the approach profiling takes. One example would be applying log categorization to a sample of logs. But instead of having to specify the sampling percentage, which depends on how many logs match the condition and is not known upfront, I'd like to just be able to specify the number of docs that I think would give me a representative result in a reasonable time (assuming the number of docs correlates with both accuracy and latency).

@bpintea
Contributor Author

bpintea commented Apr 23, 2025

Superseded by #125570.

@bpintea bpintea closed this Apr 23, 2025
@bpintea bpintea deleted the feat/random_sample branch April 23, 2025 15:56