ESQL: Add a random sample command #123879
Conversation
This adds the `RANDOM_SAMPLE <probability> <seed>?` command.
Hi @bpintea, I've created a changelog YAML for you.
If it can't be pushed to Lucene, it looks like this will randomly sample the output rows from the previous steps. That's what I'd imagine for stuff like
```
FROM foo
| STATS a BY b, c, d
| RANDOM_SAMPLE 0.01
```
But that's different from sampling the underlying docs, I think. Are we linking the behavior to how the optimization works?
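To make the distinction concrete, here's a small, self-contained Java simulation (hypothetical data and names, not the PR's code) contrasting sampling after an aggregation with aggregating a sample:
```java
import java.util.HashMap;
import java.util.Map;
import java.util.SplittableRandom;

public class SampleSemanticsDemo {
    public static void main(String[] args) {
        SplittableRandom rnd = new SplittableRandom(0);
        int[] docs = new int[100_000];
        for (int i = 0; i < docs.length; i++) {
            docs[i] = rnd.nextInt(10);   // 10 groups of ~10k docs each
        }

        // Sampling after STATS: exact counts, but a random subset of groups.
        Map<Integer, Long> statsThenSample = new HashMap<>();
        for (int d : docs) {
            statsThenSample.merge(d, 1L, Long::sum);
        }
        statsThenSample.entrySet().removeIf(e -> rnd.nextDouble() >= 0.5);

        // Sampling the underlying docs, then STATS: all groups, ~halved counts.
        Map<Integer, Long> sampleThenStats = new HashMap<>();
        for (int d : docs) {
            if (rnd.nextDouble() < 0.5) {
                sampleThenStats.merge(d, 1L, Long::sum);
            }
        }

        System.out.println("sample after stats: " + statsThenSample);
        System.out.println("stats after sample: " + sampleThenStats);
    }
}
```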
}
}
return new Page(filteredBlocks);
}
👍
private final int seed;

private boolean collecting = true;
private boolean isFinished = false;
I wonder if these should be an enum with three values. Dunno.
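For illustration, a minimal sketch of what such an enum might look like (names made up, not from the PR):
```java
// Hypothetical state enum replacing the `collecting` / `isFinished` booleans.
enum State {
    COLLECTING,   // collecting == true,  isFinished == false
    EMITTING,     // collecting == false, isFinished == false
    FINISHED      // isFinished == true
}
```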
I gave this a quick first pass. I think it looks great, and contains all the necessary pieces that I had in mind. Nice job!
I've left a few comments and will revisit it in more detail later this week.
@@ -0,0 +1,220 @@
// Tests focused on the RANDOM_SAMPLE command |
The following test fails:
```
FROM data | SORT @timestamp | RANDOM_SAMPLE 0.1 | LIMIT 1000
```
with
```
Unbounded sort not supported yet [SORT @timestamp] please add a limit
```
Yes, the unbounded sort is a current limitation of ESQL, nothing specific to this command (the same happens if you swap `RANDOM_SAMPLE` with `STATS`, or any other command that `LIMIT` cannot be pushed down past).
var seed = combinedSeed(context, randomSample, rsChild);
plan = new RandomSample(randomSample.source(), probability, seed, rsChild.child());
} else if (child instanceof Enrich
|| child instanceof Eval
What about `SORT`?
I think you can also swap `SORT` and `RANDOM_SAMPLE`, and sorting less data is more efficient.
That probably also solves the failing
```
FROM data | SORT @timestamp | RANDOM_SAMPLE 0.1 | LIMIT 1000
```
If you don't push down the `RANDOM_SAMPLE`, the `SORT` needs to become a `TOP_N` with `N = limit / random_sample_p` or so.
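A back-of-the-envelope version of that arithmetic (illustrative only, not code from the PR):
```java
public class TopNLimitEstimate {
    public static void main(String[] args) {
        int limit = 1000;            // the query's LIMIT
        double probability = 0.1;    // RANDOM_SAMPLE probability
        // To still have ~limit rows left after sampling, the TOP_N below
        // the sample needs roughly limit / probability rows.
        int topN = (int) Math.ceil(limit / probability);
        System.out.println(topN);    // 10000
    }
}
```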
var seed = combinedSeed(context, randomSample, rsChild);
plan = new RandomSample(randomSample.source(), probability, seed, rsChild.child());
} else if (child instanceof Enrich
|| child instanceof Eval
What about `WHERE`?
I think you can swap `WHERE` and `RANDOM_SAMPLE`, and I think it's beneficial to move `RANDOM_SAMPLE` into the ES query if possible. E.g.
```
FROM data | WHERE SUBSTRING(message, 0, 4) == "test" | RANDOM_SAMPLE 0.1
```
Reordering with either `WHERE` or `SORT` would change the query output, given that `RANDOM_SAMPLE` is deterministic, at least if multi-node queries and Lucene were all deterministic too, or forced to be through a `SORT`/`STATS`.
Not sure if that's the idea here, but it feels like a relevant decision. That said, I'm not sure the push down to Lucene will produce the same results, so talking about determinism here is tricky...
If we ignore the determinism / reproducibility with fixed seeds, you should be able to swap them.
Regarding `WHERE`: regardless of the ordering, you end up with a random 10% of the rows matching the `WHERE`.
Regarding `SORT`: regardless of the ordering, you get a random 10% of the rows in sorted order.
Making sure everything is reproducible when you specify a fixed seed sounds like a (possibly hard) implementation detail, but not something that dictates whether you can push down or not.
Thinking out loud: if that's really hard, we could also consider not implementing the fixed seed at all for now. (We'd have to discuss that with product, of course.)
A general guideline of command reordering / optimisation is to only do it if it doesn't alter the outcome (maybe with the exception of chaining the sampling, as you raise below). While in theory randomly choosing and then sorting will produce a conceptually similar result, it wouldn't be the same one.
A similar guideline is to produce the same results irrespective of whether a set of operations is pushed down or not (i.e. "disabled optimisations").
Meaning that `WHERE` and `SORT` would not slide past sampling.

> Regarding WHERE: regardless of the ordering, you end up with a random 10% of rows matching the WHERE.

Yes, but the sampled set can be a totally different one: think of `| WHERE foo IS NOT NULL`.

> Regarding SORT: regardless of the ordering, you get a random 10% of the rows in sorted order.

Again, `| SORT ... NULLS LAST`.
Nothing specific to the NULLs, just how the pool of rows to sample forms.
if (parentSeed != null) {
if (childSeed != null) {
var seedValue = (int) Foldables.valueOf(context.foldCtx(), parentSeed);
seedValue ^= (int) Foldables.valueOf(context.foldCtx(), childSeed);
Does this give identical results to executing the two individual `RANDOM_SAMPLE`s?
If not, is that a problem?
> Does this give identical results to executing the two individual RANDOM_SAMPLEs?

Likely not: currently, in both implementations of the sampling (pushed down to `RandomSamplingQuery`, or "kept" in the exec engine, in `RandomSampleOperator`), the sampling occurs over a limited number of documents, irrespective of the total set of available documents. In the former case, the limit is the total number of docs available in a segment (up to int's max - 1); in the latter, it's a config option (10K). Intuitively, I think the lower the given probability, the better the sampling will be if run over a large number of docs (if sampling with 50% over multiple batches, be those segments or rows in pages, you will end up with more than 50% of all batches/rows).
And compounding multiple sampling operations will lower the probability. However, ...

> If not, is that a problem?

...I would think that in practice it's not, if sampling over a large document set, just due to the statistical nature of the operation. However, I realise this is debatable.
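A small, self-contained demo of the point (the per-row seeding below is assumed for illustration, it is not the actual operator logic): two chained samples with probabilities p1 and p2, versus one pass at p1*p2 with the XOR-ed seed, keep the same fraction in expectation but generally different rows:
```java
import java.util.SplittableRandom;

public class CombinedSampleDemo {
    // Hypothetical per-row keep/drop decision, deterministic in (seed, row).
    static boolean keep(long seed, int row, double p) {
        return new SplittableRandom(seed + row).nextDouble() < p;
    }

    public static void main(String[] args) {
        int seed1 = 42, seed2 = 7;
        int combined = seed1 ^ seed2;     // seeds XOR-ed, as in the rule above
        double p1 = 0.5, p2 = 0.5;
        double pCombined = p1 * p2;       // probabilities multiply
        for (int row = 0; row < 8; row++) {
            boolean sequential = keep(seed1, row, p1) && keep(seed2, row, p2);
            boolean onePass = keep(combined, row, pCombined);
            // Same expected fraction (25%), generally different rows kept.
            System.out.printf("row %d: sequential=%b combined=%b%n", row, sequential, onePass);
        }
    }
}
```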
Literal seed;
if (ctx.seed != null) {
seed = visitIntegerValue(ctx.seed);
if (seed.dataType() != DataType.INTEGER) {
Doesn't `visitIntegerValue` do this check? If not, why not?
The "integer value" is non-floating point value, which includes UNSIGNED_LONG
.
@@ -2891,6 +2892,18 @@ public void testBasicForkError() {
    assertThat(pe.getMessage(), containsString("mismatched input 'me' expecting INTEGER_LITERAL"));
}

public void testRandomSampleProbability() {
    var e = expectThrows(VerificationException.class, () -> analyze("FROM test | RANDOM_SAMPLE 1."));
    assertThat(e.getMessage(), containsString("RandomSampling probability must be strictly between 0.0 and 1.0, was [1.0]"));
`RandomSampling` -> `RANDOM_SAMPLE`
This error originates in the `RandomSamplingQuery`. I guess we could change it; I think we don't offer BWC over error messages (and there's no preexisting test for that query and this message).
@Override
public void addInput(Page page) {
    final var addStart = System.nanoTime();
    collect(page);
Why do you collect unsampled pages?
When a page arrives, you can immediately sample that page, and emit a sampled page. No need to keep anything in memory. (Or is it inefficient to output small pages?)
This should lead to much less memory pressure.
If you want output page size comparable to input page size, I would sample pages right away, collect `floor(1 / prob)` sampled pages, and then emit the merged page.
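A rough sketch of that suggestion, under assumptions (`int[]` standing in for a Page; names are made up, this is not the PR's `RandomSampleOperator`):
```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SplittableRandom;

class StreamingSampler {
    private final double probability;
    private final SplittableRandom random;
    private final List<int[]> sampled = new ArrayList<>();
    private int batchesSeen = 0;

    StreamingSampler(double probability, long seed) {
        this.probability = probability;
        this.random = new SplittableRandom(seed);
    }

    // Sample each incoming batch immediately; after ~1/probability input
    // batches, emit one merged batch of roughly input-batch size.
    // Returns null while still accumulating.
    int[] addBatch(int[] rows) {
        sampled.add(Arrays.stream(rows).filter(r -> random.nextDouble() < probability).toArray());
        if (++batchesSeen < Math.max(1, (int) Math.floor(1 / probability))) {
            return null;
        }
        int[] merged = sampled.stream().flatMapToInt(Arrays::stream).toArray();
        sampled.clear();
        batchesSeen = 0;
        return merged;
    }
}
```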
> If you want output page size comparable to input page size

Not really, I think this is no concern. It'd be slightly more efficient to compact rows into fewer pages, but I'm not sure this would make a measurable difference.

> When a page arrives, you can immediately sample that page, and emit a sampled page

Sampling over a batch with a low number of rows, which is what one single page constitutes, can produce very skewed output: as in the example provided above, taking 50% of every page will sum up to more than 50% of all rows across those pages.
Tangentially, this would be made more problematic, I think, if one doesn't have a probability to work with, but a total number of output rows, since the operator works over a stream of incoming pages and might not know how to sample to achieve the desired distribution. In that case caching the output might help, but it's still collecting the results in memory (not the input, but the output).
@@ -19,6 +19,7 @@ STATS : 'stats' -> pushMode(EXPRESSION_MODE);
WHERE : 'where' -> pushMode(EXPRESSION_MODE);

DEV_INLINESTATS : {this.isDevVersion()}? 'inlinestats' -> pushMode(EXPRESSION_MODE);
DEV_RANDOM_SAMPLE : {this.isDevVersion()}? 'random_sample' -> pushMode(EXPRESSION_MODE);
I would call it just `SAMPLE` instead of `RANDOM_SAMPLE`. Curious what others think...
I think that's the correct behavior: this should randomly sample the output rows. So sampling the rows produced by the previous steps (e.g. after a `STATS`) is very different from sampling the underlying docs first.
import static org.elasticsearch.xpack.esql.common.Failure.fail;

public class RandomSample extends UnaryPlan implements TelemetryAware, PostAnalysisVerificationAware {
The existing Random sampler aggregation has a notion of what aggs it's running together with sample.
E.g.
```
GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "sum_agg": {
      "sum": {
        "field": "taxful_total_price"
      }
    },
    "avg_agg": {
      "avg": {
        "field": "taxful_total_price"
      }
    }
  }
}
```
gives
```
"aggregations": {
  "sum_agg": {
    "value": 350884.12890625
  },
  "avg_agg": {
    "value": 75.05542864304813
  }
}
```
and the sampled agg:
```
GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "sampling": {
      "random_sampler": {
        "probability": 0.1
      },
      "aggs": {
        "sum_agg": {
          "sum": {
            "field": "taxful_total_price"
          }
        },
        "avg_agg": {
          "avg": {
            "field": "taxful_total_price"
          }
        }
      }
    }
  }
}
```
gives
```
"aggregations": {
  "sampling": {
    "seed": 202761359,
    "probability": 0.1,
    "doc_count": 434,
    "avg_agg": {
      "value": 76.85233654953917
    },
    "sum_agg": {
      "value": 333539.140625
    }
  }
}
```
In words: the sampled values are approximations of the correct values.
Instead, in ES|QL
```
FROM kibana_sample_data_ecommerce | SAMPLE 0.1 | STATS SUM(taxful_total_price)
```
gives an approximation of 0.1 times the sum of the prices. (And you have to scale up manually afterwards.)
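For reference, the manual scale-up amounts to dividing by the sampling probability (the numbers below are illustrative, not from the sample data set):
```java
public class ScaleUpEstimate {
    public static void main(String[] args) {
        double probability = 0.1;
        double sampledSum = 35_088.41;   // hypothetical SUM over the 10% sample
        double estimatedSum = sampledSum / probability;
        System.out.println(estimatedSum);  // ~350884, estimate of the true sum
    }
}
```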
Is this a problem? If so, how do we want to resolve that?
Indeed, the implementation is agnostic of anything that might follow the sampling. We'd need subsequent work for the aggs to adjust their output (potentially a different command, function, or parameter as well?).
As a user, it's often difficult to choose a proper sampling rate that works well with different data sets. Have you thought about Elasticsearch being able to automatically choose the sample rate depending on how many docs match the query, to get a result with an acceptable accuracy? The input could be the desired level of accuracy or the maximum number of docs that the query should evaluate. That would be similar to what we're doing in profiling, where there's a multi-stage process to the query: at first, the query is run with a very low sample rate; based on the results, the sample rate needed to examine at least 20k docs (which is considered to be statistically significant) is estimated; then, the query is re-run with the estimated sample rate.
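A sketch of that multi-stage idea (the `runCount`/`runFull` calls are hypothetical placeholders for executing the query, not an existing API):
```java
public class AdaptiveSampleRate {
    static final double PROBE_RATE = 0.001;   // very low initial sample rate
    static final long TARGET_DOCS = 20_000;   // considered statistically significant

    // Hypothetical: run the query sampled at `rate`, return sampled doc count.
    static long runCount(double rate) { return 1_234; }

    // Hypothetical: run the full query sampled at `rate`.
    static void runFull(double rate) { }

    public static void main(String[] args) {
        long probed = runCount(PROBE_RATE);
        long estimatedMatches = Math.round(probed / PROBE_RATE);
        // Rate needed so the re-run examines ~TARGET_DOCS docs (capped at 1.0).
        double rate = Math.min(1.0, (double) TARGET_DOCS / estimatedMatches);
        runFull(rate);
    }
}
```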
Hey @felixbarny, thanks for your input as a user. That's very valuable. If I understand correctly, it would really help if there was a variant that samples up to 20000 documents, dropping the remainder? Specifying an accuracy might be a lot harder and probably depends heavily on the stats that are accumulated.
Yeah, something like that would be really useful, I think.
What is this file?
It gets generated, so I've added it.
public RandomSample(Source source, Expression probability, @Nullable Expression seed, LogicalPlan child) {
    super(source, child);
    this.probability = probability;
    this.seed = seed;
I wonder if this seed should be fixed at planning time, instead of doing it in the `LocalExecutionPlanner` or in the query builder.
Otherwise, it may potentially be different on each node (if this ends up inside the fragment; not sure if that's possible), while if chosen by the user, it will be the same.
I think this is the first randomness feature we have in ESQL? Should we maybe have a seed in the Configuration? Just thinking out loud here.
> Otherwise, it may potentially be different in each node

That shouldn't be an issue if it occurs predictably, right? Moreover, there is the non-deterministic factor of how the results are sink'ed together coming from different nodes (the issue we have with just a `FROM foo`, actually).
But I too have wondered if this shouldn't be fixed lower in the planning. My alternative, however, was to put it straight in the query assembling already (`LogicalPlanBuilder`).

> Should we maybe have a seed in the Configuration?

That seems a bit too generic for this specific need, IMO.
We might want a `| SORT RANDOM()` function that might not feed on the same generator.
@@ -187,4 +189,13 @@ public static void loadEnrichPolicyResolution(EnrichResolution enrich, String po
public static IndexResolution tsdbIndexResolution() {
    return loadMapping("tsdb-mapping.json", "test");
}

public static <E> E randomValueOtherThanTest(Predicate<E> exclude, Supplier<E> supplier) {
Can we use `randomValueOtherThanMany()` from `ESTestCase` instead?
Unfortunately not, since the predicate tests against a floating point range, not discrete values, as `randomValueOtherThanMany()` requires.
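For context, the helper being discussed presumably looks something like this (a reconstructed sketch of its likely body, retrying the supplier until a candidate falls outside the exclusion predicate):
```java
import java.util.function.Predicate;
import java.util.function.Supplier;

public class RandomValues {
    // Draw candidates until one is not excluded; the predicate can express
    // a floating point range, which a discrete "other than these values"
    // helper cannot.
    public static <E> E randomValueOtherThanTest(Predicate<E> exclude, Supplier<E> supplier) {
        E value;
        do {
            value = supplier.get();
        } while (exclude.test(value));
        return value;
    }
}
```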
I wonder if the one form is different from the other. Do we want that too? I'd have to play with it to figure out what's happening in this version. Like, it's normal for us to shift...
See discussion here.
Ideally not dropping the remainder, just calculating the probability automatically, so that the result set contains ~20000. For the use case that @felixbarny likely referred to (profiling), the exact number isn't relevant.
I think the use case is relevant beyond profiling, but we should take inspiration from the approach that profiling is taking. One example would be applying log categorization on a sample of logs. But instead of having to specify the sampling percentage, which depends on how many logs match the condition and is not known upfront, I'd just specify the number of docs that I think would give me a representative result in a reasonable time (assuming the number of docs correlates with both accuracy and latency).
Superseded by #125570.
This adds a new command, `RANDOM_SAMPLE <probability> <seed>?`. The command produces as output a random selection of the rows it receives as input, maintaining their input order.
It takes a probability as first mandatory parameter, a double in the `(0.0, 1.0)` interval, and an optional seed, an integer value. The former determines the selectivity of the sampling, the latter its reproducibility. If no seed is provided, a random one will be used.
The command will be "pushed down" to Lucene whenever possible, using the existing `RandomSamplingQuery` (used by the Random sampler aggregation). This is usually the most efficient way to execute the command, since it controls which documents will be considered for loading (or further filtering), and the selection also runs over an entire Lucene segment.
When no "push down" is possible, an ESQL operator caches a number of input pages up to a (minimum) number of rows (currently equal to the configured maximum result set size) and then samples this batch. This means that (1) all the cached pages are first collected in memory (i.e. high memory pressure) and (2) the sampling runs over a significantly reduced document ID space compared to that of a segment's (i.e. poor sampling with low probabilities).
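A minimal sketch of that exec-engine path, under assumptions (`int[]` standing in for pages, `MAX_ROWS` for the configured maximum result set size; names made up, not the PR's operator). Contrast it with the sample-each-page-immediately alternative discussed in the review above:
```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SplittableRandom;

class BatchedSampler {
    static final int MAX_ROWS = 10_000;   // assumed: max result set size config
    private final double probability;
    private final SplittableRandom random;
    private final List<int[]> pages = new ArrayList<>();
    private int bufferedRows = 0;

    BatchedSampler(double probability, long seed) {
        this.probability = probability;
        this.random = new SplittableRandom(seed);
    }

    // Buffer pages; once MAX_ROWS rows are held, sample the whole batch at
    // once and release it. Returns null while still collecting (this is the
    // memory-pressure cost the description mentions).
    int[] addPage(int[] page) {
        pages.add(page);
        bufferedRows += page.length;
        if (bufferedRows < MAX_ROWS) {
            return null;
        }
        int[] sampled = pages.stream()
            .flatMapToInt(Arrays::stream)
            .filter(r -> random.nextDouble() < probability)
            .toArray();
        pages.clear();
        bufferedRows = 0;
        return sampled;
    }
}
```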