Commit c50cfa0

Add Boxplot Aggregation (#51948)
Adds a `boxplot` aggregation that calculates the min, max, median, and the first and third quartiles of the given data set. Closes #33112
1 parent e95cc14 commit c50cfa0

13 files changed, +1357 -3 lines changed

Lines changed: 185 additions & 0 deletions
@@ -0,0 +1,185 @@
+[role="xpack"]
+[testenv="basic"]
+[[search-aggregations-metrics-boxplot-aggregation]]
+=== Boxplot Aggregation
+
+A `boxplot` metrics aggregation computes a box plot of the numeric values extracted from the aggregated documents.
+These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.
+
+The `boxplot` aggregation returns the essential information for drawing a https://en.wikipedia.org/wiki/Box_plot[box plot]: minimum, maximum,
+median, first quartile (25th percentile) and third quartile (75th percentile) values.
+
+==== Syntax
+
+A `boxplot` aggregation looks like this in isolation:
+
+[source,js]
+--------------------------------------------------
+{
+    "boxplot": {
+        "field": "load_time"
+    }
+}
+--------------------------------------------------
+// NOTCONSOLE
+
+Let's look at a boxplot representing load time:
+
+[source,console]
+--------------------------------------------------
+GET latency/_search
+{
+    "size": 0,
+    "aggs" : {
+        "load_time_boxplot" : {
+            "boxplot" : {
+                "field" : "load_time" <1>
+            }
+        }
+    }
+}
+--------------------------------------------------
+// TEST[setup:latency]
+<1> The field `load_time` must be a numeric field
+
+The response will look like this:
+
+[source,console-result]
+--------------------------------------------------
+{
+    ...
+
+    "aggregations": {
+        "load_time_boxplot": {
+            "min": 0.0,
+            "max": 990.0,
+            "q1": 165.0,
+            "q2": 445.0,
+            "q3": 725.0
+        }
+    }
+}
+--------------------------------------------------
+// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
+
+==== Script
+
+The boxplot metric supports scripting. For example, if our load times
+are in milliseconds but we want values calculated in seconds, we could use
+a script to convert them on-the-fly:
+
+[source,console]
+--------------------------------------------------
+GET latency/_search
+{
+    "size": 0,
+    "aggs" : {
+        "load_time_boxplot" : {
+            "boxplot" : {
+                "script" : {
+                    "lang": "painless",
+                    "source": "doc['load_time'].value / params.timeUnit", <1>
+                    "params" : {
+                        "timeUnit" : 1000 <2>
+                    }
+                }
+            }
+        }
+    }
+}
+--------------------------------------------------
+// TEST[setup:latency]
+
+<1> The `field` parameter is replaced with a `script` parameter, which uses the
+script to generate the values on which the boxplot is calculated
+<2> Scripting supports parameterized input just like any other script
+
+This will interpret the `script` parameter as an `inline` script in the `painless` script language, with `timeUnit` passed as a script parameter. To use a
+stored script use the following syntax:
+
+[source,console]
+--------------------------------------------------
+GET latency/_search
+{
+    "size": 0,
+    "aggs" : {
+        "load_time_boxplot" : {
+            "boxplot" : {
+                "script" : {
+                    "id": "my_script",
+                    "params": {
+                        "field": "load_time"
+                    }
+                }
+            }
+        }
+    }
+}
+--------------------------------------------------
+// TEST[setup:latency,stored_example_script]
+
+[[search-aggregations-metrics-boxplot-aggregation-approximation]]
+==== Boxplot values are (usually) approximate
+
+The algorithm used by the `boxplot` metric is called TDigest (introduced by
+Ted Dunning in
+https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf[Computing Accurate Quantiles using T-Digests]).
+
+[WARNING]
+====
+Like other percentile aggregations, the `boxplot` aggregation is
+https://en.wikipedia.org/wiki/Nondeterministic_algorithm[non-deterministic].
+This means you can get slightly different results using the same data.
+====
+
+[[search-aggregations-metrics-boxplot-aggregation-compression]]
+==== Compression
+
+Approximate algorithms must balance memory utilization with estimation accuracy.
+This balance can be controlled using a `compression` parameter:
+
+[source,console]
+--------------------------------------------------
+GET latency/_search
+{
+    "size": 0,
+    "aggs" : {
+        "load_time_boxplot" : {
+            "boxplot" : {
+                "field" : "load_time",
+                "compression" : 200 <1>
+            }
+        }
+    }
+}
+--------------------------------------------------
+// TEST[setup:latency]
+
+<1> Compression controls memory usage and approximation error
+
+include::percentile-aggregation.asciidoc[tags=t-digest]
+
+==== Missing value
+
+The `missing` parameter defines how documents that are missing a value should be treated.
+By default they will be ignored, but it is also possible to treat them as if they
+had a value.
+
+[source,console]
+--------------------------------------------------
+GET latency/_search
+{
+    "size": 0,
+    "aggs" : {
+        "grade_boxplot" : {
+            "boxplot" : {
+                "field" : "grade",
+                "missing": 10 <1>
+            }
+        }
+    }
+}
+--------------------------------------------------
+// TEST[setup:latency]
+
+<1> Documents without a value in the `grade` field will be treated as if they had the value `10`.
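
For intuition about the `min`, `q1`, `q2`, `q3` and `max` values this documentation describes, here is a minimal sketch that computes them exactly over a small in-memory array, using linear interpolation between the closest ranks. It is an illustrative stand-in only: the aggregation itself relies on TDigest approximations, and the class and method names below are hypothetical.

```java
import java.util.Arrays;

public class BoxplotSketch {

    /** Exact quantile of a sorted array, interpolating linearly between closest ranks (q in [0, 1]). */
    static double quantile(double[] sorted, double q) {
        double pos = q * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    /** Returns {min, q1, q2, q3, max} for a non-empty input. Not TDigest: exact and O(n log n). */
    static double[] boxplot(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("values must not be empty");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        return new double[] {
            sorted[0],
            quantile(sorted, 0.25),
            quantile(sorted, 0.5),
            quantile(sorted, 0.75),
            sorted[sorted.length - 1]
        };
    }

    public static void main(String[] args) {
        // prints [1.0, 3.0, 5.0, 7.0, 9.0]
        System.out.println(Arrays.toString(boxplot(new double[] {7, 1, 3, 5, 9})));
    }
}
```

On large data sets an exact computation like this needs every value in memory, which is the cost the approximate TDigest approach avoids.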

docs/reference/aggregations/metrics/percentile-aggregation.asciidoc

Lines changed: 2 additions & 0 deletions
@@ -285,6 +285,7 @@ GET latency/_search

 <1> Compression controls memory usage and approximation error

+// tag::t-digest[]
 The TDigest algorithm uses a number of "nodes" to approximate percentiles -- the
 more nodes available, the higher the accuracy (and large memory footprint) proportional
 to the volume of data. The `compression` parameter limits the maximum number of
@@ -300,6 +301,7 @@ A "node" uses roughly 32 bytes of memory, so under worst-case scenarios (large a
 of data which arrives sorted and in-order) the default settings will produce a
 TDigest roughly 64KB in size. In practice data tends to be more random and
 the TDigest will use less memory.
+// end::t-digest[]

 ==== HDR Histogram
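
The "roughly 64KB" worst-case figure in the hunk above is simple arithmetic. The sketch below assumes a node cap of `20 * compression` and a default `compression` of 100, as described in the percentiles documentation; neither value appears in this hunk, so both are assumptions here, and the class name is hypothetical.

```java
public class TDigestMemoryEstimate {

    // "A node uses roughly 32 bytes of memory", per the text above.
    static final int BYTES_PER_NODE = 32;

    /** Worst-case size in bytes, assuming at most 20 * compression nodes (an assumption, see lead-in). */
    static long worstCaseBytes(int compression) {
        long maxNodes = 20L * compression;
        return maxNodes * BYTES_PER_NODE;
    }

    public static void main(String[] args) {
        System.out.println(worstCaseBytes(100)); // prints 64000, about 64KB
        System.out.println(worstCaseBytes(200)); // prints 128000, the compression value used in the example request
    }
}
```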

server/src/test/java/org/elasticsearch/search/aggregations/metrics/MinAggregatorTests.java

Lines changed: 1 addition & 1 deletion
@@ -378,7 +378,7 @@ public void testGetProperty() throws IOException {
             iw.addDocument(singleton(new NumericDocValuesField("number", 7)));
             iw.addDocument(singleton(new NumericDocValuesField("number", 1)));
         }, (Consumer<InternalGlobal>) global -> {
-            assertEquals(1.0, global.getDocCount(), 2);
+            assertEquals(2, global.getDocCount());
             assertTrue(AggregationInspectionHelper.hasValue(global));
             assertNotNull(global.getAggregations().asMap().get("min"));
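
The removed assertion used the three-argument `assertEquals(expected, actual, delta)`, which passes whenever the two values differ by at most `delta`. With an expected value of 1.0 and a delta of 2, any doc count from -1 to 3 would have passed, so the old check could not catch a wrong count. A dependency-free sketch of that comparison follows; the helper name is hypothetical and mirrors JUnit 4 semantics as an assumption rather than calling JUnit itself.

```java
public class DeltaAssertSketch {

    /** True when a delta-based equality check would pass for the given values. */
    static boolean withinDelta(double expected, double actual, double delta) {
        return Double.compare(expected, actual) == 0 || Math.abs(expected - actual) <= delta;
    }

    public static void main(String[] args) {
        for (long docCount = 0; docCount <= 4; docCount++) {
            boolean oldCheck = withinDelta(1.0, docCount, 2); // old: assertEquals(1.0, docCount, 2)
            boolean newCheck = docCount == 2;                 // new: assertEquals(2, docCount)
            System.out.println(docCount + " old=" + oldCheck + " new=" + newCheck);
        }
    }
}
```

Only the new exact comparison singles out a doc count of 2, which matches the two documents the test indexes.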

x-pack/plugin/analytics/src/main/java/org/elasticsearch/xpack/analytics/AnalyticsPlugin.java

Lines changed: 11 additions & 2 deletions
@@ -7,12 +7,15 @@

 import org.elasticsearch.action.ActionRequest;
 import org.elasticsearch.action.ActionResponse;
+import org.elasticsearch.common.xcontent.ContextParser;
 import org.elasticsearch.index.mapper.Mapper;
 import org.elasticsearch.license.XPackLicenseState;
 import org.elasticsearch.plugins.ActionPlugin;
 import org.elasticsearch.plugins.MapperPlugin;
 import org.elasticsearch.plugins.Plugin;
 import org.elasticsearch.plugins.SearchPlugin;
+import org.elasticsearch.search.aggregations.AggregationBuilder;
+import org.elasticsearch.xpack.analytics.boxplot.InternalBoxplot;
 import org.elasticsearch.xpack.analytics.mapper.HistogramFieldMapper;
 import org.elasticsearch.xpack.core.XPackPlugin;
 import org.elasticsearch.xpack.core.action.XPackInfoFeatureAction;
@@ -21,6 +24,7 @@
 import org.elasticsearch.xpack.analytics.action.AnalyticsInfoTransportAction;
 import org.elasticsearch.xpack.analytics.action.AnalyticsUsageTransportAction;
 import org.elasticsearch.xpack.analytics.action.TransportAnalyticsStatsAction;
+import org.elasticsearch.xpack.analytics.boxplot.BoxplotAggregationBuilder;
 import org.elasticsearch.xpack.analytics.cumulativecardinality.CumulativeCardinalityPipelineAggregationBuilder;
 import org.elasticsearch.xpack.analytics.cumulativecardinality.CumulativeCardinalityPipelineAggregator;
 import org.elasticsearch.xpack.analytics.stringstats.InternalStringStats;
@@ -56,11 +60,16 @@ public List<PipelineAggregationSpec> getPipelineAggregations() {

     @Override
     public List<AggregationSpec> getAggregations() {
-        return singletonList(
+        return Arrays.asList(
             new AggregationSpec(
                 StringStatsAggregationBuilder.NAME,
                 StringStatsAggregationBuilder::new,
-                StringStatsAggregationBuilder::parse).addResultReader(InternalStringStats::new)
+                StringStatsAggregationBuilder::parse).addResultReader(InternalStringStats::new),
+            new AggregationSpec(
+                BoxplotAggregationBuilder.NAME,
+                BoxplotAggregationBuilder::new,
+                (ContextParser<String, AggregationBuilder>) (p, c) -> BoxplotAggregationBuilder.parse(c, p))
+                .addResultReader(InternalBoxplot::new)
         );
     }
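
Judging from the cast in the hunk above, the lambda `(p, c) -> BoxplotAggregationBuilder.parse(c, p)` exists only to swap argument order: the callback receives the parser first and the aggregation name second, while the static `parse` method appears to take them the other way round. A self-contained sketch of that adapter pattern, using hypothetical stand-in types rather than the Elasticsearch classes:

```java
public class ArgumentOrderAdapterSketch {

    /** Hypothetical stand-in for a callback that receives (parser, context). */
    interface TwoArgParser<P, C, T> {
        T parse(P parser, C context);
    }

    /** Hypothetical stand-in for a static factory that takes (name, parser). */
    static String parse(String aggregationName, StringBuilder parser) {
        return aggregationName + ":" + parser;
    }

    public static void main(String[] args) {
        // The lambda reorders the arguments so the factory fits the callback shape.
        TwoArgParser<StringBuilder, String, String> adapter = (p, c) -> parse(c, p);
        System.out.println(adapter.parse(new StringBuilder("{}"), "load_time_boxplot")); // prints load_time_boxplot:{}
    }
}
```

A plain method reference would not work here because the parameter order differs, which is presumably why the diff uses an explicit lambda with a cast.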

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
+/*
+ * Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
+ * or more contributor license agreements. Licensed under the Elastic License;
+ * you may not use this file except in compliance with the Elastic License.
+ */
+
+package org.elasticsearch.xpack.analytics.boxplot;
+
+import org.elasticsearch.search.aggregations.metrics.NumericMetricsAggregation;
+
+public interface Boxplot extends NumericMetricsAggregation.MultiValue {
+
+    /**
+     * @return The minimum value of all aggregated values.
+     */
+    double getMin();
+
+    /**
+     * @return The maximum value of all aggregated values.
+     */
+    double getMax();
+
+    /**
+     * @return The first quartile of all aggregated values.
+     */
+    double getQ1();
+
+    /**
+     * @return The second quartile of all aggregated values.
+     */
+    double getQ2();
+
+    /**
+     * @return The third quartile of all aggregated values.
+     */
+    double getQ3();
+
+    /**
+     * @return The minimum value of all aggregated values as a String.
+     */
+    String getMinAsString();
+
+    /**
+     * @return The maximum value of all aggregated values as a String.
+     */
+    String getMaxAsString();
+
+    /**
+     * @return The first quartile of all aggregated values as a String.
+     */
+    String getQ1AsString();
+
+    /**
+     * @return The second quartile of all aggregated values as a String.
+     */
+    String getQ2AsString();
+
+    /**
+     * @return The third quartile of all aggregated values as a String.
+     */
+    String getQ3AsString();
+
+}
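
As a rough illustration of how a caller might consume the five numeric getters, the sketch below derives the interquartile range from `q1` and `q3`. It uses a hypothetical stand-in interface with the same getter names so that it runs without Elasticsearch on the classpath; the sample values are the ones from the documentation's example response.

```java
public class BoxplotConsumerSketch {

    /** Hypothetical stand-in exposing the same five numeric getters as the Boxplot interface. */
    interface FiveNumberSummary {
        double getMin();
        double getQ1();
        double getQ2();
        double getQ3();
        double getMax();
    }

    /** Builds a fixed summary; a real caller would read these from a search response instead. */
    static FiveNumberSummary of(double min, double q1, double q2, double q3, double max) {
        return new FiveNumberSummary() {
            public double getMin() { return min; }
            public double getQ1() { return q1; }
            public double getQ2() { return q2; }
            public double getQ3() { return q3; }
            public double getMax() { return max; }
        };
    }

    /** Interquartile range: the height of the box in a box plot. */
    static double interquartileRange(FiveNumberSummary s) {
        return s.getQ3() - s.getQ1();
    }

    public static void main(String[] args) {
        FiveNumberSummary loadTime = of(0.0, 165.0, 445.0, 725.0, 990.0);
        System.out.println(interquartileRange(loadTime)); // prints 560.0
    }
}
```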
