Description
#9834 is about optimizing a wide range of spatial queries. This task is about optimizing a subset of spatial joins where one relation is small enough to allow for a broadcast.
TL;DR: A spatial join running as a nested loop join spends virtually all of its time deserializing slices into geometry objects. A series of optimizations aimed at reducing the number of deserialization operations has a dramatic effect on performance, transforming a query that couldn't complete at all into one that finishes in under a minute of wall time.
I'm interested in optimizing spatial joins: queries which combine relations based on spatial relationships between geometry objects. At the moment, I'm focusing on joins where one relation is relatively small, fits on a single machine, and can therefore benefit from a broadcast join. In particular, I've been taking a very close look at the performance of a point-in-polygon join, one of the most common spatial joins. Such a join looks like this:
SELECT ...
FROM points, polygons
WHERE ST_Contains(ST_GeometryFromText(wkt), ST_Point(lon, lat))
I tried running these queries on production data as is. The query that I'm going to refer to throughout this note joins 15K polygons with 10M points. That query started running, got to 99% very fast, and then stayed there. It never finished. I cancelled it and set out to find where the time was going. I broke the join down into simple operations, measured the execution time of each, and built a cost model based on the size of the input data. Applying the cost model to a set of sample queries showed where the time was spent and helped identify effective optimizations. I went through 3 rounds of optimizations, each reducing the execution time dramatically: the first round reduced the time by 1000x, the second round by 100x, and the final round by 7x. I started with an estimated cost of 2 years and a hung query, and finished with an estimated cost of 2 minutes and a query that completes in a minute of wall time using a total of 17 min of CPU time.
Baseline: Nested Loop Join with a Filter
I turn on broadcast join using the SET SESSION distributed_join=false; command and run an EXPLAIN query to look at the query plan. As expected, there is a cross join node with a filter node on top of it:
- FilterProject[filterPredicate = "st_contains"("st_geometryfromtext"("wkt"), "st_point"("lon", "lat"))] => []
- CrossJoin => [lat:double, lon:double, wkt:varchar]
The cross join generates all possible combinations of points and polygons, and the filter expression ST_Contains(ST_GeometryFromText(wkt), ST_Point(lon, lat)) is then evaluated for each combination. Let's break this expression down into pieces and examine the cost of each piece.
1. ST_GeometryFromText parses the WKT string into an OGCGeometry object, then serializes it into a Slice (a fancy byte buffer).
   - ParseWktCost - depends on the complexity of the polygon, e.g. the number of points
2. ST_Point(lon, lat) creates an OGCPoint object and serializes it into a Slice.
   - NewPointCost - fixed
3. ST_Contains partially deserializes the polygon's Slice into an Envelope object representing the bounding box of the polygon.
   - DeserEnvelopeCost - fixed
4. ST_Contains partially deserializes the point's Slice into an Envelope object representing the bounding box of the point.
   - DeserEnvelopeCost - fixed
5. ST_Contains checks whether the bounding box of the polygon contains the bounding box of the point; if not, it returns false.
   - CheckEnvelopesCost - fixed
6. ST_Contains deserializes the polygon's Slice into an OGCGeometry object.
   - DeserPolygonCost - depends on the polygon's complexity, e.g. the number of points
7. ST_Contains deserializes the point's Slice into an OGCPoint object.
   - DeserPointCost - fixed
8. ST_Contains checks whether the polygon contains the point.
   - CheckGeometriesCost - depends on the complexity of the polygon, e.g. the number of points
NOTE: Steps 3-5 were introduced very recently in #9798. Later on, I'll build and evaluate a cost model that doesn't include these steps to show the effect this optimization has on join performance.
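To make the evaluation order concrete, here is a sketch of steps 3-8 in Java, assuming the Esri geometry library that the geospatial functions are built on; deserializeEnvelope and deserializeGeometry are hypothetical stand-ins for the actual partial and full Slice deserialization routines:

```java
import com.esri.core.geometry.Envelope;
import com.esri.core.geometry.ogc.OGCGeometry;
import io.airlift.slice.Slice;

public class StContainsSketch
{
    public static boolean stContains(Slice polygonSlice, Slice pointSlice)
    {
        // steps 3-5: cheap partial deserialization, bounding boxes only
        Envelope polygonBox = deserializeEnvelope(polygonSlice);  // DeserEnvelopeCost
        Envelope pointBox = deserializeEnvelope(pointSlice);      // DeserEnvelopeCost
        if (!polygonBox.contains(pointBox)) {                     // CheckEnvelopesCost
            return false;  // most pairs exit here
        }

        // steps 6-8: expensive full deserialization, candidate pairs only
        OGCGeometry polygon = deserializeGeometry(polygonSlice);  // DeserPolygonCost
        OGCGeometry point = deserializeGeometry(pointSlice);      // DeserPointCost
        return polygon.contains(point);                           // CheckGeometriesCost
    }

    // hypothetical stand-ins for the real serialization code
    private static Envelope deserializeEnvelope(Slice slice)
    {
        throw new UnsupportedOperationException("stand-in");
    }

    private static OGCGeometry deserializeGeometry(Slice slice)
    {
        throw new UnsupportedOperationException("stand-in");
    }
}
```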
The steps above give us the following formula for the cost of evaluating the ST_Contains expression for all pairs of points and polygons:

NumPolygons * NumPoints *
    (ParseWktCost + NewPointCost + DeserEnvelopeCost + DeserEnvelopeCost + CheckEnvelopesCost)
+ NumCandidatePairs * (DeserPolygonCost + DeserPointCost + CheckGeometriesCost)

where NumPolygons is the number of polygons, NumPoints is the number of points, and NumCandidatePairs is the number of (point, polygon) pairs such that the point lies within the polygon's bounding box. Steps 1-5 are executed for all pairs of points and polygons; steps 6-8 are executed only for the pairs that passed the bounding box check in step 5.
I estimated the individual costs above using microbenchmarks. These benchmarks show that ParseWktCost, DeserPolygonCost and CheckGeometriesCost increase linearly with the number of points in the polygon.
Operation | Cost (ns) | Notes |
---|---|---|
ParseWktCost | 400 | per point |
NewPointCost | 99 | fixed |
DeserEnvelopeCost | 60 | fixed |
CheckEnvelopesCost | 4 | fixed |
DeserPolygonCost | 40 | per point |
DeserPointCost | 114 | fixed |
CheckGeometriesCost | 10 | per point |
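For reference, the per-operation numbers come from JMH-style microbenchmarks along these lines (a sketch; the WKT literal is a placeholder, and the real runs varied the number of polygon points to derive the per-point costs):

```java
import java.util.concurrent.TimeUnit;

import com.esri.core.geometry.ogc.OGCGeometry;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class ParseWktBenchmark
{
    // placeholder polygon; ParseWktCost scales with the number of points
    private static final String WKT = "POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))";

    @Benchmark
    public OGCGeometry parseWkt()
    {
        return OGCGeometry.fromText(WKT);
    }
}
```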
Since the variable costs increase linearly with the number of points in the polygon, I'm going to rewrite the formula using AvgNumPointsPerPolygon and XxxCostPerPoint variables:

NumPolygons * NumPoints * (
    AvgNumPointsPerPolygon * ParseWktCostPerPoint // WKT parsing
    + NewPointCost + DeserEnvelopeCost + DeserEnvelopeCost + CheckEnvelopesCost)
+ NumCandidatePairs * (DeserPointCost
    + AvgNumPointsPerPolygon * (DeserPolygonCostPerPoint + CheckGeometriesCostPerPoint))
Now, I'm ready to apply the formula to sample joins. Since I'm looking for a bottleneck, I'm going to split the formula into 3 parts and calculate what percentage of the total cost each part is responsible for.
- WKT parsing:
  NumPolygons * NumPoints * AvgNumPointsPerPolygon * 400
- Final evaluation of the candidate pairs (steps 6-8):
  NumCandidatePairs * (114 + AvgNumPointsPerPolygon * 50) // 50 = 40 + 10
- Other (steps 2-5):
  NumPolygons * NumPoints * 223 // 223 = 99 + 60 + 60 + 4
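To make the split mechanical, here is the same three-part model written out as code (a sketch; the constants are the microbenchmark numbers above, in nanoseconds):

```java
public final class BaselineCostModel
{
    // microbenchmark estimates, in nanoseconds
    static final double PARSE_WKT_PER_POINT = 400;
    static final double NEW_POINT = 99;
    static final double DESER_ENVELOPE = 60;
    static final double CHECK_ENVELOPES = 4;
    static final double DESER_POLYGON_PER_POINT = 40;
    static final double DESER_POINT = 114;
    static final double CHECK_GEOMETRIES_PER_POINT = 10;

    static final double NANOS_PER_SECOND = 1e9;

    // WKT parsing: paid for every (point, polygon) pair
    static double wktSeconds(long polygons, long points, double avgPointsPerPolygon)
    {
        return polygons * points * avgPointsPerPolygon * PARSE_WKT_PER_POINT / NANOS_PER_SECOND;
    }

    // final evaluation of the candidate pairs (steps 6-8)
    static double finalEvalSeconds(long candidatePairs, double avgPointsPerPolygon)
    {
        return candidatePairs
                * (DESER_POINT + avgPointsPerPolygon * (DESER_POLYGON_PER_POINT + CHECK_GEOMETRIES_PER_POINT))
                / NANOS_PER_SECOND;
    }

    // other: point creation, envelope deserialization, envelope checks (steps 2-5)
    static double otherSeconds(long polygons, long points)
    {
        return polygons * points * (NEW_POINT + 2 * DESER_ENVELOPE + CHECK_ENVELOPES) / NANOS_PER_SECOND;
    }
}
```

For example, wktSeconds(15_000, 10_000_000, 928) alone comes to about 55,680,000 sec - roughly 1.8 years of CPU time.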
I collected the following samples:
No | NumPolygons | NumPoints | NumCandidatePairs | AvgPointsPerPolygon |
---|---|---|---|---|
1 | 1,000 | 300,000 | 100,000 | 193 |
2 | 33 | 1,000,000 | 125,000 | 1,219 |
3 | 120 | 1,000,000 | 100,000 | 623 |
4 | 15,000 | 10,000,000 | Unknown | 928 |
I don't have a NumCandidatePairs value for the last sample because it is too expensive to compute. I'll run the formula with a couple of different guesses.
No | NumPolygons | NumPoints | NumCandidatePairs | AvgPointsPerPolygon | WKT (sec) | FinalEval (sec) | Other (sec) | Total (sec) | WKT (%) |
---|---|---|---|---|---|---|---|---|---|
1 | 1,000 | 300,000 | 100,000 | 193 | 23,160 | 110.01 | 66.9 | 23,336.91 | 99 |
2 | 33 | 1,000,000 | 125,000 | 1,219 | 16,090.8 | 868.54 | 7.36 | 16,966.7 | 95 |
3 | 120 | 1,000,000 | 100,000 | 623 | 29,904 | 355.11 | 26.76 | 30,285.87 | 99 |
4 | 15,000 | 10,000,000 | 1,000,000 | 928 | 55,680,000 | 5,289.6 | 33,450 | 55,718,739.6 | 100 |
4.1 | 15,000 | 10,000,000 | 10,000,000 | 928 | 55,680,000 | 52,896 | 33,450 | 55,766,346 | 100 |
Insight 1. WKT parsing dominates the cost.
We parse WKT for every pair of a point and a polygon, so the WKT of each polygon is parsed again and again, once per point. Can we parse the WKT only once per polygon? Absolutely.
Nested Loop with a Filter Optimized by Pushing Down WKT Parsing
Let's rewrite the query and push down ST_GeometryFromText(wkt) projection:
SELECT ...
FROM points, (SELECT ST_GeometryFromText(wkt) as geometry FROM polygons)
WHERE ST_Contains(geometry, ST_Point(lon, lat))
Let's update the cost model and compute new costs for the sample queries.
NumPolygons * AvgNumPointsPerPolygon * ParseWktCostPerPoint // WKT parsing
+ NumPolygons * NumPoints * (
    NewPointCost + DeserEnvelopeCost + DeserEnvelopeCost + CheckEnvelopesCost)
+ NumCandidatePairs * (DeserPointCost
    + AvgNumPointsPerPolygon * (DeserPolygonCostPerPoint + CheckGeometriesCostPerPoint))
No | NumPolygons | NumPoints | NumCandidatePairs | AvgPointsPerPolygon | WKT (sec) | FinalEval (sec) | Other (sec) | Total (sec) | WKT (%) | Final Eval (%) | Other (%) |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1,000 | 300,000 | 100,000 | 193 | 0.08 | 0.98 | 66.9 | 67.95 | 0 | 1 | 98 |
2 | 33 | 1,000,000 | 125,000 | 1,219 | 0.02 | 7.63 | 7.36 | 15.01 | 0 | 51 | 49 |
3 | 120 | 1,000,000 | 100,000 | 623 | 0.03 | 3.13 | 26.76 | 29.92 | 0 | 10 | 89 |
4 | 15,000 | 10,000,000 | 1,000,000 | 928 | 5.57 | 46.51 | 33,450 | 33,502.08 | 0 | 0 | 100 |
4.1 | 15,000 | 10,000,000 | 10,000,000 | 928 | 5.57 | 465.14 | 33,450 | 33,920.71 | 0 | 1 | 99 |
The cost dropped about 1000x, but it is still high in absolute terms. Hence, I'll keep looking.
Custom Execution
The cost of WKT parsing is no longer significant, but the "other" category is. There are 4 operations in the "other" pile, each performed for every pair of a point and a polygon:

NumPoints * NumPolygons * (NewPointCost + DeserEnvelopeCost + DeserEnvelopeCost + CheckEnvelopesCost)
These are creation of points and partial deserialization of points and polygons. Each point is created (via a call to ST_Point) and deserialized as many times as there are polygons. The total cost of creating points is
NumPoints * NumPolygons * NewPointCost
and the cost of partial deserialization of points and polygons is
NumPoints * NumPolygons * DeserEnvelopeCost * 2
These costs add up to (in seconds):

No | NumPolygons | NumPoints | Partial deserialization of points and polygons | New point creations | Total |
---|---|---|---|---|---|
1 | 1,000 | 300,000 | 36 | 29.7 | 65.7 |
2 | 33 | 1,000,000 | 3.96 | 3.27 | 7.23 |
3 | 120 | 1,000,000 | 14.4 | 11.88 | 26.28 |
4 | 15,000 | 10,000,000 | 18,000 | 14,850 | 32,850 |
4.1 | 15,000 | 10,000,000 | 18,000 | 14,850 | 32,850 |
Insight 2. 53% of the "other" time is spent on partial deserialization of points and polygons.
Insight 3. 44% of the "other" time is spent on point creation.
Let's rewrite the query to push down ST_Point as well and eliminate the excessive cost of point creation:
SELECT ...
FROM (SELECT ST_Point(lon, lat) as point FROM points),
(SELECT ST_GeometryFromText(wkt) as geometry FROM polygons)
WHERE ST_Contains(geometry, point)
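With this rewrite, ST_Point is evaluated NumPoints times instead of NumPoints * NumPolygons times. For sample query 4, point creation drops from 15,000 * 10,000,000 * 99 ns ≈ 14,850 sec to 10,000,000 * 99 ns ≈ 1 sec.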
To eliminate the remaining partial deserializations, I implemented custom execution for the join: #9474 . In this implementation, each polygon is deserialized from its Slice just once, and the full set of polygon objects is organized into an R-Tree. Then each point is deserialized from its Slice just once, the R-Tree is searched for polygons with matching bounding boxes, and the final check is performed on each candidate. We now have the following operations (see the sketch after this list):
- For each polygon:
  - Parse the WKT string
  - Deserialize the Slice into an OGCGeometry object
- Build an R-Tree over the polygons
- For each point:
  - Evaluate ST_Point(lon, lat) to create an OGCPoint object and serialize it into a Slice
  - Deserialize the Slice into an OGCPoint object
  - Query the R-Tree for polygons with matching bounding boxes
  - For each candidate match, check whether the polygon contains the point
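Here is a minimal sketch of this scheme, using the Esri geometry library for the geometries and JTS's STRtree standing in for the R-Tree (the actual implementation in #9474 differs in the details):

```java
import java.util.List;

import com.esri.core.geometry.Envelope2D;
import com.esri.core.geometry.Point;
import com.esri.core.geometry.ogc.OGCGeometry;
import com.esri.core.geometry.ogc.OGCPoint;
import org.locationtech.jts.geom.Envelope;
import org.locationtech.jts.index.strtree.STRtree;

public class PointInPolygonJoinSketch
{
    // build side: parse and deserialize each polygon exactly once, index by bounding box
    public static STRtree buildIndex(List<String> polygonWkts)
    {
        STRtree rtree = new STRtree();
        for (String wkt : polygonWkts) {
            OGCGeometry polygon = OGCGeometry.fromText(wkt);  // ParseWktCost, once per polygon
            Envelope2D box = new Envelope2D();
            polygon.getEsriGeometry().queryEnvelope2D(box);
            rtree.insert(new Envelope(box.xmin, box.xmax, box.ymin, box.ymax), polygon);
        }
        rtree.build();  // BuildRTreeCost
        return rtree;
    }

    // probe side: each point is created and checked exactly once
    public static boolean containsPoint(STRtree rtree, double lon, double lat)
    {
        OGCPoint point = new OGCPoint(new Point(lon, lat), null);  // NewPointCost, once per point
        for (Object candidate : rtree.query(new Envelope(lon, lon, lat, lat))) {  // QueryRTreeCost
            if (((OGCGeometry) candidate).contains(point)) {  // CheckGeometriesCost, candidates only
                return true;
            }
        }
        return false;
    }
}
```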
The new cost model is
NumPolygons * AvgNumPointsPerPolygon * (
ParseWktCostPerPoint + DeserPolygonCostPerPoint)
+ BuildRTreeCost
+ NumPoints * (NewPointCost + DeserPointCost + QueryRTreeCost)
+ NumCandidatePairs * AvgNumPointsPerPolygon * CheckGeometriesCostPerPoint
where BuildRTreeCost is the cost of building the R-Tree and QueryRTreeCost is the cost of a single R-Tree lookup. I ran a microbenchmark with 15,000 polygons to estimate these costs:
- BuildRTreeCost - 10,000,000 ns
- QueryRTreeCost - 300 ns
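As a sanity check, here is the new model written out as code (a sketch reusing the nanosecond constants from the earlier table):

```java
// custom-execution cost model; returns estimated seconds
static double customExecutionSeconds(long polygons, long points, long candidatePairs, double avgPointsPerPolygon)
{
    double buildRTree = 10_000_000;  // ns, measured for ~15K polygons
    double queryRTree = 300;         // ns per lookup

    double buildSide = polygons * avgPointsPerPolygon * (400 + 40)  // parse + deserialize each polygon once
            + buildRTree;
    double probeSide = points * (99 + 114 + queryRTree);            // create, deserialize and look up each point once
    double finalCheck = candidatePairs * avgPointsPerPolygon * 10;  // geometry check for candidate pairs only
    return (buildSide + probeSide + finalCheck) / 1e9;
}
```

For sample query 4, customExecutionSeconds(15_000, 10_000_000, 1_000_000, 928) returns about 20.5 sec, which matches row 4 below.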
Applying the new model to sample queries gives the following results:
No | NumPolygons | NumPoints | NumCandidatePairs | AvgPointsPerPolygon | WKT (sec) | FinalEval (sec) | Other (sec) | Total (sec) |
---|---|---|---|---|---|---|---|---|
1 | 1,000 | 300,000 | 100,000 | 193 | 0.08 | 0.19 | 0.17 | 0.44 |
2 | 33 | 1,000,000 | 125,000 | 1,219 | 0.02 | 1.52 | 0.52 | 2.06 |
3 | 120 | 1,000,000 | 100,000 | 623 | 0.03 | 0.62 | 0.53 | 1.18 |
4 | 15,000 | 10,000,000 | 1,000,000 | 928 | 5.57 | 9.28 | 5.7 | 20.54 |
4.1 | 15,000 | 10,000,000 | 10,000,000 | 928 | 5.57 | 92.8 | 5.7 | 104.06 |
The cost dropped another 100 times (from about 33,000 sec to about 100 sec). These numbers look promising; I'm ready to try running the production queries again.
No | NumPolygons | NumPoints | NumCandidatePairs | AvgPointsPerPolygon | WKT (sec) | FinalEval (sec) | Other (sec) | Total (sec) | Total CPU Time (Actual, min) |
---|---|---|---|---|---|---|---|---|---|
x | 14,930 | 9,560,607 | 6,108,848 | 928 | 5.54 | 56.69 | 5.47 | 67.7 | 15.37 |
y | 647 | 41,844,107 | 323,335 | 550 | 0.14 | 1.78 | 21.49 | 23.41 | 7.7 |
z | 134 | 10,839,296 | 392,683 | 241 | 0.01 | 0.95 | 5.57 | 6.53 | 1.75 |
u | 1,122 | 9,733,033 | 200,450 | 628 | 0.28 | 1.26 | 5.03 | 6.57 | 3.63 |
w | 5,202 | 9,323,410 | 11,813,304 | 755 | 1.57 | 89.19 | 4.95 | 95.71 | 16.27 |
The production queries completed, and rather quickly!
Also, I'm now able to compute NumCandidatePairs for sample query 4 (it corresponds to production query 'x' in the table above). It is about 6,100,000 - right between the two guesses I made.
Partial Deserialization in ST_Contains
We can now estimate the effect of the earlier ST_Contains optimization that introduced partial deserialization (steps 3-5 above). Without steps 3-5, the cost model becomes
NumPolygons * NumPoints * (
AvgNumPointsPerPolygon * ParseWktCostPerPoint // WKT parsing
+ NewPointCost + DeserPointCost
+ AvgNumPointsPerPolygon * DeserPolygonCostPerPoint + CheckEnvelopesCost)
+ NumCandidatePairs * AvgNumPointsPerPolygon * CheckGeometriesCostPerPoint
and the results of applying that model to sample queries are:
No | NumPolygons | NumPoints | NumCandidatePairs | AvgPointsPerPolygon | WKT (sec) | FinalEval (sec) | Other (sec) | Total (sec) | WKT (%) | Final Eval (%) | Other (%) |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1,000 | 300,000 | 100,000 | 193 | 23,160 | 0.19 | 2,381.1 | 25,541.29 | 91 | 0 | 9 |
2 | 33 | 1,000,000 | 125,000 | 1,219 | 16,090.8 | 1.52 | 1,616.24 | 17,708.56 | 91 | 0 | 9 |
3 | 120 | 1,000,000 | 100,000 | 623 | 29,904 | 0.62 | 3,016.44 | 32,921.06 | 91 | 0 | 9 |
4 | 15,000 | 10,000,000 | 1,000,000 | 928 | 55,680,000 | 9.28 | 5,600,550 | 61,280,559.28 | 91 | 0 | 9 |
4.1 | 15,000 | 10,000,000 | 10,000,000 | 928 | 55,680,000 | 92.8 | 5,600,550 | 61,280,642.8 | 91 | 0 | 9 |
We can now compare the estimated costs (in seconds) of the following implementations:
- No optimizations
- Partial deserialization in ST_Contains optimization, no WKT pushdown
- WKT pushdown and partial deserialization in ST_Contains
- Custom execution
No | No optimizations | Only deserialization opt | Both WKT and deser opt | Custom execution |
---|---|---|---|---|
1 | 25,541 | 23,337 | 68 | 0 |
2 | 17,709 | 16,967 | 15 | 2 |
3 | 32,921 | 30,286 | 30 | 1 |
4 | 61,280,559 | 55,718,740 | 33,502 | 21 |
4.1 | 61,280,643 | 55,766,346 | 33,921 | 104 |
The WKT optimization shed about 90% of the cost, and the partial deserialization optimization eliminated nearly all of the remaining 10%.
Use of R-Tree in Custom Execution
Finally, to estimate the benefit of using an R-Tree in custom execution, I'm comparing the cost of a linear scan over polygon bounding boxes against the cost of R-Tree lookups:
Linear search: NumPoints * NumPolygons * CheckEnvelopesCost
vs.
R-Tree lookups: BuildRTreeCost + NumPoints * QueryRTreeCost
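For sample query 4, that is 15,000 * 10,000,000 * 4 ns = 600 sec for the linear scan versus 10,000,000 ns + 10,000,000 * 300 ns ≈ 3 sec for the R-Tree.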
Here are the numbers:
No | NumPolygons | NumPoints | R-Tree lookups (sec) | Linear search (sec) |
---|---|---|---|---|
1 | 1,000 | 300,000 | 0.1 | 1.2 |
2 | 33 | 1,000,000 | 0.31 | 0.13 |
3 | 120 | 1,000,000 | 0.31 | 0.48 |
4 | 15,000 | 10,000,000 | 3.01 | 600 |
4.1 | 15,000 | 10,000,000 | 3.01 | 600 |
The cost model without using the R-Tree is:
NumPolygons * AvgNumPointsPerPolygon * (
ParseWktCostPerPoint + DeserPolygonCostPerPoint)
+ NumPoints * (NewPointCost + DeserPointCost + NumPolygons * CheckEnvelopesCost)
+ NumCandidatePairs * AvgNumPointsPerPolygon * CheckGeometriesCostPerPoint
and the costs for sample queries are:
No | NumPolygons | NumPoints | NumCandidatePairs | AvgPointsPerPolygon | WKT (sec) | FinalEval (sec) | Other (sec) | Total (sec) |
---|---|---|---|---|---|---|---|---|
1 | 1,000 | 300,000 | 100,000 | 193 | 0.08 | 0.19 | 1.27 | 1.54 |
2 | 33 | 1,000,000 | 125,000 | 1,219 | 0.02 | 1.52 | 0.35 | 1.89 |
3 | 120 | 1,000,000 | 100,000 | 623 | 0.03 | 0.62 | 0.7 | 1.35 |
4 | 15,000 | 10,000,000 | 1,000,000 | 928 | 5.57 | 9.28 | 602.69 | 617.53 |
4.1 | 15,000 | 10,000,000 | 10,000,000 | 928 | 5.57 | 92.8 | 602.69 | 701.05 |
Comparing the costs of the R-Tree and linear implementations side by side shows that the R-Tree reduced the cost 7 times for sample query 4.1. Note that for sample 2, with only 33 polygons, the linear scan is actually slightly cheaper: building and querying the R-Tree only pays off when the number of polygons is large.
No | W/ R-Tree (sec) | W/o R-Tree (sec) |
---|---|---|
1 | 0.44 | 1.54 |
2 | 2.06 | 1.89 |
3 | 1.18 | 1.35 |
4 | 20.54 | 617.53 |
4.1 | 104.06 | 701.05 |