
Optimize broadcast spatial join #9890

Closed
@mbasmanova

Description

#9834 is about optimizing a wide range of spatial queries. This task is about optimizing a subset of spatial joins where one relation is small enough to allow for a broadcast.

TL;DR A spatial join running as a nested loop join spends practically all of its time deserializing slices into geometry objects. A series of optimizations aimed at reducing the number of deserialization operations has a dramatic effect on performance, transforming a query that couldn't run to completion into one that completes in under a minute of wall time.

I'm interested in optimizing spatial joins, i.e. queries that combine relations based on spatial relationships between geometry objects. At the moment, I'm focusing on joins where one relation is relatively small, fits on a single machine, and therefore can benefit from a broadcast join. In particular, I've been taking a very close look at the performance of a point-in-polygon join, one of the most common spatial joins. Such a join looks like this:

SELECT ...
FROM points, polygons
WHERE ST_Contains(ST_GeometryFromText(wkt), ST_Point(lon, lat))

I tried running these queries on production data as is. The query that I'm going to refer to throughout this note joins 15K polygons with 10M points. That query started running, got to 99% very fast, and then stayed there. It never finished. I cancelled it and decided to find out what was taking so long. I broke the join down into simple operations, measured the execution time of each of them, and built a cost model based on the size of the input data. Applying the cost model to a set of sample queries showed where the time was spent and helped identify effective optimizations. I went through 3 rounds of optimizations, each reducing the execution time significantly: the first round by 1000x, the second round by 100x, and the final round by 7x. I started with an estimated cost of 2 years and a hung query, and finished with an estimated cost of about 2 minutes and a query that completes in a minute of wall time using a total of 17 min of CPU time.


Baseline: Nested Loop Join with a Filter

I'm turning on broadcast join with the SET SESSION distributed_join=false; command and running EXPLAIN on the query to look at the plan. As expected, there is a cross join node with a filter node on top of it:

- FilterProject[filterPredicate = "st_contains"("st_geometryfromtext"("wkt"), "st_point"("lon", "lat"))] => []
   - CrossJoin => [lat:double, lon:double, wkt:varchar]

The cross join generates all possible combinations of points and polygons, and the filter expression is then evaluated for each combination, i.e. ST_Contains(ST_GeometryFromText(wkt), ST_Point(lon, lat)) is evaluated for every possible pair of a point and a polygon. Let's break this expression down into pieces and examine the cost of each piece.

  1. ST_GeometryFromText parses the WKT string and creates an OGCGeometry object, then serializes it into a Slice (a fancy byte buffer).
    • ParseWktCost - this cost depends on the complexity of the polygon, e.g. the number of points
  2. ST_Point(lon, lat) creates an OGCPoint object and serializes it to a Slice.
    • NewPointCost - fixed cost
  3. ST_Contains partially deserializes the polygon's Slice into an Envelope object representing the bounding box of the polygon.
    • DeserEnvelope - fixed cost
  4. ST_Contains partially deserializes the point's Slice into an Envelope object representing the bounding box of the point.
    • DeserEnvelope - fixed cost
  5. ST_Contains checks whether the bounding box of the polygon contains the bounding box of the point. If not, it returns false.
    • CheckEnvelopesCost - fixed cost
  6. ST_Contains deserializes the Slice into an OGCGeometry object representing the polygon.
    • DeserPolygonCost - this cost depends on the polygon's complexity, e.g. the number of points
  7. ST_Contains deserializes the Slice into an OGCPoint object representing the point.
    • DeserPointCost - fixed cost
  8. ST_Contains checks whether the polygon contains the point.
    • CheckGeometriesCost - this cost depends on the complexity of the polygon, e.g. the number of points

NOTE: Steps 3-5 were introduced very recently in #9798. Later, I'll build and evaluate a cost model that doesn't include these steps to show the effect that optimization has on join performance.
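To make the per-pair work concrete, here is a minimal sketch of what the nested loop join effectively does for every (point, polygon) pair. It is illustrative only: Presto evaluates these functions over serialized slices using the Esri geometry library, while the sketch below uses JTS and made-up class and method names purely to show where the repeated parsing and checks happen.

```java
import java.util.List;
import org.locationtech.jts.geom.Coordinate;
import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.geom.GeometryFactory;
import org.locationtech.jts.geom.Point;
import org.locationtech.jts.io.ParseException;
import org.locationtech.jts.io.WKTReader;

public final class NaiveSpatialJoinSketch
{
    // Emits every (point, polygon) pair where the polygon contains the point,
    // redoing all of the work from scratch for every pair, like the baseline plan does.
    public static void join(List<double[]> lonLats, List<String> polygonWkts)
            throws ParseException
    {
        GeometryFactory factory = new GeometryFactory();
        WKTReader reader = new WKTReader(factory);
        for (double[] lonLat : lonLats) {
            for (String wkt : polygonWkts) {
                // Step 1: parse WKT -- repeated for every point (ParseWktCost)
                Geometry polygon = reader.read(wkt);
                // Step 2: create the point (NewPointCost)
                Point point = factory.createPoint(new Coordinate(lonLat[0], lonLat[1]));
                // Steps 3-5: cheap bounding box rejection; in Presto this is done on the
                // serialized Slices via partial deserialization (DeserEnvelope + CheckEnvelopesCost)
                if (!polygon.getEnvelopeInternal().contains(point.getEnvelopeInternal())) {
                    continue;
                }
                // Steps 6-8: full containment check on the candidate pair (CheckGeometriesCost)
                if (polygon.contains(point)) {
                    System.out.println(wkt + " contains " + lonLat[0] + "," + lonLat[1]);
                }
            }
        }
    }
}
```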

The steps above give us the following formula for the cost of evaluating ST_Contains expression for all pairs of points and polygons.

NumPolygons * NumPoints *
    (ParseWktCost + NewPointCost + DeserEnvelope + DeserEnvelope
        + CheckEnvelopesCost)
+ NumCandidatePairs * (DeserPolygonCost + DeserPointCost + CheckGeometriesCost)

, where NumPolygons is the number of polygons, NumPoints is the number of points, and NumCandidatePairs is the number of (point, polygon) pairs such that the point lies within the polygon's bounding box. Steps 1-5 are executed for all pairs of points and polygons. Steps 6-8 are executed only for pairs that passed the bounding box check in step 5.

I estimated the individual costs above using microbenchmarks. These benchmarks show that ParseWktCost, DeserPolygonCost and CheckGeometriesCost increase linearly with the number of points in the polygon.

| operation | cost (ns) | comments |
| --- | --- | --- |
| ParseWktCost | 400 | per point |
| NewPointCost | 99 | fixed |
| DeserEnvelope | 60 | fixed |
| CheckEnvelopesCost | 4 | fixed |
| DeserPolygonCost | 40 | per point |
| DeserPointCost | 114 | fixed |
| CheckGeometriesCost | 10 | per point |

Since variable costs increase linearly with the number of points in the polygon, I'm going to re-write the formula using AvgNumPointsPerPolygon and XxxCostPerPoint variables:

 NumPolygons * NumPoints * (
    AvgNumPointsPerPolygon * ParseWktCostPerPoint   // WKT parsing
    + NewPointCost + DeserEnvelope + DeserEnvelope + CheckEnvelopesCost)
 + NumCandidatePairs * (DeserPointCost        
    + AvgNumPointsPerPolygon * (DeserPolygonCostPerPoint + CheckGeometriesCostPerPoint))

Now, I'm ready to apply the formula to sample joins. Since I'm looking for a bottleneck, I'm going to split the formula into 3 parts and calculate what percentage of the total cost each part is responsible for.

  • WKT parsing:
    NumPolygons * NumPoints * AvgNumPointsPerPolygon * 400

  • Final evaluation of the candidate pairs (steps 6-8):
    NumCandidatePairs * (114 + AvgNumPointsPerPolygon * 50) // 50 = 40 + 10

  • Other:
    NumPolygons * NumPoints * 223 // = 99 + 60 + 60 + 4

I collected the following samples:

| No | NumPolygons | NumPoints | NumCandidatePairs | AvgPointsPerPolygon |
| --- | --- | --- | --- | --- |
| 1 | 1,000 | 300,000 | 100,000 | 193 |
| 2 | 33 | 1,000,000 | 125,000 | 1,219 |
| 3 | 120 | 1,000,000 | 100,000 | 623 |
| 4 | 15,000 | 10,000,000 | Unknown | 928 |

I don't have the NumCandidatePairs value for the last sample because it is too expensive to compute, so I'll run the formula on a couple of different guesses.
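As a quick sanity check, here is a tiny back-of-the-envelope snippet (hypothetical class name) that plugs sample 4 into the WKT-parsing and "other" parts of the model; the FinalEval term is left out because NumCandidatePairs is only a guess for that sample.

```java
// Rough cost estimate for sample 4 using the per-operation costs measured above.
public final class BaselineCostEstimate
{
    public static void main(String[] args)
    {
        long numPolygons = 15_000;
        long numPoints = 10_000_000;
        long avgPointsPerPolygon = 928;

        // WKT parsing: NumPolygons * NumPoints * AvgNumPointsPerPolygon * 400 ns
        double wktSeconds = numPolygons * numPoints * avgPointsPerPolygon * 400e-9;
        // Other: NumPolygons * NumPoints * (99 + 60 + 60 + 4) ns
        double otherSeconds = numPolygons * numPoints * 223e-9;

        // Prints roughly 55,680,000 sec (~1.8 years) and 33,450 sec
        System.out.printf("WKT parsing: %,.0f sec (~%.1f years)%n",
                wktSeconds, wktSeconds / (365.0 * 24 * 3600));
        System.out.printf("Other: %,.0f sec%n", otherSeconds);
    }
}
```

Running the full formula over all the samples (with both guesses for sample 4) gives: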

| No | NumPolygons | NumPoints | NumCandidatePairs | AvgPointsPerPolygon | WKT (sec) | FinalEval (sec) | Other (sec) | Total (sec) | WKT (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1,000 | 300,000 | 100,000 | 193 | 23,160 | 110.01 | 66.9 | 23,336.91 | 99 |
| 2 | 33 | 1,000,000 | 125,000 | 1,219 | 16,090.8 | 868.54 | 7.36 | 16,966.7 | 95 |
| 3 | 120 | 1,000,000 | 100,000 | 623 | 29,904 | 355.11 | 26.76 | 30,285.87 | 99 |
| 4 | 15,000 | 10,000,000 | 1,000,000 | 928 | 55,680,000 | 5,289.6 | 33,450 | 55,718,739.6 | 100 |
| 4.1 | 15,000 | 10,000,000 | 10,000,000 | 928 | 55,680,000 | 52,896 | 33,450 | 55,766,346 | 100 |

Insight 1. WKT parsing is dominating the cost.

We are parsing WKT for every pair of a point and a polygon, so the WKT for the same polygon is parsed again and again, once for each point. Can we parse the WKT only once per polygon? Absolutely.

Nested Loop with a Filter Optimized by Pushing Down WKT Parsing

Let's rewrite the query and push down ST_GeometryFromText(wkt) projection:

SELECT ...
FROM points, (SELECT ST_GeometryFromText(wkt) as geometry FROM polygons)
WHERE ST_Contains(geometry, ST_Point(lon, lat))

Let's update the cost model and compute new costs for the sample queries.

NumPolygons * AvgNumPointsPerPolygon * ParseWktCostPerPoint   // WKT parsing
+ NumPolygons * NumPoints * (
    NewPointCost + DeserEnvelope + DeserEnvelope + CheckEnvelopesCost)
+ NumCandidatePairs * (DeserPointCost        
    + AvgNumPointsPerPolygon * (DeserPolygonCostPerPoint + CheckGeometriesCostPerPoint))
| No | NumPolygons | NumPoints | NumCandidatePairs | AvgPointsPerPolygon | WKT (sec) | FinalEval (sec) | Other (sec) | Total (sec) | WKT (%) | Final Eval (%) | Other (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1,000 | 300,000 | 100,000 | 193 | 0.08 | 0.98 | 66.9 | 67.95 | 0 | 1 | 98 |
| 2 | 33 | 1,000,000 | 125,000 | 1,219 | 0.02 | 7.63 | 7.36 | 15.01 | 0 | 51 | 49 |
| 3 | 120 | 1,000,000 | 100,000 | 623 | 0.03 | 3.13 | 26.76 | 29.92 | 0 | 10 | 89 |
| 4 | 15,000 | 10,000,000 | 1,000,000 | 928 | 5.57 | 46.51 | 33,450 | 33,502.08 | 0 | 0 | 100 |
| 4.1 | 15,000 | 10,000,000 | 10,000,000 | 928 | 5.57 | 465.14 | 33,450 | 33,920.71 | 0 | 1 | 99 |

The cost dropped by roughly 1,000x, but it is still high in absolute terms, so I'll keep looking.

Custom Execution

The cost of WKT parsing is no longer significant, but the "other" category is. There are 4 operations in the "other" pile that are performed for each pair of points and polygons:

NumPoints * NumPolygons * (NewPointCost + DeserEnvelope + DeserEnvelope + CheckEnvelopesCost)   

These are the creation of points and the partial deserialization of points and polygons. Each point is created (via a call to ST_Point) and partially deserialized as many times as there are polygons. The total cost of creating points is

NumPoints * NumPolygons * NewPointCost

and the cost of partial deserialization of points and polygons is

NumPoints * NumPolygons * DeserEnvelopeCost * 2

These costs add up to:

| No | NumPolygons | NumPoints | Partial deserialization of points and polygons (sec) | New point creation (sec) | Total (sec) |
| --- | --- | --- | --- | --- | --- |
| 1 | 1,000 | 300,000 | 36 | 29.7 | 65.7 |
| 2 | 33 | 1,000,000 | 3.96 | 3.27 | 7.23 |
| 3 | 120 | 1,000,000 | 14.4 | 11.88 | 26.28 |
| 4 | 15,000 | 10,000,000 | 18,000 | 14,850 | 32,850 |
| 4.1 | 15,000 | 10,000,000 | 18,000 | 14,850 | 32,850 |
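For example, for sample 1 there are 1,000 × 300,000 = 300 million pairs, so partial deserialization costs roughly 300,000,000 × 120 ns = 36 sec and point creation roughly 300,000,000 × 99 ns ≈ 29.7 sec, which is where the first row above comes from.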

Insight 2. 53% of the "other" time (120 ns of the 223 ns per-pair cost) is spent on partial deserialization of points and polygons.

Insight 3. 44% of the "other" time (99 ns of the 223 ns per-pair cost) is spent on point creation.

Let's rewrite the query to push down ST_Point and eliminate the excessive cost of point creation.

SELECT ...
FROM (SELECT ST_Point(lon, lat) as point FROM points),
    (SELECT ST_GeometryFromText(wkt) as geometry FROM polygons)
WHERE ST_Contains(geometry, point)

To eliminate the remaining partial deserializations, I implemented custom execution for the join: #9474. In this implementation, each polygon is deserialized from its Slice just once and the full set of polygon objects is organized into an R-Tree. Then each point is deserialized from its Slice just once, the R-Tree is searched for the set of polygons with matching bounding boxes, and a final check is performed on each candidate. We now have the following operations (a code sketch follows the list below):

  • For each polygon:
    • Parse the WKT string
    • Deserialize the Slice into an OGCGeometry object
  • Build an R-Tree
  • For each point:
    • Evaluate ST_Point(lon, lat) to create an OGCPoint object and serialize it to a Slice
    • Deserialize the Slice into an OGCPoint object
    • Query the R-Tree for polygons with matching bounding boxes
    • For each candidate match, check whether the polygon contains the point
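The actual operator in #9474 works directly on Presto's pages and serialized geometry slices; the sketch below is only a minimal illustration of the same build/probe strategy, using JTS's STRtree and made-up class and method names.

```java
import java.util.List;
import org.locationtech.jts.geom.Coordinate;
import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.geom.GeometryFactory;
import org.locationtech.jts.geom.Point;
import org.locationtech.jts.index.strtree.STRtree;
import org.locationtech.jts.io.ParseException;
import org.locationtech.jts.io.WKTReader;

public final class RtreeSpatialJoinSketch
{
    // Build side: parse each polygon exactly once and index it by its bounding box.
    public static STRtree buildIndex(List<String> polygonWkts)
            throws ParseException
    {
        WKTReader reader = new WKTReader();
        STRtree tree = new STRtree();
        for (String wkt : polygonWkts) {
            Geometry polygon = reader.read(wkt);                 // parse once per polygon (ParseWktCost)
            tree.insert(polygon.getEnvelopeInternal(), polygon); // index by bounding box
        }
        tree.build();                                            // BuildRTreeCost
        return tree;
    }

    // Probe side: one point object per row, one index lookup, and exact containment
    // checks only against candidate polygons whose bounding boxes match.
    public static void probe(STRtree tree, List<double[]> lonLats)
    {
        GeometryFactory factory = new GeometryFactory();
        for (double[] lonLat : lonLats) {
            Point point = factory.createPoint(new Coordinate(lonLat[0], lonLat[1])); // NewPointCost
            @SuppressWarnings("unchecked")
            List<Geometry> candidates = tree.query(point.getEnvelopeInternal());     // QueryRTreeCost
            for (Geometry polygon : candidates) {
                if (polygon.contains(point)) {                                       // CheckGeometriesCost
                    System.out.println(polygon + " contains " + point);
                }
            }
        }
    }
}
```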

The new cost model is

NumPolygons * AvgNumPointsPerPolygon * (
    ParseWktCostPerPoint + DeserPolygonCostPerPoint)
+ BuildRTreeCost
+ NumPoints * (NewPointCost + DeserPointCost + QueryRTreeCost)
+ NumCandidatePairs * AvgNumPointsPerPolygon * CheckGeometriesCostPerPoint

, where BuildRTreeCost is the cost of building the R-Tree and QueryRTreeCost is the cost of a single lookup in the R-Tree. I ran a microbenchmark with 15,000 polygons to measure these costs:

  • BuildRTreeCost - 10,000,000 ns
  • QueryRTreeCost - 300 ns

Applying the new model to sample queries gives the following results:

| No | NumPolygons | NumPoints | NumCandidatePairs | AvgPointsPerPolygon | WKT (sec) | FinalEval (sec) | Other (sec) | Total (sec) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1,000 | 300,000 | 100,000 | 193 | 0.08 | 0.19 | 0.17 | 0.44 |
| 2 | 33 | 1,000,000 | 125,000 | 1,219 | 0.02 | 1.52 | 0.52 | 2.06 |
| 3 | 120 | 1,000,000 | 100,000 | 623 | 0.03 | 0.62 | 0.53 | 1.18 |
| 4 | 15,000 | 10,000,000 | 1,000,000 | 928 | 5.57 | 9.28 | 5.7 | 20.54 |
| 4.1 | 15,000 | 10,000,000 | 10,000,000 | 928 | 5.57 | 92.8 | 5.7 | 104.06 |

The cost dropped another 100x (from about 33,000 sec to about 100 sec). These numbers look promising, so I'm ready to try running the production queries again.

| No | NumPolygons | NumPoints | NumCandidatePairs | AvgPointsPerPolygon | WKT (sec) | FinalEval (sec) | Other (sec) | Total (sec) | Total CPU time (actual, min) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| x | 14,930 | 9,560,607 | 6,108,848 | 928 | 5.54 | 56.69 | 5.47 | 67.7 | 15.37 |
| y | 647 | 41,844,107 | 323,335 | 550 | 0.14 | 1.78 | 21.49 | 23.41 | 7.7 |
| z | 134 | 10,839,296 | 392,683 | 241 | 0.01 | 0.95 | 5.57 | 6.53 | 1.75 |
| u | 1,122 | 9,733,033 | 200,450 | 628 | 0.28 | 1.26 | 5.03 | 6.57 | 3.63 |
| w | 5,202 | 9,323,410 | 11,813,304 | 755 | 1.57 | 89.19 | 4.95 | 95.71 | 16.27 |

Production queries completed and rather quickly!

Also, I'm now able to compute NumCandidatePairs for sample query 4 (it corresponds to production query 'x' in the table above). It is about 6,000,000 - roughly in the middle of the two guesses I made.

Partial Deserialization in ST_Contains

We can now estimate the effects of a prior optimization of ST_Contains that introduced partial deserialization (steps 3-5 above). Without steps 3-5 the cost model becomes

NumPolygons * NumPoints * (
    AvgNumPointsPerPolygon * ParseWktCostPerPoint   // WKT parsing
    + NewPointCost + DeserPointCost
    + AvgNumPointsPerPolygon * DeserPolygonCostPerPoint + CheckEnvelopesCost)
+ NumCandidatePairs * AvgNumPointsPerPolygon * CheckGeometriesCostPerPoint

and the results of applying that model to sample queries are:

| No | NumPolygons | NumPoints | NumCandidatePairs | AvgPointsPerPolygon | WKT (sec) | FinalEval (sec) | Other (sec) | Total (sec) | WKT (%) | Final Eval (%) | Other (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1,000 | 300,000 | 100,000 | 193 | 23,160 | 0.19 | 2,381.1 | 25,541.29 | 91 | 0 | 9 |
| 2 | 33 | 1,000,000 | 125,000 | 1,219 | 16,090.8 | 1.52 | 1,616.24 | 17,708.56 | 91 | 0 | 9 |
| 3 | 120 | 1,000,000 | 100,000 | 623 | 29,904 | 0.62 | 3,016.44 | 32,921.06 | 91 | 0 | 9 |
| 4 | 15,000 | 10,000,000 | 1,000,000 | 928 | 55,680,000 | 9.28 | 5,600,550 | 61,280,559.28 | 91 | 0 | 9 |
| 4.1 | 15,000 | 10,000,000 | 10,000,000 | 928 | 55,680,000 | 92.8 | 5,600,550 | 61,280,642.8 | 91 | 0 | 9 |
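To see where the difference comes from, compare the per-pair "other" cost for sample 1: with full deserialization it is 99 + 114 + 193 × 40 + 4 = 7,937 ns per pair, or about 2,381 sec over 300 million pairs, versus 223 ns per pair (66.9 sec) when ST_Contains deserializes only the envelopes.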

We can now compare the following implementations:

  1. No optimizations
  2. Partial deserialization in ST_Contains optimization, no WKT pushdown
  3. WKT pushdown and partial deserialization in ST_Contains
  4. Custom execution
| No | No optimizations (sec) | Only deserialization opt (sec) | Both WKT and deser opt (sec) | Custom execution (sec) |
| --- | --- | --- | --- | --- |
| 1 | 25,541 | 23,337 | 68 | 0 |
| 2 | 17,709 | 16,967 | 15 | 2 |
| 3 | 32,921 | 30,286 | 30 | 1 |
| 4 | 61,280,559 | 55,718,740 | 33,502 | 21 |
| 4.1 | 61,280,643 | 55,766,346 | 33,921 | 104 |

The WKT pushdown shed about 90% of the baseline cost, and the partial deserialization optimization in ST_Contains eliminated most of the remaining 10%.

Use of R-Tree in Custom Execution

Finally, to estimate the benefit of using an R-Tree in custom execution, I'm comparing the cost of a linear search over the polygons' bounding boxes with the cost of R-Tree lookups:

Linear search: NumPoints * NumPolygons * CheckEnvelopesCost

vs.

R-Tree lookups: BuildRTreeCost + NumPoints * QueryRTreeCost
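For sample 4, for instance, the linear scan costs 15,000 × 10,000,000 × 4 ns = 600 sec, while the R-Tree costs 0.01 sec + 10,000,000 × 300 ns ≈ 3 sec.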

Here are the numbers:

| No | NumPolygons | NumPoints | R-Tree lookups (sec) | Linear search (sec) |
| --- | --- | --- | --- | --- |
| 1 | 1,000 | 300,000 | 0.1 | 1.2 |
| 2 | 33 | 1,000,000 | 0.31 | 0.13 |
| 3 | 120 | 1,000,000 | 0.31 | 0.48 |
| 4 | 15,000 | 10,000,000 | 3.01 | 600 |
| 4.1 | 15,000 | 10,000,000 | 3.01 | 600 |

The cost model without using the R-Tree is:

NumPolygons * AvgNumPointsPerPolygon * (
    ParseWktCostPerPoint + DeserPolygonCostPerPoint)
+ NumPoints * (NewPointCost + DeserPointCost + NumPolygons * CheckEnvelopesCost)
+ NumCandidatePairs * AvgNumPointsPerPolygon * CheckGeometriesCostPerPoint

and the costs for sample queries are:

| No | NumPolygons | NumPoints | NumCandidatePairs | AvgPointsPerPolygon | WKT (sec) | FinalEval (sec) | Other (sec) | Total (sec) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1,000 | 300,000 | 100,000 | 193 | 0.08 | 0.19 | 1.27 | 1.54 |
| 2 | 33 | 1,000,000 | 125,000 | 1,219 | 0.02 | 1.52 | 0.35 | 1.89 |
| 3 | 120 | 1,000,000 | 100,000 | 623 | 0.03 | 0.62 | 0.7 | 1.35 |
| 4 | 15,000 | 10,000,000 | 1,000,000 | 928 | 5.57 | 9.28 | 602.69 | 617.53 |
| 4.1 | 15,000 | 10,000,000 | 10,000,000 | 928 | 5.57 | 92.8 | 602.69 | 701.05 |

Comparing the costs of the R-Tree and linear implementations side by side shows that the R-Tree reduces the cost about 7x for sample query 4.1.

| No | With R-Tree (sec) | Without R-Tree (sec) |
| --- | --- | --- |
| 1 | 0.44 | 1.54 |
| 2 | 2.06 | 1.89 |
| 3 | 1.18 | 1.35 |
| 4 | 20.54 | 617.53 |
| 4.1 | 104.06 | 701.05 |
