
Optimize broadcast spatial join #9890

Closed
@mbasmanova

Description

#9834 is about optimizing a wide range of spatial queries. This task is about optimizing a subset of spatial joins where one relation is small enough to allow for a broadcast.

TL;DR A spatial join running as a nested loop join spends practically all of its time deserializing slices into geometry objects. A series of optimizations aimed at reducing the number of deserialization operations has a dramatic effect on performance, transforming a query that couldn't run to completion into one that completes in under a minute of wall time.

I'm interested in optimizing spatial joins, i.e. queries that combine relations based on spatial relationships between geometry objects. At the moment, I'm focusing on joins where one relation is relatively small, fits on a single machine, and therefore can benefit from a broadcast join. In particular, I've been taking a very close look at the performance of a point-in-polygon join, one of the most common spatial joins. Such a join looks like this:

SELECT ...
FROM points, polygons
WHERE ST_Contains(ST_GeometryFromText(wkt), ST_Point(lon, lat))

I tried running these queries on production data as is. The query that I'm going to refer to throughout this note joins 15K polygons with 10M points. That query started running, got to 99% very fast, and then stayed there. It never finished. I cancelled it and decided to find out what was taking so long. I broke the join down into simple operations, measured the execution time of each of them, and built a cost model based on the size of the input data. Applying the cost model to a set of sample queries showed where the time was spent and helped identify effective optimizations. I went through 3 rounds of optimizations, each reducing the execution time significantly: the first round by 1000x, the second round by 100x, and the final round by 7x. I started with an estimated cost of 2 years and a hung query, and finished with an estimated cost of about 2 minutes and a query that completes in a minute of wall time using a total of 17 min of CPU time.


Baseline: Nested Loop Join with a Filter

I'm turning on broadcast join with the SET SESSION distributed_join=false; command and running EXPLAIN on the query to look at the plan. As expected, there is a cross join node with a filter node on top of it:

- FilterProject[filterPredicate = "st_contains"("st_geometryfromtext"("wkt"), "st_point"("lon", "lat"))] => []
   - CrossJoin => [lat:double, lon:double, wkt:varchar]

The cross join generates all possible combinations of points and polygons, and the filter expression is then evaluated for each combination, i.e. ST_Contains(ST_GeometryFromText(wkt), ST_Point(lon, lat)) is evaluated for every possible pair of a point and a polygon. Let's break this expression down into pieces and examine the cost of each piece.

  1. ST_GeometryFromText parses the WKT string and creates an OGCGeometry object, then serializes it into a Slice (a fancy byte buffer).
    • ParseWktCost - this cost depends on the complexity of the polygon, e.g. the number of points
  2. ST_Point(lon, lat) creates an OGCPoint object and serializes it to a Slice.
    • NewPointCost - fixed cost
  3. ST_Contains partially deserializes the polygon's Slice into an Envelope object representing the bounding box of the polygon.
    • DeserEnvelope - fixed cost
  4. ST_Contains partially deserializes the point's Slice into an Envelope object representing the bounding box of the point.
    • DeserEnvelope - fixed cost
  5. ST_Contains checks whether the bounding box of the polygon contains the bounding box of the point. If not, it returns false.
    • CheckEnvelopesCost - fixed cost
  6. ST_Contains deserializes the Slice into an OGCGeometry object representing the polygon.
    • DeserPolygonCost - this cost depends on the polygon's complexity, e.g. the number of points
  7. ST_Contains deserializes the Slice into an OGCPoint object representing the point.
    • DeserPointCost - fixed cost
  8. ST_Contains checks whether the polygon contains the point.
    • CheckGeometriesCost - this cost depends on the complexity of the polygon, e.g. the number of points

NOTE: Steps 3-5 were introduced very recently in #9798. Later, I'll build and evaluate a cost model that doesn't include these steps to show the effect that optimization has on join performance.
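To make the per-pair work concrete, here is a minimal sketch of what the nested loop join effectively does for every (point, polygon) pair. It is illustrative only: Presto evaluates these functions over serialized slices using the Esri geometry library, while the sketch below uses JTS and made-up class and method names purely to show where the repeated parsing and checks happen.

```java
import java.util.List;
import org.locationtech.jts.geom.Coordinate;
import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.geom.GeometryFactory;
import org.locationtech.jts.geom.Point;
import org.locationtech.jts.io.ParseException;
import org.locationtech.jts.io.WKTReader;

public final class NaiveSpatialJoinSketch
{
    // Emits every (point, polygon) pair where the polygon contains the point,
    // redoing all of the work from scratch for every pair, like the baseline plan does.
    public static void join(List<double[]> lonLats, List<String> polygonWkts)
            throws ParseException
    {
        GeometryFactory factory = new GeometryFactory();
        WKTReader reader = new WKTReader(factory);
        for (double[] lonLat : lonLats) {
            for (String wkt : polygonWkts) {
                // Step 1: parse WKT -- repeated for every point (ParseWktCost)
                Geometry polygon = reader.read(wkt);
                // Step 2: create the point (NewPointCost)
                Point point = factory.createPoint(new Coordinate(lonLat[0], lonLat[1]));
                // Steps 3-5: cheap bounding box rejection; in Presto this is done on the
                // serialized Slices via partial deserialization (DeserEnvelope + CheckEnvelopesCost)
                if (!polygon.getEnvelopeInternal().contains(point.getEnvelopeInternal())) {
                    continue;
                }
                // Steps 6-8: full containment check on the candidate pair (CheckGeometriesCost)
                if (polygon.contains(point)) {
                    System.out.println(wkt + " contains " + lonLat[0] + "," + lonLat[1]);
                }
            }
        }
    }
}
```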

The steps above give us the following formula for the cost of evaluating ST_Contains expression for all pairs of points and polygons.

NumPolygons * NumPoints *
    (ParseWktCost + NewPointCost + DeserEnvelope + DeserEnvelope
        + CheckEnvelopesCost)
+ NumCandidatePairs * (DeserPolygonCost + DeserPointCost + CheckGeometriesCost)

, where NumPolygons is the number of polygons, NumPoints is the number of points, and NumCandidatePairs is the number of (point, polygon) pairs such that the point lies within the polygon's bounding box. Steps 1-5 are executed for all pairs of points and polygons. Steps 6-8 are executed only for pairs that passed the bounding box check in step 5.

I estimated the individual costs above using microbenchmarks. These benchmarks show that ParseWktCost, DeserPolygonCost and CheckGeometriesCost increase linearly with the number of points in the polygon.

| operation | cost (ns) | comments |
| --- | --- | --- |
| ParseWktCost | 400 | per point |
| NewPointCost | 99 | fixed |
| DeserEnvelope | 60 | fixed |
| CheckEnvelopesCost | 4 | fixed |
| DeserPolygonCost | 40 | per point |
| DeserPointCost | 114 | fixed |
| CheckGeometriesCost | 10 | per point |

Since variable costs increase linearly with the number of points in the polygon, I'm going to re-write the formula using AvgNumPointsPerPolygon and XxxCostPerPoint variables:

 NumPolygons * NumPoints * (
    AvgNumPointsPerPolygon * ParseWktCostPerPoint   // WKT parsing
    + NewPointCost + DeserEnvelope + DeserEnvelope + CheckEnvelopesCost)
 + NumCandidatePairs * (DeserPointCost        
    + AvgNumPointsPerPolygon * (DeserPolygonCostPerPoint + CheckGeometriesCostPerPoint))

Now, I'm ready to apply the formula to sample joins. Since I'm looking for a bottleneck, I'm going to split the formula into 3 parts and calculate what percentage of the total cost each part is responsible for.

  • WKT parsing:
    NumPolygons * NumPoints * AvgNumPointsPerPolygon * 400

  • Final evaluation of the candidate pairs (steps 6-8):
    NumCandidatePairs * (114 + AvgNumPointsPerPolygon * 50) // 50 = 40 + 10

  • Other:
    NumPolygons * NumPoints * 223 // = 99 + 60 + 60 + 4

I collected the following samples:

| No | NumPolygons | NumPoints | NumCandidatePairs | AvgPointsPerPolygon |
| --- | --- | --- | --- | --- |
| 1 | 1,000 | 300,000 | 100,000 | 193 |
| 2 | 33 | 1,000,000 | 125,000 | 1,219 |
| 3 | 120 | 1,000,000 | 100,000 | 623 |
| 4 | 15,000 | 10,000,000 | Unknown | 928 |

I don't have the NumCandidatePairs value for the last sample because it is too expensive to compute, so I'll run the formula on a couple of different guesses.
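As a quick sanity check, here is a tiny back-of-the-envelope snippet (hypothetical class name) that plugs sample 4 into the WKT-parsing and "other" parts of the model; the FinalEval term is left out because NumCandidatePairs is only a guess for that sample.

```java
// Rough cost estimate for sample 4 using the per-operation costs measured above.
public final class BaselineCostEstimate
{
    public static void main(String[] args)
    {
        long numPolygons = 15_000;
        long numPoints = 10_000_000;
        long avgPointsPerPolygon = 928;

        // WKT parsing: NumPolygons * NumPoints * AvgNumPointsPerPolygon * 400 ns
        double wktSeconds = numPolygons * numPoints * avgPointsPerPolygon * 400e-9;
        // Other: NumPolygons * NumPoints * (99 + 60 + 60 + 4) ns
        double otherSeconds = numPolygons * numPoints * 223e-9;

        // Prints roughly 55,680,000 sec (~1.8 years) and 33,450 sec
        System.out.printf("WKT parsing: %,.0f sec (~%.1f years)%n",
                wktSeconds, wktSeconds / (365.0 * 24 * 3600));
        System.out.printf("Other: %,.0f sec%n", otherSeconds);
    }
}
```

Running the full formula over all the samples (with both guesses for sample 4) gives: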

| No | NumPolygons | NumPoints | NumCandidatePairs | AvgPointsPerPolygon | WKT (sec) | FinalEval (sec) | Other (sec) | Total (sec) | WKT (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1,000 | 300,000 | 100,000 | 193 | 23,160 | 110.01 | 66.9 | 23,336.91 | 99 |
| 2 | 33 | 1,000,000 | 125,000 | 1,219 | 16,090.8 | 868.54 | 7.36 | 16,966.7 | 95 |
| 3 | 120 | 1,000,000 | 100,000 | 623 | 29,904 | 355.11 | 26.76 | 30,285.87 | 99 |
| 4 | 15,000 | 10,000,000 | 1,000,000 | 928 | 55,680,000 | 5,289.6 | 33,450 | 55,718,739.6 | 100 |
| 4.1 | 15,000 | 10,000,000 | 10,000,000 | 928 | 55,680,000 | 52,896 | 33,450 | 55,766,346 | 100 |

Insight 1. WKT parsing is dominating the cost.

We are parsing WKT for every pair of a point and a polygon, so the WKT for the same polygon is parsed again and again, once for each point. Can we parse the WKT only once per polygon? Absolutely.

Nested Loop with a Filter Optimized by Pushing Down WKT Parsing

Let's rewrite the query and push down ST_GeometryFromText(wkt) projection:

SELECT ...
FROM points, (SELECT ST_GeometryFromText(wkt) as geometry FROM polygons)
WHERE ST_Contains(geometry, ST_Point(lon, lat))

Let's update the cost model and compute new costs for the sample queries.

NumPolygons * AvgNumPointsPerPolygon * ParseWktCostPerPoint   // WKT parsing
+ NumPolygons * NumPoints * (
    NewPointCost + DeserEnvelope + DeserEnvelope + CheckEnvelopesCost)
+ NumCandidatePairs * (DeserPointCost        
    + AvgNumPointsPerPolygon * (DeserPolygonCostPerPoint + CheckGeometriesCostPerPoint))
| No | NumPolygons | NumPoints | NumCandidatePairs | AvgPointsPerPolygon | WKT (sec) | FinalEval (sec) | Other (sec) | Total (sec) | WKT (%) | Final Eval (%) | Other (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1,000 | 300,000 | 100,000 | 193 | 0.08 | 0.98 | 66.9 | 67.95 | 0 | 1 | 98 |
| 2 | 33 | 1,000,000 | 125,000 | 1,219 | 0.02 | 7.63 | 7.36 | 15.01 | 0 | 51 | 49 |
| 3 | 120 | 1,000,000 | 100,000 | 623 | 0.03 | 3.13 | 26.76 | 29.92 | 0 | 10 | 89 |
| 4 | 15,000 | 10,000,000 | 1,000,000 | 928 | 5.57 | 46.51 | 33,450 | 33,502.08 | 0 | 0 | 100 |
| 4.1 | 15,000 | 10,000,000 | 10,000,000 | 928 | 5.57 | 465.14 | 33,450 | 33,920.71 | 0 | 1 | 99 |

The cost dropped by roughly 1,000x, but it is still high in absolute terms, so I'll keep looking.

Custom Execution

The cost of WKT parsing is no longer significant, but the "other" category is. There are 4 operations in the "other" pile that are performed for each pair of points and polygons:

NumPoints * NumPolygons * (NewPointCost + DeserEnvelope + DeserEnvelope + CheckEnvelopesCost)   

These are the creation of points and the partial deserialization of points and polygons. Each point is created (via a call to ST_Point) and partially deserialized as many times as there are polygons. The total cost of creating points is

NumPoints * NumPolygons * NewPointCost

and the cost of partial deserialization of points and polygons is

NumPoints * NumPolygons * DeserEnvelopeCost * 2

These costs add up to:

| No | NumPolygons | NumPoints | Partial deserialization of points and polygons (sec) | New point creation (sec) | Total (sec) |
| --- | --- | --- | --- | --- | --- |
| 1 | 1,000 | 300,000 | 36 | 29.7 | 65.7 |
| 2 | 33 | 1,000,000 | 3.96 | 3.27 | 7.23 |
| 3 | 120 | 1,000,000 | 14.4 | 11.88 | 26.28 |
| 4 | 15,000 | 10,000,000 | 18,000 | 14,850 | 32,850 |
| 4.1 | 15,000 | 10,000,000 | 18,000 | 14,850 | 32,850 |
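For example, for sample 1 there are 1,000 × 300,000 = 300 million pairs, so partial deserialization costs roughly 300,000,000 × 120 ns = 36 sec and point creation roughly 300,000,000 × 99 ns ≈ 29.7 sec, which is where the first row above comes from.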

Insight 2. 53% of the "other" time (120 ns of the 223 ns per-pair cost) is spent on partial deserialization of points and polygons.

Insight 3. 44% of the "other" time (99 ns of the 223 ns per-pair cost) is spent on point creation.

Let's rewrite the query to push down ST_Point and eliminate the excessive cost of point creation.

SELECT ...
FROM (SELECT ST_Point(lon, lat) as point FROM points),
    (SELECT ST_GeometryFromText(wkt) as geometry FROM polygons)
WHERE ST_Contains(geometry, point)

To eliminate the remaining partial deserializations, I implemented custom execution for the join: #9474. In this implementation, each polygon is deserialized from its Slice just once and the full set of polygon objects is organized into an R-Tree. Then each point is deserialized from its Slice just once, the R-Tree is searched for the set of polygons with matching bounding boxes, and a final check is performed on each candidate. We now have the following operations (a code sketch follows the list below):

  • For each polygon:
    • Parse the WKT string
    • Deserialize the Slice into an OGCGeometry object
  • Build an R-Tree
  • For each point:
    • Evaluate ST_Point(lon, lat) to create an OGCPoint object and serialize it to a Slice
    • Deserialize the Slice into an OGCPoint object
    • Query the R-Tree for polygons with matching bounding boxes
    • For each candidate match, check whether the polygon contains the point
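The actual operator in #9474 works directly on Presto's pages and serialized geometry slices; the sketch below is only a minimal illustration of the same build/probe strategy, using JTS's STRtree and made-up class and method names.

```java
import java.util.List;
import org.locationtech.jts.geom.Coordinate;
import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.geom.GeometryFactory;
import org.locationtech.jts.geom.Point;
import org.locationtech.jts.index.strtree.STRtree;
import org.locationtech.jts.io.ParseException;
import org.locationtech.jts.io.WKTReader;

public final class RtreeSpatialJoinSketch
{
    // Build side: parse each polygon exactly once and index it by its bounding box.
    public static STRtree buildIndex(List<String> polygonWkts)
            throws ParseException
    {
        WKTReader reader = new WKTReader();
        STRtree tree = new STRtree();
        for (String wkt : polygonWkts) {
            Geometry polygon = reader.read(wkt);                 // parse once per polygon (ParseWktCost)
            tree.insert(polygon.getEnvelopeInternal(), polygon); // index by bounding box
        }
        tree.build();                                            // BuildRTreeCost
        return tree;
    }

    // Probe side: one point object per row, one index lookup, and exact containment
    // checks only against candidate polygons whose bounding boxes match.
    public static void probe(STRtree tree, List<double[]> lonLats)
    {
        GeometryFactory factory = new GeometryFactory();
        for (double[] lonLat : lonLats) {
            Point point = factory.createPoint(new Coordinate(lonLat[0], lonLat[1])); // NewPointCost
            @SuppressWarnings("unchecked")
            List<Geometry> candidates = tree.query(point.getEnvelopeInternal());     // QueryRTreeCost
            for (Geometry polygon : candidates) {
                if (polygon.contains(point)) {                                       // CheckGeometriesCost
                    System.out.println(polygon + " contains " + point);
                }
            }
        }
    }
}
```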

The new cost model is

NumPolygons * AvgNumPointsPerPolygon * (
    ParseWktCostPerPoint + DeserPolygonCostPerPoint)
+ BuildRTreeCost
+ NumPoints * (NewPointCost + DeserPointCost + QueryRTreeCost)
+ NumCandidatePairs * AvgNumPointsPerPolygon * CheckGeometriesCostPerPoint

, where BuildRTreeCost is the cost of building the R-Tree and QueryRTreeCost is the cost of a single lookup in the R-Tree. I ran a microbenchmark with 15,000 polygons to measure these costs:

  • BuildRTreeCost - 10,000,000 ns
  • QueryRTreeCost - 300 ns

Applying the new model to sample queries gives the following results:

| No | NumPolygons | NumPoints | NumCandidatePairs | AvgPointsPerPolygon | WKT (sec) | FinalEval (sec) | Other (sec) | Total (sec) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1,000 | 300,000 | 100,000 | 193 | 0.08 | 0.19 | 0.17 | 0.44 |
| 2 | 33 | 1,000,000 | 125,000 | 1,219 | 0.02 | 1.52 | 0.52 | 2.06 |
| 3 | 120 | 1,000,000 | 100,000 | 623 | 0.03 | 0.62 | 0.53 | 1.18 |
| 4 | 15,000 | 10,000,000 | 1,000,000 | 928 | 5.57 | 9.28 | 5.7 | 20.54 |
| 4.1 | 15,000 | 10,000,000 | 10,000,000 | 928 | 5.57 | 92.8 | 5.7 | 104.06 |

The cost dropped another 100x (from about 33,000 sec to about 100 sec). These numbers look promising, so I'm ready to try running the production queries again.

| No | NumPolygons | NumPoints | NumCandidatePairs | AvgPointsPerPolygon | WKT (sec) | FinalEval (sec) | Other (sec) | Total (sec) | Total CPU time (actual, min) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| x | 14,930 | 9,560,607 | 6,108,848 | 928 | 5.54 | 56.69 | 5.47 | 67.7 | 15.37 |
| y | 647 | 41,844,107 | 323,335 | 550 | 0.14 | 1.78 | 21.49 | 23.41 | 7.7 |
| z | 134 | 10,839,296 | 392,683 | 241 | 0.01 | 0.95 | 5.57 | 6.53 | 1.75 |
| u | 1,122 | 9,733,033 | 200,450 | 628 | 0.28 | 1.26 | 5.03 | 6.57 | 3.63 |
| w | 5,202 | 9,323,410 | 11,813,304 | 755 | 1.57 | 89.19 | 4.95 | 95.71 | 16.27 |

Production queries completed and rather quickly!

Also, I'm now able to compute NumCandidatePairs for sample query 4 (it corresponds to production query 'x' in the table above). It is about 6,000,000 - roughly in the middle of the two guesses I made.

Partial Deserialization in ST_Contains

We can now estimate the effects of a prior optimization of ST_Contains that introduced partial deserialization (steps 3-5 above). Without steps 3-5 the cost model becomes

NumPolygons * NumPoints * (
    AvgNumPointsPerPolygon * ParseWktCostPerPoint   // WKT parsing
    + NewPointCost + DeserPointCost
    + AvgNumPointsPerPolygon * DeserPolygonCostPerPoint + CheckEnvelopesCost)
+ NumCandidatePairs * AvgNumPointsPerPolygon * CheckGeometriesCostPerPoint

and the results of applying that model to sample queries are:

| No | NumPolygons | NumPoints | NumCandidatePairs | AvgPointsPerPolygon | WKT (sec) | FinalEval (sec) | Other (sec) | Total (sec) | WKT (%) | Final Eval (%) | Other (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1,000 | 300,000 | 100,000 | 193 | 23,160 | 0.19 | 2,381.1 | 25,541.29 | 91 | 0 | 9 |
| 2 | 33 | 1,000,000 | 125,000 | 1,219 | 16,090.8 | 1.52 | 1,616.24 | 17,708.56 | 91 | 0 | 9 |
| 3 | 120 | 1,000,000 | 100,000 | 623 | 29,904 | 0.62 | 3,016.44 | 32,921.06 | 91 | 0 | 9 |
| 4 | 15,000 | 10,000,000 | 1,000,000 | 928 | 55,680,000 | 9.28 | 5,600,550 | 61,280,559.28 | 91 | 0 | 9 |
| 4.1 | 15,000 | 10,000,000 | 10,000,000 | 928 | 55,680,000 | 92.8 | 5,600,550 | 61,280,642.8 | 91 | 0 | 9 |
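To see where the difference comes from, compare the per-pair "other" cost for sample 1: with full deserialization it is 99 + 114 + 193 × 40 + 4 = 7,937 ns per pair, or about 2,381 sec over 300 million pairs, versus 223 ns per pair (66.9 sec) when ST_Contains deserializes only the envelopes.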

We can now compare the following implementations:

  1. No optimizations
  2. Partial deserialization in ST_Contains optimization, no WKT pushdown
  3. WKT pushdown and partial deserialization in ST_Contains
  4. Custom execution
| No | No optimizations (sec) | Only deserialization opt (sec) | Both WKT and deser opt (sec) | Custom execution (sec) |
| --- | --- | --- | --- | --- |
| 1 | 25,541 | 23,337 | 68 | 0 |
| 2 | 17,709 | 16,967 | 15 | 2 |
| 3 | 32,921 | 30,286 | 30 | 1 |
| 4 | 61,280,559 | 55,718,740 | 33,502 | 21 |
| 4.1 | 61,280,643 | 55,766,346 | 33,921 | 104 |

The WKT pushdown shed about 90% of the baseline cost, and the partial deserialization optimization in ST_Contains eliminated most of the remaining 10%.

Use of R-Tree in Custom Execution

Finally, to estimate the benefit of using an R-Tree in custom execution, I'm comparing the cost of a linear search over the polygons' bounding boxes with the cost of R-Tree lookups:

Linear search: NumPoints * NumPolygons * CheckEnvelopesCost

vs.

R-Tree lookups: BuildRTreeCost + NumPoints * QueryRTreeCost
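For sample 4, for instance, the linear scan costs 15,000 × 10,000,000 × 4 ns = 600 sec, while the R-Tree costs 0.01 sec + 10,000,000 × 300 ns ≈ 3 sec.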

Here are the numbers:

| No | NumPolygons | NumPoints | R-Tree lookups (sec) | Linear search (sec) |
| --- | --- | --- | --- | --- |
| 1 | 1,000 | 300,000 | 0.1 | 1.2 |
| 2 | 33 | 1,000,000 | 0.31 | 0.13 |
| 3 | 120 | 1,000,000 | 0.31 | 0.48 |
| 4 | 15,000 | 10,000,000 | 3.01 | 600 |
| 4.1 | 15,000 | 10,000,000 | 3.01 | 600 |

The cost model without using the R-Tree is:

NumPolygons * AvgNumPointsPerPolygon * (
    ParseWktCostPerPoint + DeserPolygonCostPerPoint)
+ NumPoints * (NewPointCost + DeserPointCost + NumPolygons * CheckEnvelopesCost)
+ NumCandidatePairs * AvgNumPointsPerPolygon * CheckGeometriesCostPerPoint

and the costs for sample queries are:

| No | NumPolygons | NumPoints | NumCandidatePairs | AvgPointsPerPolygon | WKT (sec) | FinalEval (sec) | Other (sec) | Total (sec) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1,000 | 300,000 | 100,000 | 193 | 0.08 | 0.19 | 1.27 | 1.54 |
| 2 | 33 | 1,000,000 | 125,000 | 1,219 | 0.02 | 1.52 | 0.35 | 1.89 |
| 3 | 120 | 1,000,000 | 100,000 | 623 | 0.03 | 0.62 | 0.7 | 1.35 |
| 4 | 15,000 | 10,000,000 | 1,000,000 | 928 | 5.57 | 9.28 | 602.69 | 617.53 |
| 4.1 | 15,000 | 10,000,000 | 10,000,000 | 928 | 5.57 | 92.8 | 602.69 | 701.05 |

Comparing the costs of the R-Tree and linear implementations side by side shows that the R-Tree reduces the cost about 7x for sample query 4.1.

| No | With R-Tree (sec) | Without R-Tree (sec) |
| --- | --- | --- |
| 1 | 0.44 | 1.54 |
| 2 | 2.06 | 1.89 |
| 3 | 1.18 | 1.35 |
| 4 | 20.54 | 617.53 |
| 4.1 | 104.06 | 701.05 |
