-
Notifications
You must be signed in to change notification settings - Fork 69
Description
I load a dataset of ~11.3 M points into DuckDB using the following sql:
CREATE TABLE my_points AS
FROM 'input_data.csv';
INSTALL spatial;
LOAD spatial;
ALTER TABLE my_points ADD COLUMN geom GEOMETRY;
UPDATE my_points SET geom = ST_Point(geo_lon, geo_lat);To reproduce, you can download the input_data.csv from here.
Then, I set the threads to one and run the following sql:
SET threads TO 1;
EXPLAIN ANALYSE SELECT count(*) FROM my_points WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((38.364258 56.992883, 35.81543 55.739482, 37.705078 52.975108, 39.155273 54.622978, 44.494629 53.566414, 43.59375 56.255557, 38.364258 56.992883))'));here is the query profile:
┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││ Total Time: 1.06s ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌───────────────────────────┐
│ QUERY │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ EXPLAIN_ANALYZE │
│ ──────────────────── │
│ 0 Rows │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ UNGROUPED_AGGREGATE │
│ ──────────────────── │
│ Aggregates: │
│ count_star() │
│ │
│ 1 Rows │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ FILTER │
│ ──────────────────── │
│ ST_Intersects(geom, '\x02 │
│ \x04\x00\x00\x00\x00\x00 │
│ \x00\x00C\x0FB\x82\xE6SB │
│ \x81\xFA1B\xB7\xF8cB\x02 │
│ \x00\x00\x00\x01\x00\x00 │
│ \x00\x07\x00\x00\x00\x00 │
│ \x00\x00\x007\xA7\x92\x01 │
│ \xA0.C@\x8F\xE0F\xCA\x16 │
│ \x7FL@\xB1\x16\x9F\x02` │
│ \xE8A@\x87\xFD\x9EX\xA7 │
│ \xDEK@\x86\x90\xF3\xFE? │
│ \xDAB@\xB3\x08\xC5V\xD0|J@│
│ \xD5yT\xFC\xDF\x93C@R\x10<│
│ \xBE\xBDOK@\x9CS\xC9\x00P │
│ ?F@\x8F\xFF\x02A\x80\xC8J@│
│ \x00\x00\x00\x00\x00\xCCE@│
│ \xC7\xA1~\x17\xB6 L@7\xA7 │
│ \x92\x01\xA0.C@\x8F\xE0F │
│ \xCA\x16\x7FL@'::GEOMETRY)│
│ │
│ 2061759 Rows │
│ (0.96s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ TABLE_SCAN │
│ ──────────────────── │
│ my_points │
│ │
│ Projections: geom │
│ │
│ 11358150 Rows │
│ (0.10s) │
└───────────────────────────┘
Next I make an RTree index using:
CREATE INDEX my_idx ON my_points USING RTREE (geom);and I run the same query again. here is the query profile with the index:
┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││ Total Time: 4.69s ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌───────────────────────────┐
│ QUERY │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ EXPLAIN_ANALYZE │
│ ──────────────────── │
│ 0 Rows │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ UNGROUPED_AGGREGATE │
│ ──────────────────── │
│ Aggregates: │
│ count_star() │
│ │
│ 1 Rows │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ FILTER │
│ ──────────────────── │
│ ST_Intersects(geom, '\x02 │
│ \x04\x00\x00\x00\x00\x00 │
│ \x00\x00C\x0FB\x82\xE6SB │
│ \x81\xFA1B\xB7\xF8cB\x02 │
│ \x00\x00\x00\x01\x00\x00 │
│ \x00\x07\x00\x00\x00\x00 │
│ \x00\x00\x007\xA7\x92\x01 │
│ \xA0.C@\x8F\xE0F\xCA\x16 │
│ \x7FL@\xB1\x16\x9F\x02` │
│ \xE8A@\x87\xFD\x9EX\xA7 │
│ \xDEK@\x86\x90\xF3\xFE? │
│ \xDAB@\xB3\x08\xC5V\xD0|J@│
│ \xD5yT\xFC\xDF\x93C@R\x10<│
│ \xBE\xBDOK@\x9CS\xC9\x00P │
│ ?F@\x8F\xFF\x02A\x80\xC8J@│
│ \x00\x00\x00\x00\x00\xCCE@│
│ \xC7\xA1~\x17\xB6 L@7\xA7 │
│ \x92\x01\xA0.C@\x8F\xE0F │
│ \xCA\x16\x7FL@'::GEOMETRY)│
│ │
│ 2061759 Rows │
│ (0.52s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ TABLE_SCAN │
│ ──────────────────── │
│ my_points (RTREE INDEX │
│ SCAN : my_idx) │
│ │
│ Projections: geom │
│ │
│ 2394763 Rows │
│ (4.17s) │
└───────────────────────────┘
I am wondering if this a bug in RTree, or I am using the spatial extension wrong.
The other strange thing that I observed is that if I run the same sql when the DuckDB session that is persisted (i.e., DuckDB started with a database file), then the non-indexed query performs the same, but the query that uses the index is even slower. here is a query profile for the same query when the database is persisted:
┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││ Total Time: 7.58s ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌───────────────────────────┐
│ QUERY │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ EXPLAIN_ANALYZE │
│ ──────────────────── │
│ 0 Rows │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ UNGROUPED_AGGREGATE │
│ ──────────────────── │
│ Aggregates: │
│ count_star() │
│ │
│ 1 Rows │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ FILTER │
│ ──────────────────── │
│ ST_Intersects(geom, '\x02 │
│ \x04\x00\x00\x00\x00\x00 │
│ \x00\x00C\x0FB\x82\xE6SB │
│ \x81\xFA1B\xB7\xF8cB\x02 │
│ \x00\x00\x00\x01\x00\x00 │
│ \x00\x07\x00\x00\x00\x00 │
│ \x00\x00\x007\xA7\x92\x01 │
│ \xA0.C@\x8F\xE0F\xCA\x16 │
│ \x7FL@\xB1\x16\x9F\x02` │
│ \xE8A@\x87\xFD\x9EX\xA7 │
│ \xDEK@\x86\x90\xF3\xFE? │
│ \xDAB@\xB3\x08\xC5V\xD0|J@│
│ \xD5yT\xFC\xDF\x93C@R\x10<│
│ \xBE\xBDOK@\x9CS\xC9\x00P │
│ ?F@\x8F\xFF\x02A\x80\xC8J@│
│ \x00\x00\x00\x00\x00\xCCE@│
│ \xC7\xA1~\x17\xB6 L@7\xA7 │
│ \x92\x01\xA0.C@\x8F\xE0F │
│ \xCA\x16\x7FL@'::GEOMETRY)│
│ │
│ 2061759 Rows │
│ (0.24s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ TABLE_SCAN │
│ ──────────────────── │
│ my_points (RTREE INDEX │
│ SCAN : my_idx) │
│ │
│ Projections: geom │
│ │
│ 2394763 Rows │
│ (7.34s) │
└───────────────────────────┘
As I have enough RAM that the index can totally fit in the RAM, I don't understand why it's slower when I run it when database is persisted.
If this is a specific scenario where an RTree index is not preferred, I think it might be best to detect it in query planning and don't use an index for that query.
Version of DuckDB: v1.1.2 f680b7d08f
p.s.: I have 128GB of RAM in my computer and I set the threads specifically to 1, otherwise the non-indexed query is even faster cause it processes the data in parallel while the query plan that uses the RTree index is being executed sequentially.