Skip to content

RTree index results in significantly slower execution #444

@Meysam-

Description

@Meysam-

I load a dataset of ~11.3 M points into DuckDB using the following sql:

CREATE TABLE my_points AS
    FROM 'input_data.csv';

INSTALL spatial;
LOAD spatial;

ALTER TABLE my_points ADD COLUMN geom GEOMETRY;
UPDATE my_points SET geom = ST_Point(geo_lon, geo_lat);

To reproduce, you can download the input_data.csv from here.

Then, I set the threads to one and run the following sql:

SET threads TO 1;

EXPLAIN ANALYSE SELECT count(*) FROM my_points WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((38.364258 56.992883, 35.81543 55.739482, 37.705078 52.975108, 39.155273 54.622978, 44.494629 53.566414, 43.59375 56.255557, 38.364258 56.992883))'));

here is the query profile:

┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││               Total Time: 1.06s              ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌───────────────────────────┐
│           QUERY           │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      EXPLAIN_ANALYZE      │
│    ────────────────────   │
│           0 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│    UNGROUPED_AGGREGATE    │
│    ────────────────────   │
│        Aggregates:        │
│        count_star()       │
│                           │
│           1 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           FILTER          │
│    ────────────────────   │
│ ST_Intersects(geom, '\x02 │
│  \x04\x00\x00\x00\x00\x00 │
│  \x00\x00C\x0FB\x82\xE6SB │
│  \x81\xFA1B\xB7\xF8cB\x02 │
│  \x00\x00\x00\x01\x00\x00 │
│  \x00\x07\x00\x00\x00\x00 │
│ \x00\x00\x007\xA7\x92\x01 │
│  \xA0.C@\x8F\xE0F\xCA\x16 │
│  \x7FL@\xB1\x16\x9F\x02`  │
│  \xE8A@\x87\xFD\x9EX\xA7  │
│  \xDEK@\x86\x90\xF3\xFE?  │
│ \xDAB@\xB3\x08\xC5V\xD0|J@│
│ \xD5yT\xFC\xDF\x93C@R\x10<│
│ \xBE\xBDOK@\x9CS\xC9\x00P │
│ ?F@\x8F\xFF\x02A\x80\xC8J@│
│ \x00\x00\x00\x00\x00\xCCE@│
│ \xC7\xA1~\x17\xB6 L@7\xA7 │
│  \x92\x01\xA0.C@\x8F\xE0F │
│ \xCA\x16\x7FL@'::GEOMETRY)│
│                           │
│        2061759 Rows       │
│          (0.96s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         TABLE_SCAN        │
│    ────────────────────   │
│         my_points         │
│                           │
│     Projections: geom     │
│                           │
│       11358150 Rows       │
│          (0.10s)          │
└───────────────────────────┘

Next I make an RTree index using:

CREATE INDEX my_idx ON my_points USING RTREE (geom);

and I run the same query again. here is the query profile with the index:

┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││               Total Time: 4.69s              ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌───────────────────────────┐
│           QUERY           │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      EXPLAIN_ANALYZE      │
│    ────────────────────   │
│           0 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│    UNGROUPED_AGGREGATE    │
│    ────────────────────   │
│        Aggregates:        │
│        count_star()       │
│                           │
│           1 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           FILTER          │
│    ────────────────────   │
│ ST_Intersects(geom, '\x02 │
│  \x04\x00\x00\x00\x00\x00 │
│  \x00\x00C\x0FB\x82\xE6SB │
│  \x81\xFA1B\xB7\xF8cB\x02 │
│  \x00\x00\x00\x01\x00\x00 │
│  \x00\x07\x00\x00\x00\x00 │
│ \x00\x00\x007\xA7\x92\x01 │
│  \xA0.C@\x8F\xE0F\xCA\x16 │
│  \x7FL@\xB1\x16\x9F\x02`  │
│  \xE8A@\x87\xFD\x9EX\xA7  │
│  \xDEK@\x86\x90\xF3\xFE?  │
│ \xDAB@\xB3\x08\xC5V\xD0|J@│
│ \xD5yT\xFC\xDF\x93C@R\x10<│
│ \xBE\xBDOK@\x9CS\xC9\x00P │
│ ?F@\x8F\xFF\x02A\x80\xC8J@│
│ \x00\x00\x00\x00\x00\xCCE@│
│ \xC7\xA1~\x17\xB6 L@7\xA7 │
│  \x92\x01\xA0.C@\x8F\xE0F │
│ \xCA\x16\x7FL@'::GEOMETRY)│
│                           │
│        2061759 Rows       │
│          (0.52s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         TABLE_SCAN        │
│    ────────────────────   │
│   my_points (RTREE INDEX  │
│       SCAN : my_idx)      │
│                           │
│     Projections: geom     │
│                           │
│        2394763 Rows       │
│          (4.17s)          │
└───────────────────────────┘

I am wondering if this a bug in RTree, or I am using the spatial extension wrong.
The other strange thing that I observed is that if I run the same sql when the DuckDB session that is persisted (i.e., DuckDB started with a database file), then the non-indexed query performs the same, but the query that uses the index is even slower. here is a query profile for the same query when the database is persisted:

┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││               Total Time: 7.58s              ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌───────────────────────────┐
│           QUERY           │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      EXPLAIN_ANALYZE      │
│    ────────────────────   │
│           0 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│    UNGROUPED_AGGREGATE    │
│    ────────────────────   │
│        Aggregates:        │
│        count_star()       │
│                           │
│           1 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           FILTER          │
│    ────────────────────   │
│ ST_Intersects(geom, '\x02 │
│  \x04\x00\x00\x00\x00\x00 │
│  \x00\x00C\x0FB\x82\xE6SB │
│  \x81\xFA1B\xB7\xF8cB\x02 │
│  \x00\x00\x00\x01\x00\x00 │
│  \x00\x07\x00\x00\x00\x00 │
│ \x00\x00\x007\xA7\x92\x01 │
│  \xA0.C@\x8F\xE0F\xCA\x16 │
│  \x7FL@\xB1\x16\x9F\x02`  │
│  \xE8A@\x87\xFD\x9EX\xA7  │
│  \xDEK@\x86\x90\xF3\xFE?  │
│ \xDAB@\xB3\x08\xC5V\xD0|J@│
│ \xD5yT\xFC\xDF\x93C@R\x10<│
│ \xBE\xBDOK@\x9CS\xC9\x00P │
│ ?F@\x8F\xFF\x02A\x80\xC8J@│
│ \x00\x00\x00\x00\x00\xCCE@│
│ \xC7\xA1~\x17\xB6 L@7\xA7 │
│  \x92\x01\xA0.C@\x8F\xE0F │
│ \xCA\x16\x7FL@'::GEOMETRY)│
│                           │
│        2061759 Rows       │
│          (0.24s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         TABLE_SCAN        │
│    ────────────────────   │
│   my_points (RTREE INDEX  │
│       SCAN : my_idx)      │
│                           │
│     Projections: geom     │
│                           │
│        2394763 Rows       │
│          (7.34s)          │
└───────────────────────────┘

As I have enough RAM that the index can totally fit in the RAM, I don't understand why it's slower when I run it when database is persisted.
If this is a specific scenario where an RTree index is not preferred, I think it might be best to detect it in query planning and don't use an index for that query.

Version of DuckDB: v1.1.2 f680b7d08f

p.s.: I have 128GB of RAM in my computer and I set the threads specifically to 1, otherwise the non-indexed query is even faster cause it processes the data in parallel while the query plan that uses the RTree index is being executed sequentially.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions