@camallen camallen commented Dec 8, 2022

When attempting to export data from large tables, the current queries hit the statement timeouts added in #1465.

This is because the current queries (like extract exports) aren't using the index on the workflow_id column: it's not highly selective (only 649 distinct values over ~254,284,520 rows), so the query planner instead chooses an index scan over the PK index (effectively a seq scan) with a filter to pick out the rows matching the workflow we're exporting. These queries don't resolve in a timely manner (query plan below).

explain SELECT  "extracts".* FROM "extracts" WHERE workflow_id = 15831 ORDER BY id asc LIMIT 1000

"Limit  (cost=0.57..6086.81 rows=1000 width=153)"
"  ->  Index Scan using extracts_pkey on extracts  (cost=0.57..11917006.67 rows=1958024 width=153)"
"        Filter: (workflow_id = 15831)"

As such, this PR modifies the current export queries to use a CTE as an 'optimization fence' that forces the query planner to use the relevant indexes to resolve the CTE; we can then page into that data using the .find_each AR batching system. This means the CTE subquery is re-resolved on each .find_each query call, which seems to be fine for the example data load I was testing against (~2M rows, wf_id = 15831) but may not work well for larger numbers of extracts... time will tell.
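
In ActiveRecord terms the change looks roughly like this (a minimal sketch assuming the Extract model and ActiveRecordExtended's .with; the variable names are illustrative, not the exact code in this PR):

workflow_id = 15831 # the example workflow used in the plans below

# Naming the CTE "extracts" (the same as the table) means the outer
# FROM "extracts" resolves to the CTE rather than the base table,
# which is what acts as the optimization fence.
scope = Extract.with(extracts: Extract.where(workflow_id: workflow_id))

# Page through the fenced result set; each batch query issued by
# find_each re-resolves the CTE subquery.
scope.find_each(batch_size: 1000) do |extract|
  # serialize each extract row here, e.g. append it to the CSV
end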

I've benchmarked these queries against prod data loads and they resolve in decent time frames (much better than the current approach):

explain analyze WITH "extracts" AS (
	SELECT "extracts".*
	FROM "extracts"
	WHERE "extracts"."workflow_id" = 15831
) 
SELECT  "extracts".* 
FROM "extracts" 
ORDER BY "extracts"."id" ASC LIMIT 1000 offset 10000

"Limit  (cost=1048640.03..1048642.53 rows=1000 width=145) (actual time=3338.674..3338.761 rows=1000 loops=1)"
"  CTE extracts"
"    ->  Index Scan using index_extracts_on_workflow_id on extracts extracts_1  (cost=0.57..868206.01 rows=1958284 width=153) (actual time=0.029..863.242 rows=1947860 loops=1)"
"          Index Cond: (workflow_id = 15831)"
"  ->  Sort  (cost=180409.03..185304.74 rows=1958284 width=145) (actual time=3338.076..3338.399 rows=11000 loops=1)"
"        Sort Key: extracts.id"
"        Sort Method: top-N heapsort  Memory: 3862kB"
"        ->  CTE Scan on extracts  (cost=0.00..39165.68 rows=1958284 width=145) (actual time=0.031..2319.238 rows=1947860 loops=1)"
"Planning Time: 0.174 ms"
"Execution Time: 3400.993 ms"

and the column header query:

explain analyze WITH "extracts" AS (
	SELECT DISTINCT(jsonb_object_keys(extracts.data)) AS key 
	FROM "extracts" 
	WHERE "extracts"."workflow_id" = 15831 
	AND (jsonb_typeof(extracts.data)='object')
) SELECT "key" FROM "extracts"

"CTE Scan on extracts  (cost=865610.10..865805.92 rows=9791 width=32) (actual time=1736.704..2134.302 rows=5 loops=1)"
"  CTE extracts"
"    ->  Unique  (cost=865561.15..865610.10 rows=9791 width=32) (actual time=1736.702..2134.294 rows=5 loops=1)"
"          ->  Sort  (cost=865561.15..865585.63 rows=9791 width=32) (actual time=1736.701..2021.331 rows=1002284 loops=1)"
"                Sort Key: (jsonb_object_keys(extracts_1.data))"
"                Sort Method: external merge  Disk: 11760kB"
"                ->  Gather  (cost=1000.57..864912.14 rows=9791 width=32) (actual time=5.130..1065.904 rows=1002284 loops=1)"
"                      Workers Planned: 2"
"                      Workers Launched: 2"
"                      ->  ProjectSet  (cost=0.57..862933.04 rows=408000 width=32) (actual time=0.168..809.038 rows=334095 loops=3)"
"                            ->  Parallel Index Scan using index_extracts_on_workflow_id on extracts extracts_1  (cost=0.57..860862.44 rows=4080 width=88) (actual time=0.118..502.210 rows=649287 loops=3)"
"                                  Index Cond: (workflow_id = 15831)"
"                                  Filter: (jsonb_typeof(data) = 'object'::text)"
"Planning Time: 0.299 ms"
"Execution Time: 2138.703 ms"

- ensure the query filtering works as expected on the optimized CTE export scope queries
- use optimized CTE queries to force the use of indexes when exporting data
@yuenmichelle1 yuenmichelle1 left a comment

LGTM

@lcjohnso lcjohnso left a comment

LGTM! Great improvement on the execution time.

Non-blocking question: was the active_record_extended gem added to enable this work, or part of broader dependency updates included here? How is it used? To allow for use of with()? No problem either way, just trying to understand the role of a new dependency.

lcjohnso commented Dec 8, 2022

One special case of export creation (that doesn't seem to be tested) is Intro2Astro exports, which use a subgroup parameter to filter down to a subset of rows in the exported reduction. The original PR enabling this feature is #209.

I don't think the optimization work here directly affects this (that logic lives in the DataRequests controller), so we should be OK. But it would be helpful to add a test (either here, or filed as future work) for this special use case.

camallen commented Dec 8, 2022

> Non-blocking question: was the active_record_extended gem added to enable this work, or part of broader dependency updates included here? How is it used? To allow for use of with()? No problem either way, just trying to understand the role of a new dependency.

Yes - this gem allows us to write a CTE .with statement in ActiveRecord syntax: https://github.com/georgekaraszi/ActiveRecordExtended#common-table-expressions-cte
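
For illustration, the usage is along these lines (a hedged sketch based on the gem's README, not the exact code in this PR):

# .with(name: relation) builds the WITH clause for the query, e.g.
Extract.with(extracts: Extract.where(workflow_id: 15831))
# generates SQL along the lines of:
#   WITH "extracts" AS (
#     SELECT "extracts".* FROM "extracts" WHERE "extracts"."workflow_id" = 15831
#   ) SELECT "extracts".* FROM "extracts"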

I should have included that, along with the link to the Panoptes version of this change: zooniverse/panoptes#3962. The other updates were to get the tests / dev env up to date.

> One special case of export creation (that doesn't seem to be tested) is Intro2Astro exports, which use a subgroup parameter to filter down to a subset of rows in the exported reduction. The original PR enabling this feature is #209.
> I don't think the optimization work here directly affects this (that logic lives in the DataRequests controller), so we should be OK. But it would be helpful to add a test (either here, or filed as future work) for this special use case.

Those add extra filters on the WHERE clause for the export type, and they are still included in this PR at https://github.com/zooniverse/caesar/pull/1480/files#diff-3319898e2620a07a37c8172538782c5e4566a18e0d38d7417e590cc0ae091f6bR83

None of it is tested well - the CSV export class has 0 existing tests, which is strange for quite a complex bit of functionality. It is exercised via the data request worker, but the specific filtering isn't. If I do any work on this again I'll look to add specific tests for the CSV exporter class and the AR resource scope filtering.
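
A spec along these lines might cover the basics (a hypothetical sketch of future work; the :extract factory and its attributes are assumptions, not existing code):

# spec/lib/csv_extract_export_spec.rb - hypothetical future spec
require 'rails_helper'

RSpec.describe 'extract export scope filtering' do
  it 'only returns extracts for the requested workflow' do
    matching = create(:extract, workflow_id: 1)
    create(:extract, workflow_id: 2)

    scope = Extract.with(extracts: Extract.where(workflow_id: 1))
    expect(scope.map(&:id)).to match_array([matching.id])
  end
end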

@camallen camallen merged commit 6a0f8b5 into master Dec 8, 2022
@camallen camallen deleted the optimize-extract-exports branch December 8, 2022 23:01