@camallen camallen commented Dec 8, 2022

When attempting to export data from large tables, the current queries hit the statement timeouts added in #1465.

This is because the current queries (like extract exports) aren't using the index on the workflow_id column: it's not highly selective (only 649 distinct values over ~254,284,520 rows), so the query planner instead chooses an index scan over the PK index (effectively a seq scan) with a filter to pick out the rows matching the workflow we're exporting. These queries don't resolve in a timely manner (query plan below).

explain SELECT  "extracts".* FROM "extracts" WHERE workflow_id = 15831 ORDER BY id asc LIMIT 1000

"Limit  (cost=0.57..6086.81 rows=1000 width=153)"
"  ->  Index Scan using extracts_pkey on extracts  (cost=0.57..11917006.67 rows=1958024 width=153)"
"        Filter: (workflow_id = 15831)"

As such, this PR modifies the current export queries to use a CTE as an 'optimization fence' that forces the query planner to use the relevant indexes to resolve the CTE; we can then page into that data using the .find_each AR batching system. This means the CTE subquery is re-resolved on each .find_each query call, which seems to be fine for the example data load I was testing against (~2M rows, wf_id = 15831) but may not work well for larger numbers of extracts... time will tell.
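
In ActiveRecord terms the change looks roughly like this (a minimal sketch assuming the Extract model and ActiveRecordExtended's .with; the variable names are illustrative, not the exact code in this PR):

workflow_id = 15831 # the example workflow used in the plans below

# Naming the CTE "extracts" (the same as the table) means the outer
# FROM "extracts" resolves to the CTE rather than the base table,
# which is what acts as the optimization fence.
scope = Extract.with(extracts: Extract.where(workflow_id: workflow_id))

# Page through the fenced result set; each batch query issued by
# find_each re-resolves the CTE subquery.
scope.find_each(batch_size: 1000) do |extract|
  # serialize each extract row here, e.g. append it to the CSV
end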

I've benchmarked these queries against prod data loads and they resolve in decent time frames (much better than the current approach):

explain analyze WITH "extracts" AS (
	SELECT "extracts".*
	FROM "extracts"
	WHERE "extracts"."workflow_id" = 15831
) 
SELECT  "extracts".* 
FROM "extracts" 
ORDER BY "extracts"."id" ASC LIMIT 1000 offset 10000

"Limit  (cost=1048640.03..1048642.53 rows=1000 width=145) (actual time=3338.674..3338.761 rows=1000 loops=1)"
"  CTE extracts"
"    ->  Index Scan using index_extracts_on_workflow_id on extracts extracts_1  (cost=0.57..868206.01 rows=1958284 width=153) (actual time=0.029..863.242 rows=1947860 loops=1)"
"          Index Cond: (workflow_id = 15831)"
"  ->  Sort  (cost=180409.03..185304.74 rows=1958284 width=145) (actual time=3338.076..3338.399 rows=11000 loops=1)"
"        Sort Key: extracts.id"
"        Sort Method: top-N heapsort  Memory: 3862kB"
"        ->  CTE Scan on extracts  (cost=0.00..39165.68 rows=1958284 width=145) (actual time=0.031..2319.238 rows=1947860 loops=1)"
"Planning Time: 0.174 ms"
"Execution Time: 3400.993 ms"

and the column header query:

explain analyze WITH "extracts" AS (
	SELECT DISTINCT(jsonb_object_keys(extracts.data)) AS key 
	FROM "extracts" 
	WHERE "extracts"."workflow_id" = 15831 
	AND (jsonb_typeof(extracts.data)='object')
) SELECT "key" FROM "extracts"

"CTE Scan on extracts  (cost=865610.10..865805.92 rows=9791 width=32) (actual time=1736.704..2134.302 rows=5 loops=1)"
"  CTE extracts"
"    ->  Unique  (cost=865561.15..865610.10 rows=9791 width=32) (actual time=1736.702..2134.294 rows=5 loops=1)"
"          ->  Sort  (cost=865561.15..865585.63 rows=9791 width=32) (actual time=1736.701..2021.331 rows=1002284 loops=1)"
"                Sort Key: (jsonb_object_keys(extracts_1.data))"
"                Sort Method: external merge  Disk: 11760kB"
"                ->  Gather  (cost=1000.57..864912.14 rows=9791 width=32) (actual time=5.130..1065.904 rows=1002284 loops=1)"
"                      Workers Planned: 2"
"                      Workers Launched: 2"
"                      ->  ProjectSet  (cost=0.57..862933.04 rows=408000 width=32) (actual time=0.168..809.038 rows=334095 loops=3)"
"                            ->  Parallel Index Scan using index_extracts_on_workflow_id on extracts extracts_1  (cost=0.57..860862.44 rows=4080 width=88) (actual time=0.118..502.210 rows=649287 loops=3)"
"                                  Index Cond: (workflow_id = 15831)"
"                                  Filter: (jsonb_typeof(data) = 'object'::text)"
"Planning Time: 0.299 ms"
"Execution Time: 2138.703 ms"

- ensure the query filtering works as expected on the optimized CTE export scope queries
- use optimized CTE queries to force the use of indexes when exporting data
@yuenmichelle1 yuenmichelle1 left a comment

LGTM

@lcjohnso lcjohnso left a comment

LGTM! Great improvement on the execution time.

Non-blocking question: was the active_record_extended gem added to enable this work, or part of broader dependency updates included here? How is it used? To allow for use of with()? No problem either way, just trying to understand the role of a new dependency.

lcjohnso commented Dec 8, 2022

One special case of export creation (that doesn't seem to be tested) is Intro2Astro exports, which use a subgroup parameter to filter down to a subset of rows in the exported reduction. The original PR enabling this feature is #209.

I don't think the optimization work here directly affects this (that logic lives in the DataRequests controller), so we should be OK. But it would be helpful to add a test (either here, or filed as future work) for this special use case.

camallen commented Dec 8, 2022

> Non-blocking question: was the active_record_extended gem added to enable this work, or part of broader dependency updates included here? How is it used? To allow for use of with()? No problem either way, just trying to understand the role of a new dependency.

Yes - this gem allows us to write a CTE .with statement in ActiveRecord syntax: https://github.com/georgekaraszi/ActiveRecordExtended#common-table-expressions-cte
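
For illustration, the usage is along these lines (a hedged sketch based on the gem's README, not the exact code in this PR):

# .with(name: relation) builds the WITH clause for the query, e.g.
Extract.with(extracts: Extract.where(workflow_id: 15831))
# generates SQL along the lines of:
#   WITH "extracts" AS (
#     SELECT "extracts".* FROM "extracts" WHERE "extracts"."workflow_id" = 15831
#   ) SELECT "extracts".* FROM "extracts"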

I should have included that, along with the link to the Panoptes version of this change: zooniverse/panoptes#3962. The other updates were to get the tests / dev env up to date.

> One special case of export creation (that doesn't seem to be tested) is Intro2Astro exports, which use a subgroup parameter to filter down to a subset of rows in the exported reduction. The original PR enabling this feature is #209.
> I don't think the optimization work here directly affects this (that logic lives in the DataRequests controller), so we should be OK. But it would be helpful to add a test (either here, or filed as future work) for this special use case.

Those add extra filters on the WHERE clause for the export type, and they are still included in this PR at https://github.com/zooniverse/caesar/pull/1480/files#diff-3319898e2620a07a37c8172538782c5e4566a18e0d38d7417e590cc0ae091f6bR83

None of it is tested well - the CSV export class has 0 existing tests, which is strange for quite a complex bit of functionality. It is exercised via the data request worker, but the specific filtering isn't. If I do any work on this again I'll look to add specific tests for the CSV exporter class and the AR resource scope filtering.
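
A spec along these lines might cover the basics (a hypothetical sketch of future work; the :extract factory and its attributes are assumptions, not existing code):

# spec/lib/csv_extract_export_spec.rb - hypothetical future spec
require 'rails_helper'

RSpec.describe 'extract export scope filtering' do
  it 'only returns extracts for the requested workflow' do
    matching = create(:extract, workflow_id: 1)
    create(:extract, workflow_id: 2)

    scope = Extract.with(extracts: Extract.where(workflow_id: 1))
    expect(scope.map(&:id)).to match_array([matching.id])
  end
end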

@camallen camallen merged commit 6a0f8b5 into master Dec 8, 2022
@camallen camallen deleted the optimize-extract-exports branch December 8, 2022 23:01