Optimize csv data exports #1480
Conversation
allow us to create CTE queries
ensure the query filtering works as expected on the optimized CTE export scope queries
use optimized CTE queries to force the use of indexes when exporting data
yuenmichelle1
left a comment
LGTM
lcjohnso
left a comment
LGTM! Great improvement on the execution time.
Non-blocking question: was the active_record_extended gem added to enable this work, or was it part of broader dependency updates included here? How is it used? To allow the use of `with()`? No problem either way, just trying to understand the role of a new dependency.
One special case of export creation (that doesn't seem to be tested) is Intro2Astro exports, which uses a […]. I don't think the optimization work here directly affects this (that logic appears in the DataRequests controller), so we should be OK. But it would be helpful to add a test (either here, or filed as an issue for future work) for this special use case.
Yes - this gem allows us to write a CTE via `with()`. I should have included that and the link to the Panoptes version: zooniverse/panoptes#3962. The other updates were to get the tests / dev env up to date.
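For reference, a minimal sketch of the gem's `with()` API; the models and columns here are hypothetical placeholders, not Caesar's code:

```ruby
# Minimal illustration of active_record_extended's CTE support.
# Post/User and the column names are made-up placeholders.
recent = Post.where("created_at > ?", 1.week.ago)

# Generates roughly:
#   WITH "recent_posts" AS (SELECT "posts".* FROM "posts" WHERE ...)
#   SELECT "users".* FROM "users" JOIN recent_posts ON ...
User.with(recent_posts: recent)
    .joins("JOIN recent_posts ON recent_posts.user_id = users.id")
```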
These add extra filters on the where clause for the export type, and they are still included in this PR at https://github.com/zooniverse/caesar/pull/1480/files#diff-3319898e2620a07a37c8172538782c5e4566a18e0d38d7417e590cc0ae091f6bR83. None of it is tested well: the CSV export class has 0 existing tests, which is strange for quite a complex bit of functionality. It is exercised via the data request worker, but the specific filtering isn't. If I do any work on this again I'll look to add specific tests for the CSV exporter class and the AR resource scope filtering (a possible shape is sketched below).
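If those tests do get written, they might look something like this sketch. The `export_scope` name and the FactoryBot factories are assumptions for illustration, not the current code:

```ruby
# Hypothetical spec for a CTE-backed export scope; Extract.export_scope
# and the :extract/:workflow factories are assumed names.
require "rails_helper"

RSpec.describe Extract do
  describe ".export_scope" do
    let(:workflow)  { create(:workflow) }
    let!(:matching) { create(:extract, workflow: workflow) }
    let!(:other)    { create(:extract) }

    it "only returns rows for the requested workflow" do
      expect(Extract.export_scope(workflow.id)).to contain_exactly(matching)
    end
  end
end
```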
When attempting to export data from large tables, the current queries are hitting the statement timeouts added in #1465.

This is because the current queries (like extract exports) don't use the index on the `workflow_id` column: it's not highly selective (only 649 distinct values over ~254,284,520 rows), so the query planner chooses an index scan over the PK (effectively a seq scan) with a filter to extract the rows that match the workflow we're exporting. These queries don't resolve in a timely manner (query plan below).

As such, this PR modifies the current export queries to use a CTE as an 'optimization fence' that forces the query planner to use the relevant indexes to resolve the CTE, and we then page into that data using the `.find_each` AR batching system. This means the CTE subquery will resolve on each `.find_each` query call, which seems to be fine for the example data load I was testing against, ~2M rows (`wf_id = 15831`), but may not work well for larger numbers of extracts... time will tell.

I've benchmarked these queries (and the column header query) on prod data loads, and they seem to resolve in decent time frames - much better than the current approach.
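To make the approach concrete, here is a rough sketch of the technique described above. The scope shape and names like `filtered_extracts` are illustrative assumptions, not the PR's actual code (see the diff for that):

```ruby
# Illustrative sketch of the CTE 'optimization fence' approach; names and
# structure are assumptions, not the exact code from this PR.
# Uses active_record_extended's .with (Rails < 7.1 has no native equivalent).
filtered = Extract.where(workflow_id: workflow_id)

scope = Extract
  .with(filtered_extracts: filtered) # planner resolves the CTE first, via the workflow_id index
  .joins("JOIN filtered_extracts ON filtered_extracts.id = extracts.id")

# find_each pages the outer query in PK order; note the CTE is re-resolved
# on every batch query it issues.
scope.find_each(batch_size: 1_000) do |extract|
  # stream each row into the CSV here
end
```

One caveat worth flagging: on PostgreSQL 12 and later, a CTE that is referenced only once can be inlined by the planner unless it is written as `WITH ... AS MATERIALIZED`, so whether the CTE still acts as a fence depends on the server version and the generated SQL.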