Change to using unbuffered queries for data exports. #2085
Description
This PR changes the Data Export batch processor to use unbuffered queries, which makes it use less memory on the web server.
Because the query is unbuffered, the number of rows can no longer be counted until the query has completed, so this PR removes the debugging statement that prints the row count.
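As a rough illustration of that trade-off (this is not the code in this PR), the sketch below compares fetching a whole result set up front with reading it one row at a time. It is written in Python against an in-memory SQLite table purely so that it runs without a database server; the table `raw_data`, its `payload` column, and all function names are hypothetical stand-ins, and the absolute byte counts will differ from what PHP reports.

```python
import sqlite3
import tracemalloc


def make_connection(num_rows):
    """Build an in-memory table standing in for the raw data (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_data (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany(
        "INSERT INTO raw_data (payload) VALUES (?)",
        (("x" * 100,) for _ in range(num_rows)),
    )
    conn.commit()
    return conn


def export_buffered(conn):
    """Fetch the whole result set first, so the row count is known up front."""
    rows = conn.execute("SELECT id, payload FROM raw_data").fetchall()
    row_count = len(rows)  # available before any row is processed
    exported = 0
    for _row in rows:
        exported += 1
    return row_count, exported


def export_streaming(conn):
    """Read one row at a time; the count is only known after the last row."""
    exported = 0
    for _row in conn.execute("SELECT id, payload FROM raw_data"):
        exported += 1
    return exported


def peak_bytes(func, *args):
    """Return (result, peak traced memory in bytes) for a single call."""
    tracemalloc.start()
    try:
        result = func(*args)
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


if __name__ == "__main__":
    conn = make_connection(20000)
    (_count, _exported), buffered_peak = peak_bytes(export_buffered, conn)
    _streamed, streaming_peak = peak_bytes(export_streaming, conn)
    print("buffered peak bytes: ", buffered_peak)
    print("streaming peak bytes:", streaming_peak)
```

In the buffered version the peak grows with the number of rows, while in the streaming version it stays roughly flat, at the cost of not knowing the row count until the loop finishes.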
This PR also removes the lines from the raw data REST endpoint in `WarehouseControllerProvider` that set the unbuffered query mode, since this is now handled by the `BatchDataset` code that is used by both the raw data endpoint and the Data Export batch processor.
Motivation and Context
This fixes a bug in which the batch processor can run out of memory on the web server when a large amount of raw data is requested for export.
Tests performed
On my developer port of `xdmod-dev`, I added a debugging message to the end of `batch_export_manager.php` that prints the value of `memory_get_peak_usage()`, and I made various sizes of Data Export requests and ran the script. For the old buffered query, the peak memory usage scaled up as the number of days to export increased. For the new unbuffered query, the peak memory usage stayed at around 6–7 MB even as the number of days to export increased.
There had been a data export on prod that caused it to run out of memory and crash; I tried this same export on my dev port and confirmed it worked. The parameters were:
I also tested to make sure the `/rest/warehouse/raw-data` endpoint still works.
Checklist: