Add workaround so dask arrays are optimized in Delayed writing #3082
Conversation
Codecov Report
All modified and coverable lines are covered by tests ✅

@@           Coverage Diff           @@
##             main    #3082   +/-   ##
=======================================
  Coverage   96.14%   96.14%
=======================================
  Files         383      383
  Lines       55798    55800       +2
=======================================
+ Hits        53649    53651       +2
  Misses       2149     2149
Pull Request Test Coverage Report for Build 13913813285
💛 - Coveralls
For my own understanding, you say memory use has increased slightly due to more parallel tasks - does that mean that a user could, if needed, reduce memory use again by limiting the number of parallel tasks?
Yes, but that would be done in the traditional way of limiting the number of dask workers or reducing chunk size. This PR just makes it more likely that more mini-tasks (individual numpy function calls) are executed within one merged task, using more memory. That's my best guess at least.
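For instance, a hedged sketch of those traditional knobs (not from this PR; the array and reduction below are placeholders for a real writing task):

```python
import dask
import dask.array as da

# Placeholder for a real Satpy writing Delayed (e.g. from
# scn.save_datasets(..., compute=False)); a reduction stands in here.
arr = da.random.random((8192, 8192), chunks=2048)
task = arr.sum()

# Fewer scheduler workers -> fewer chunk tasks in flight -> lower peak memory.
task.compute(scheduler="threads", num_workers=2)

# Smaller chunks reduce per-task memory at the cost of more (smaller) tasks;
# this config key only affects arrays created with chunks="auto" afterwards.
dask.config.set({"array.chunk-size": "32MiB"})
```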
LGTM, makes sense
FYI recent changes in dask should make this unnecessary. That said, I was advised to replace Array -> Delayed type operations with equivalent array reductions (`map_blocks`, `reduce`, etc.). I haven't gotten an answer yet about an exact replacement for some of the use cases in Satpy.
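For example, a hypothetical per-chunk writer built on `da.map_blocks` (the `_write_chunk` helper below is made up for illustration) keeps everything an Array operation, so the normal array optimizations still apply:

```python
import numpy as np
import dask.array as da


def _write_chunk(block, block_info=None):
    """Hypothetical per-chunk writer: persist ``block``, return a tiny marker."""
    # block_info[None]["array-location"] gives this chunk's position, which a
    # real writer could use to place the chunk in the output file.
    return np.zeros((1,) * block.ndim, dtype=bool)


arr = da.random.random((4096, 4096), chunks=1024)
markers = da.map_blocks(_write_chunk, arr, chunks=(1, 1), dtype=bool)
markers.compute()  # stays an Array end-to-end, so task fusion applies
```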
This speeds up writers like "simple_image" significantly.
This is the simplest workaround I could think of that works for all cases. This addresses the problem described in my dask discourse discussion post here.
Essentially we're saying "I know this is a Delayed object, but look at the task graph as a series of Array operations, not basic Delayed tasks". This means more complex graph optimizations are performed by dask, rather than only the simple culling-style optimization done for Delayed objects.
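This isn't the exact code from the PR, but a minimal sketch of that idea (class and helper names are hypothetical): subclass `Delayed` so its graph is optimized by dask.array's optimizer instead of Delayed's cull-only pass.

```python
import dask.array as da
from dask.array.optimization import optimize as array_optimize
from dask.delayed import Delayed


class ArrayOptimizedDelayed(Delayed):
    """Hypothetical Delayed whose graph is optimized as Array tasks.

    Delayed normally only culls unused tasks; routing optimization through
    dask.array's optimizer also fuses chains of chunk-level tasks.
    """

    __dask_optimize__ = staticmethod(array_optimize)


def as_array_optimized(delayed_obj):
    # Rewrap an existing Delayed (e.g. from da.store(..., compute=False))
    # without changing its graph or output key.
    return ArrayOptimizedDelayed(delayed_obj.key, delayed_obj.__dask_graph__())
```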
Note: This does not fix the rare issue caused by `da.store` pre-optimizing for Arrays when satpy users combine multiple writing (`.save_datasets`) calls into a single call using `compute_writer_results`. In cases like that, some Array tasks are re-computed because the pre-optimized tasks were renamed/merged/fused separately and dask thinks they are separate unique tasks. This comes down to dask/dask#8380 and dask/dask#9732 (and my discussion post linked above) and the fact that they are not resolved upstream (yet 🤞).

As noted on Slack, for a simple single MODIS band saved to PNG this sped things up from 54s to 39s, with only a slight but expected increase in memory as more tasks are computed in parallel and in bigger groups of operations... or at least I think that's why.
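To make the `compute_writer_results` caveat above concrete, the problem scenario looks roughly like this (the input filename is hypothetical; the reader and writer names are real Satpy ones):

```python
from satpy import Scene
from satpy.writers import compute_writer_results

# Hypothetical MODIS L1B input file.
scn = Scene(reader="modis_l1b", filenames=["MOD021KM.A2023001.0000.061.hdf"])
scn.load(["1"])  # band "1"

# Two independent save_datasets calls, each deferring computation.
res_png = scn.save_datasets(writer="simple_image", compute=False)
res_tif = scn.save_datasets(writer="geotiff", compute=False)

# Computing both together shares inputs, but tasks that da.store already
# pre-optimized per call can be renamed so dask ends up recomputing work.
compute_writer_results([res_png, res_tif])
```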
Add your name to `AUTHORS.md` if not there already