WIP: gracefully handle cancellation, allow running in parallel #60
I am using pyOSOAA to run a lot of simulations, and I would really like to parallelize them to use all of my computer cores. There are two problems I found while doing that which I'm trying to address. They are similar in nature, so I'm clumping them into one pull request.
Firstly, pyOSOAA creates the result directories before running the OSOAA code, that is, before there are any results. As a consequence, if the program is interrupted, or if OSOAA errors out for some reason, the next time pyOSOAA is run with the same parameters it will see that the results directory for that hash already exists and will try to read the results without running OSOAA. It will then fail because, although the directory exists, the results in it are either missing or incomplete. The errors look like this:
Secondly, OSOAA itself stores aerosol Mie, hydrosol Mie, and surface files (AER.DirMie, HYD.DirMie, and SEA.Dir, respectively). The files are created the first time the simulation is run with a given set of aerosol, hydrosol, or wind parameters, and are then reused if the simulation is run again with those parameters. If OSOAA is interrupted while creating those files, they will be corrupted and OSOAA will not be able to read them next time, resulting in an error. Additionally, when running OSOAA in parallel, one process might try to read or write a file that another process is currently writing, resulting in the same error. The errors look like this:
I have fixed the first problem by creating temporary directories (using Python's tempfile module) and copying the results into the final resultscache directory after the execution has finished and the results have been successfully read by pyOSOAA. The options relating to the caching behavior (forcerun, cleanup) should still function as normal.
I have not yet finished addressing the second problem. OSOAA can't be instructed to create those files in a different directory than the one it reads them from, so it is not possible to have it read the existing database, save new files into the temporary directory, and then copy them into the database. A potential solution would be to copy the entire database each time OSOAA is run and then add the new files (if any) back to the main database, but this is undesirable because the database can grow quite large. Currently I'm thinking about checking the parameters the cached files depend on, "predicting" the files that will be necessary, and copying just those files to the temporary directory if they exist. For example, here is a description of what the surface file names are based on (in French). Some of them (sea index, wind speed) are just input parameters, and the rest can hopefully be calculated easily from the input parameters. I will investigate that further.
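To make the two ideas above concrete, here is a minimal sketch of the pattern, not the actual code in this pull request. All names (`run_with_temp_results`, `stage_cached_files`, `run_hash`, `simulate`, `predicted_names`) are hypothetical and are not part of pyOSOAA or OSOAA. The sketch also publishes the results with a rename instead of a copy, which is a swapped-in choice to keep the final step on one filesystem, and it assumes the list of predicted database file names is computed elsewhere.

```python
import shutil
import tempfile
from pathlib import Path


def stage_cached_files(database_dir, work_dir, predicted_names):
    """Copy only those predicted database files that already exist.

    `predicted_names` is assumed to be computed elsewhere from the
    input parameters; this function does not know how files are named.
    """
    staged = []
    for name in predicted_names:
        src = Path(database_dir) / name
        if src.is_file():
            shutil.copy2(src, Path(work_dir) / name)
            staged.append(name)
    return staged


def run_with_temp_results(cache_dir, run_hash, simulate):
    """Run `simulate(work_dir)` in a temporary directory and publish
    the results under `cache_dir/run_hash` only if it succeeds."""
    cache_dir = Path(cache_dir)
    final_dir = cache_dir / run_hash
    if final_dir.is_dir():
        return final_dir  # complete results are already cached
    cache_dir.mkdir(parents=True, exist_ok=True)
    # The temporary directory lives inside the cache directory so that
    # the final rename never has to cross a filesystem boundary.
    with tempfile.TemporaryDirectory(dir=cache_dir, prefix=".tmp-") as tmp:
        work_dir = Path(tmp) / "results"
        work_dir.mkdir()
        simulate(work_dir)  # may raise; nothing is published in that case
        try:
            work_dir.rename(final_dir)
        except OSError:
            # Another process may have published the same hash first;
            # keep its results and discard ours.
            if not final_dir.is_dir():
                raise
    return final_dir
```

If `simulate` raises or the process is interrupted, the temporary directory is removed and no directory for that hash appears in the cache, so a later run with the same parameters starts from scratch rather than reading incomplete results.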
I have also refactored the code a little to use Python's subprocess module instead of creating a shell script and running it. This removes the dependency on having ksh installed on the system. I have improved error reporting as well. I have not tested the changes on Windows yet, but will do so before submitting this pull request as finished work.
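As an illustration of the subprocess approach, here is a small sketch of running an external program without a shell and turning a non-zero exit status into a readable error. The helper name `run_command` is hypothetical, and the actual OSOAA executable and its arguments are deliberately not shown; the usage example below runs the Python interpreter instead.

```python
import subprocess
import sys


def run_command(args, cwd=None):
    """Run a command given as an argument list (no shell involved).

    Returns the captured standard output, or raises RuntimeError with
    the exit status and standard error if the command fails.
    """
    result = subprocess.run(args, cwd=cwd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{args[0]} exited with status {result.returncode}:\n"
            f"{result.stderr.strip()}"
        )
    return result.stdout


if __name__ == "__main__":
    # Stand-in command; a real call would pass the simulation executable
    # and its arguments as a list.
    print(run_command([sys.executable, "-c", "print('ok')"]).strip())
```

Passing the arguments as a list avoids shell quoting issues and does not require any particular shell to be installed.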
I am submitting this pull request now as a work-in-progress. Please tell me if you have any comments or insights into this matter, or if there are any changes you would like me to make to eventually get this merged. Hopefully I can help make pyOSOAA better and easier to use, including in cases of parallel execution and possible cancellation.
Cheers, and thank you for your work!