WIP: gracefully handle cancellation, allow running in parallel #60

micha030201 · 2025-07-01T15:01:29Z

I am using pyOSOAA to run a lot of simulations, and I would really like to parallelize them to use all of my CPU cores. I found two problems while doing that, and this pull request tries to address both. They are similar in nature, so I'm grouping them into one pull request.

Firstly, pyOSOAA creates the result directories before running the OSOAA code, that is, before there are actually any results. As such, if the program is interrupted, or if OSOAA errors out for some reason, the next time pyOSOAA is run with the same parameters it will see that the results directory for that hash is already there, will try to read the results without running OSOAA, and will fail because, while the directory exists, the results in it either don't exist or are incomplete. The errors look like this:

  File "/<...>/pyOSOAA/osoaa.py", line 1403, in test
    s.run()
    ~~~~~^^
  File "/<...>/pyOSOAA/osoaa.py", line 1391, in run
    self.outputs = OUTPUTS(self.resroot, self.results)
                   ~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<...>/pyOSOAA/outputs.py", line 497, in __init__
    self.vsvza = VSVZA(resroot, filenames.vsvza)
                 ~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<...>/pyOSOAA/outputs.py", line 68, in __init__
    with open(os.path.join(resroot, 'Standard_outputs', filename),
         ~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
              encoding="iso-8859-15") as file:
              ^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/<...>/results/bf7fcad502c7c7f59488d4d8bff968b7/Standard_outputs/LUM_vsVZA.txt'

Secondly, OSOAA itself caches aerosol Mie, hydrosol Mie, and surface files in the directories set by AER.DirMie, HYD.DirMie, and SEA.Dir, respectively. The files are created the first time the simulation is run with a given set of aerosol, hydrosol, or wind parameters, and are then reused if the simulation is run again with those parameters. If OSOAA is interrupted while creating those files, they will be corrupted and it will not be able to read them next time, resulting in an error. Additionally, when running OSOAA in parallel, one process might try to read or write a file that another process is currently writing, resulting in the same error. The errors look like this:

 ==> Angles calculation
 ==> Aerosols radiative properties computation
 Aerosols --> Mie files repertory : /<...>/DATABASE/MIE_AER
 Aerosols --> Mie files repertory : /<...>/DATABASE/MIE_AER
 ==> Hydrosols radiative properties computation
 Hydrosols --> Mie files repertory : /<...>/DATABASE/MIE_HYD
 ==> Atmospheric and sea profiles computation
 ==> Sea / atmosphere interface matrices computation
 Surface matrices repertory : /<...>/DATABASE/SURF_MATR
 Matrix RAA : RAA-1.340-10.0-RadMU48-NB80-SZA60.000-TSZA40.262
 -- RAA Matrix file is being calculated
   OSOAA_MISE_FORMAT : ERROR_991 on a file opening
   OSOAA_SURFACE_CASE : ERROR_998  
       on subroutine OSOAA_MISE_FORMAT
       for case : RAA
   OSOAA_SURFACE : ERROR_3000  
       on subroutine OSOAA_SURFACE_RAA
   OSOAA_MAIN : ERROR_8000 on subroutine OSOAA_SURFACE

I have fixed the first problem by creating temporary directories (using Python's tempfile module) and copying the results into the final results cache directory only after the execution has finished and the results have been successfully read by pyOSOAA. The options relating to the caching behavior (forcerun, cleanup) should still function as before.
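To illustrate the idea, here is a minimal sketch of the stage-then-publish pattern. It is not the code in this pull request: `run_with_staging`, `produce`, and `validate` are hypothetical names, and the sketch assumes it is acceptable to create the scratch directory next to the cache and publish it with a rename rather than a copy.

```python
import os
import shutil
import tempfile


def run_with_staging(cache_dir, produce, validate):
    """Run `produce` in a scratch directory; publish to `cache_dir` only on success."""
    if os.path.isdir(cache_dir):
        # A published directory only ever appears after validation succeeded.
        return cache_dir
    parent = os.path.dirname(os.path.abspath(cache_dir))
    os.makedirs(parent, exist_ok=True)
    # Stage next to the cache so the final rename stays on one filesystem.
    staging = tempfile.mkdtemp(prefix=".staging-", dir=parent)
    try:
        produce(staging)   # hypothetical: run the simulation, writing into `staging`
        validate(staging)  # hypothetical: read the outputs, raising if incomplete
        try:
            os.rename(staging, cache_dir)
        except OSError:
            # Another process may have published the same results first.
            if not os.path.isdir(cache_dir):
                raise
            shutil.rmtree(staging, ignore_errors=True)
    except BaseException:
        # Covers errors and KeyboardInterrupt: never leave partial results behind.
        shutil.rmtree(staging, ignore_errors=True)
        raise
    return cache_dir
```

With this shape, an interrupted or failed run leaves no directory under the final name, so the next run with the same parameters starts from scratch instead of trying to read missing files.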

I have not yet finished addressing the second problem. OSOAA can't be instructed to create those files in a different directory than the one it reads them from, so it is not possible to have it read the existing database, save new files into the temporary directory, and then copy them into the database. A potential solution would be to copy the entire database each time OSOAA is run and then add the new files (if any) back to the main database, but this is undesirable because the database can grow quite large. Currently I'm thinking about checking the parameters the cached files depend on, "predicting" the files that will be necessary, and copying just those files to the temporary directory, if they exist. For example, here is a description of what the surface file names are based on (in French). Some of them (sea index, wind speed) are just input parameters, and the rest can hopefully be calculated easily from the input parameters. I will investigate that further.
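As a rough sketch of that last idea (not implemented yet), the two steps could look like the following. The function names and the `predicted_names` argument are hypothetical; how to derive the file names from the input parameters is exactly the part that still needs investigating, so the sketch simply takes them as given. It also assumes a flat database directory and that a rename within one directory is atomic on the target platform.

```python
import os
import shutil
import tempfile


def stage_cached_files(database_dir, scratch_dir, predicted_names):
    """Copy the predicted files that already exist into the scratch directory."""
    os.makedirs(scratch_dir, exist_ok=True)
    staged = set()
    for name in predicted_names:
        source = os.path.join(database_dir, name)
        if os.path.isfile(source):
            shutil.copy2(source, os.path.join(scratch_dir, name))
            staged.add(name)
    return staged


def publish_new_files(database_dir, scratch_dir, staged):
    """Add files created during the run to the database, one rename per file."""
    os.makedirs(database_dir, exist_ok=True)
    published = []
    for name in sorted(os.listdir(scratch_dir)):
        source = os.path.join(scratch_dir, name)
        if name in staged or not os.path.isfile(source):
            continue
        # Copy under a temporary name inside the database, then rename, so
        # other processes never see a half-written file under the final name.
        fd, partial = tempfile.mkstemp(prefix=".partial-", dir=database_dir)
        os.close(fd)
        try:
            shutil.copy2(source, partial)
            os.replace(partial, os.path.join(database_dir, name))
        except BaseException:
            if os.path.exists(partial):
                os.remove(partial)
            raise
        published.append(name)
    return published
```

Each process would call `stage_cached_files` before the run and `publish_new_files` only after the run succeeded, so the shared database is never written to by OSOAA directly.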

I have also refactored the code a little to use Python's subprocess module instead of creating a shell script and running it, which removes the dependency on having ksh installed on the system, and I have improved error reporting. I have not tested it on Windows yet, but will do so before submitting this pull request as finished work.
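For reference, the general shape of such a call is sketched below. This is an illustration rather than the exact code in the commits: `run_program` and `SimulationError` are hypothetical names, and the choice to include the last ten lines of standard output in the error message is an assumption based on the fact that the OSOAA error codes shown above appear in its normal output.

```python
import subprocess


class SimulationError(RuntimeError):
    """Raised when the external program exits with a non-zero status."""


def run_program(executable, args, cwd=None):
    """Run an executable directly (no shell) and return its standard output."""
    completed = subprocess.run(
        [executable, *args],
        cwd=cwd,
        capture_output=True,
        text=True,
        errors="replace",  # tolerate output that is not valid in the locale encoding
    )
    if completed.returncode != 0:
        # Keep the end of the log, where the error lines are expected to be.
        tail = "\n".join(completed.stdout.splitlines()[-10:])
        raise SimulationError(
            f"{executable} exited with status {completed.returncode}\n"
            f"{tail}\n{completed.stderr}"
        )
    return completed.stdout
```

Passing the arguments as a list avoids shell quoting issues and works without ksh or any other shell being installed.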

I am submitting this pull request now as a work-in-progress. Please tell me if you have any comments or insights into this matter, or if there are any changes you would like me to make to eventually get this merged. Hopefully I can help make pyOSOAA better and easier to use, including in cases of parallel execution and possible cancellation.

Cheers, and thank you for your work!

micha030201 added 2 commits June 29, 2025 16:42

some error handling

4b9ea1b

use temporary directories

3749ccc
