
Add WorkflowTemplate run-fv3gfs for segmented long runs #576

Merged (65 commits) on Aug 26, 2020

Conversation

oliverwm1
Contributor

@oliverwm1 oliverwm1 commented Aug 19, 2020

Long fv3gfs runs generate large netCDF files that are difficult to post-process, and runs cannot be restarted partway through if they fail. This PR introduces an argo WorkflowTemplate, run-fv3gfs, that can do segmented fv3gfs runs along with automated post-processing and appending of zarr outputs.

Limitations of current implementation:

  • Time chunk size must evenly divide the run length if doing more than one segment
  • The entrypoint argo workflow must define the workdir volume and restart-data volume
  • The fv3config object cannot be changed by the user between segments (e.g. for the nudge-to-obs case, this means all analysis files for the entire simulation will need to be specified as assets for each segment; if these files are on a mounted volume, the config could point directly to them instead of downloading from GCS)
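The chunk-divisibility constraint above can be validated before a run starts. A minimal sketch (the function name and signature are hypothetical, not part of this PR):

```python
def check_time_chunks(run_length: int, time_chunk_size: int, segment_count: int) -> None:
    # Hypothetical pre-flight check for the constraint above: with more than
    # one segment, the run length must contain a whole number of time chunks
    # so that appended zarr chunks line up on segment boundaries.
    if segment_count > 1 and run_length % time_chunk_size != 0:
        raise ValueError(
            f"time chunk size {time_chunk_size} does not evenly divide "
            f"run length {run_length}"
        )


check_time_chunks(run_length=96, time_chunk_size=8, segment_count=4)  # ok
```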

Added public API:

  • run-fv3gfs argo WorkflowTemplate
  • append_run.py script in post_process_run image

Significant internal changes:

  • post_process.py can now output to a local path
  • nudging WorkflowTemplate now uses run-fv3gfs. This has the added bonus that nudging run outputs will now be rechunked.
  • Some small changes to the e2e test to accommodate the above change

Requirement changes:

  • fsspec, gcsfs and pip are now installed in the post_process_run image

  • Tests added

Resolves [VCMML-446]

TODO:

  • Change append_run implementation so input is not modified (do work in tempdir)
  • Clean workdir as segmented run proceeds, so artifacts from previous segments don't get copied into later segments
  • Think about how this will work for nudge-to-obs (config object needs to change from segment to segment to get correct files, or prepare config script needs to make sure the analysis files for all segments are specified in the first config)
  • Handle time encoding for variables
  • Make nudging workflow use this, so that it gets exercised by e2e test
  • Ensure later segments don't run if a segment fails (probably need to switch to using recursion to handle this, see argo#3664)
  • Add docs
  • Figure out how to pass other mounted volumes to this (e.g. restarts or analysis files for nudging runs)
  • (optional) Move artifacts into more useful bins (restarts, config, logging?)
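On the "use recursion so later segments don't run after a failure" TODO item, a heavily hedged sketch of the recursive pattern referenced in argo#3664 (all template and parameter names here are hypothetical; exact argo expression syntax may differ by version):

```yaml
# Hypothetical sketch: a steps template that runs one segment and then calls
# itself. Steps only proceed when the previous step succeeds, so a failed
# segment stops the chain; the segment step is assumed to emit an output
# parameter counting the segments still remaining.
- name: run-all-segments
  steps:
    - - name: run-segment
        template: run-segment
    - - name: next-segment
        template: run-all-segments
        when: "{{steps.run-segment.outputs.parameters.segments-remaining}} != 0"
```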

@@ -34,7 +34,7 @@ def get_chunks(user_chunks: ChunkSpec) -> ChunkSpec:


 def upload_dir(d, dest):
-    subprocess.check_call(["gsutil", "-m", "rsync", "-r", d, dest])
+    subprocess.check_call(["gsutil", "-m", "rsync", "-r", "-e", d, dest])
Contributor Author

Don't follow symlinks.

Member

@spencerkclark spencerkclark left a comment

This looks very clean and the time logic all looks great. I mainly just had a few questions and renaming suggestions.

Resolved review threads on: workflows/argo/README.md (2), workflows/post_process_run/append_run.py (5), workflows/argo/run-fv3gfs.yaml (2), workflows/post_process_run/test_append_run.py (1)
Contributor

@nbren12 nbren12 left a comment

Thanks! This is a pretty sophisticated cloud workflow, and a good demonstration that we are on the bleeding edge of running models in the cloud. It's a big feature so I have a few requested changes.

The argo mostly looks good to me, but could benefit from clearer naming and documentation IMO. I made some suggestions below.

Likewise, the appending implementation looks good to me overall, but I am concerned that the tests are too tightly scoped to the implementation details. Conversely, there isn't good coverage of some of the code in append_run. I think part of this issue is that the shift_store and set_time_units_like interfaces are pretty low-level. I think an append_zarr(dest_path, src_path) would be more convenient and testable.
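For context on what a shift_store-style step involves: zarr stores each chunk under a dotted index key, so appending a segment along a dimension amounts to renaming the new segment's chunk keys before copying them into the target store. A hypothetical, dependency-free sketch of that renaming (not the PR's actual code):

```python
def shift_chunk_key(key: str, axis: int, n_shift: int) -> str:
    # A zarr chunk key like "0.0.0" names the chunk's index along each
    # dimension. If the target store already has n_shift chunks along
    # `axis`, the appended segment's chunks slot in after them.
    indices = key.split(".")
    indices[axis] = str(int(indices[axis]) + n_shift)
    return ".".join(indices)


# e.g. the first time chunk of a new segment, appended after 3 existing chunks:
shift_chunk_key("0.0.0", axis=0, n_shift=3)  # → "3.0.0"
```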

volumes:
- name: restart-data
  persistentVolumeClaim:
    claimName: '{{workflow.parameters.restart-pvc}}'
Contributor

Why did you remove this option? We had it so that we could change the name of the restart-pvc. I guess this hasn't changed yet, but it could.

Contributor Author

I don't think the e2e workflow needs this to be parametrized. Agreed it should be for the nudging workflow, I'll change it there.

Resolved review threads on: workflows/argo/README.md (3), workflows/argo/nudging/nudging.yaml (1), workflows/post_process_run/append_run.py (4)


@pytest.mark.parametrize("with_coords", [True, False])
def test_appending_shifted_zarr_gives_expected_ds(tmpdir, with_coords):
Contributor

Similarly, this test seems to test the integration of shift_store, _copy_tree, and some other lower-level interfaces. I suppose the append_run function is doing something similar, but that implementation could change, so it's not clear if this test is actually testing something relevant.

If this combination of units is important, then can you create a single public API, test it, and then use it inside of append_run?

Contributor Author

Yeah, the reason for the complexity of this test (and the lack of a single public API for this functionality) is that I wanted to have only one upload_dir call in append_run, instead of uploading each zarr (and the other items) serially. I was also trying to avoid interaction with GCS in the test.

I guess if there are enough files in each zarr, then doing the uploads serially won't be much of a penalty in terms of taking advantage of multi-threaded upload. I will need to make some changes to ensure all non-zarr items are uploaded with one gsutil call.

Contributor Author

There is now a public append_zarr_along_time function that is directly tested.

Contributor

Ok. It does seem like shift_store is a good entrypoint for the tests, since upload_dir requires remote interaction.

Although you could mock upload_dir to work with local files (e.g. use copy_tree instead of upload_dir). That would be a quick route to 100% coverage, and local output may be a useful feature in its own right, e.g. for post-processing on HPC.

Contributor Author

Actually, gsutil can copy from local to local, so we can use it in tests without remote interaction!

Contributor Author

Although maybe there is still remote interaction for auth? Not sure, but either way, it works fine in the test I wrote.

@oliverwm1
Contributor Author

This is ready for re-review. The most substantive change I made was introducing a new append_zarr_along_time function. This will be useful if we want to use this zarr-appending capability in contexts outside of run-fv3gfs. (Although we should probably add some more safety checks if/when we use it elsewhere.)
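Besides moving chunks into place, an append-along-time operation also has to grow the array shape recorded in each array's .zarr metadata so readers see the combined array. A hypothetical sketch of that metadata step (the function name is illustrative, not the PR's API; zarr v2 records this in a `.zarray` JSON document):

```python
def extend_shape_along_axis(zarray_meta: dict, axis: int, n_appended: int) -> dict:
    # Hypothetical sketch: after the appended segment's chunks are copied in,
    # the "shape" entry of the .zarray metadata must grow along the append
    # axis; chunk sizes are unchanged.
    meta = dict(zarray_meta)
    shape = list(meta["shape"])
    shape[axis] += n_appended
    meta["shape"] = shape
    return meta


# e.g. appending 8 more timesteps to an (8, 6, 6) array chunked (4, 6, 6):
extend_shape_along_axis({"shape": [8, 6, 6], "chunks": [4, 6, 6]}, axis=0, n_appended=8)
```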

Member

@spencerkclark spencerkclark left a comment

Thanks @oliverwm1, this looks good to me now!

Comment on lines 62 to 74
|Parameter| Description|
|-------- |-------------|
| fv3config | String representation of an fv3config object |
| runfile | String representation of an fv3gfs runfile |
| output-url | GCS url for outputs |
| fv3gfs-image | Docker image used to run model. Must have fv3gfs-wrapper and fv3config installed. |
| post-process-image | Docker image used to post-process and upload outputs |
| chunks | (optional) String describing desired chunking of diagnostics |
| cpu | (optional) Requested cpu for run-model step |
| memory | (optional) Requested memory for run-model step |
| segment-count | (optional) Number of segments to run |
| working-volume-name | (optional) Name of volume for temporary work. Volume claim must be made prior to start of run-fv3gfs workflow. |
| external-volume-name | (optional) Name of volume with external data required for model run. E.g. for restart data in a nudged run. |
Member

nit: could pass this through a Markdown table formatter so it looks nice in plain text too.

Contributor Author

Oh nice, thanks!

Contributor

@nbren12 nbren12 left a comment

Thanks for the changes. I like the new public API, and its tests are good enough to cover the cases of _shift_array, so at least we can feel free to delete that test if it becomes obsolete. TBH I still do think we can delete it now despite its initial usefulness for development, but that's up to you.

The docs look great.

I have a couple suggestions below about names. The only "required change" is to make the dimension name an argument of "append_zarr_along_time".

source_store = zarr.open(source_path, mode="r+")
target_store = zarr.open_consolidated(fsspec.get_mapper(target_path))
_set_time_units_like(source_store, target_store)
_shift_store(source_store, "time", _get_dim_size(target_store, "time"))
Contributor

I realize it does have to be a time-like coordinate, but can you make the dimension name an argument in case it is named something other than "time"? This would also allow you to remove the hard-coded name "time" in the tests.
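The _set_time_units_like call in the snippet above has to reconcile CF-style time encodings between the two stores before values can be compared or appended. A hypothetical stdlib-only sketch of that conversion, for units of the form "<unit> since <ISO timestamp>" (names here are illustrative, not the PR's helpers):

```python
from datetime import datetime

# Seconds per CF time unit; a hypothetical helper table, not the PR's API.
_SECONDS_PER_UNIT = {"seconds": 1, "minutes": 60, "hours": 3600, "days": 86400}


def convert_time_value(value: float, src_units: str, dest_units: str) -> float:
    """Re-express a CF-style time value in the destination store's units,
    roughly what _set_time_units_like must do for each time entry."""
    def parse(units: str):
        unit, _, origin = units.partition(" since ")
        return _SECONDS_PER_UNIT[unit], datetime.fromisoformat(origin.strip())

    src_scale, src_origin = parse(src_units)
    dest_scale, dest_origin = parse(dest_units)
    seconds = value * src_scale + (src_origin - dest_origin).total_seconds()
    return seconds / dest_scale


convert_time_value(1.0, "days since 2020-01-02", "hours since 2020-01-01")  # → 48.0
```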

destination_file = os.path.join(destination, file_)
logger.info(f"Appending {rundir_file} to {destination_file}")
append_zarr_along_time(rundir_file, destination_file, fs)
shutil.rmtree(rundir_file) # remove local copy so not uploaded twice
Contributor

At first glance, this line looks very dangerous, but I realize now that it is happening on a temporary copy. Can you rename rundir to tmp_rundir (or something) to clarify that this is clean-up of temporary files?

Contributor Author

Good idea.

| fv3gfs-image | Docker image used to run model. Must have fv3gfs-wrapper and fv3config installed. |
| post-process-image | Docker image used to post-process and upload outputs |
| chunks | (optional) String describing desired chunking of diagnostics |
| cpu | (optional) Requested cpu for run-model step |
Contributor

@nbren12 nbren12 Aug 26, 2020

For the optional values, it would be nice to know the defaults, but I admit these could quickly become out of date.

Contributor Author

Yeah I went back and forth on this. We need an argo parser to auto-gen the docs!

@oliverwm1 oliverwm1 merged commit 3938bc9 into master Aug 26, 2020
@oliverwm1 oliverwm1 deleted the feature/segmented-fv3gfs-runs branch August 26, 2020 19:38