Add additional files to the mpas-analysis cache for time series analysis #760
Conversation
Force-pushed from 96b3cfe to 2084313.
```shell
for file in "${files[@]}"
do
    cp ${identifier}/timeseries/${file} cache/timeseries/${file}
done
```
@jonbob This is the set of lines that will need to be run once manually to cache the files that weren't cached with the released version of MPAS-Analysis.

Note: `files` is defined above as `files=( "mpasIndexOcean.nc" "mpasTimeSeriesOcean.nc" "seaIceAreaVolNH.nc" "seaIceAreaVolSH.nc" )`.
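A minimal sketch of that one-time manual backfill (assuming `${identifier}` points at the existing mpas-analysis output directory, and that `cache/timeseries` may need to be created first):

```shell
# One-time backfill of the cache with files the released MPAS-Analysis
# version did not cache. The mkdir -p is an added safeguard, not from the PR.
files=( "mpasIndexOcean.nc" "mpasTimeSeriesOcean.nc" "seaIceAreaVolNH.nc" "seaIceAreaVolSH.nc" )
mkdir -p cache/timeseries
for file in "${files[@]}"
do
    cp ${identifier}/timeseries/${file} cache/timeseries/${file}
done
```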
@cbegeman I think this line might be the trouble that caused the `cp: cannot stat 'cache/mpasIndexOcean.nc': No such file or directory` issues in my second test run. The cache output files are saved in `cache/timeseries`; however, when the second run reads files back from the cache, it tries to find them directly in `cache`.
@cbegeman thank you for working on this update. It looks like the mpas-analysis update has been merged (MPAS-Dev/MPAS-Analysis#1126) and @xylar has an activation script to use this latest version. To be able to test this with the HR spin-up run through zppy, we will need to finalize this branch as well. Is this branch ready for review?
Not quite, but I will tomorrow...
@chengzhuzhang and @xylar Thanks for your help! Yes, it is ready for review.
@xylar Let me know once you have the activation commands and then I can run a subset of the tests.
Okay, the environment and activation script are in place.
Testing steps -- `mpas_analysis` subset:

```shell
cd ~/ez/zppy
git status
# On branch add-legacy-cfgs
# nothing to commit, working tree clean
git remote add cbegeman git@github.com:cbegeman/zppy.git
git remote -v
# cbegeman git@github.com:cbegeman/zppy.git (fetch)
# cbegeman git@github.com:cbegeman/zppy.git (push)
git fetch cbegeman fix-mpas-analysis-cache
git checkout -b fix-mpas-analysis-cache cbegeman/fix-mpas-analysis-cache
git log
# Good, last commits:
# Add additional files to the mpas-analysis cache for time series analysis
# Bump to 3.1.0 (#759)
# Edit tests/integration/utils.py
# TEST_SPECIFICS:
# TEST_SPECIFICS: Dict[str, Any] = {
# # These are custom environment_commands for specific tasks.
# # Never set these to "", because they will print the line
# # `environment_commands = ""` for the corresponding task,
# # thus overriding the value set higher up in the cfg.
# # That is, there will be no environment set.
# # (`environment_commands = ""` only redirects to Unified
# # if specified under the [default] task)
# "diags_environment_commands": "source /lcrc/soft/climate/e3sm-unified/load_latest_e3sm_unified_chrysalis.sh",
# "global_time_series_environment_commands": "source /lcrc/soft/climate/e3sm-unified/load_latest_e3sm_unified_chrysalis.sh",
# "pcmdi_diags_environment_commands": "source /lcrc/soft/climate/e3sm-unified/load_latest_e3sm_unified_chrysalis.sh",
# # This is the environment setup for other tasks.
# # Leave as "" to use the latest Unified environment.
# "environment_commands": "source /lcrc/soft/climate/e3sm-unified/load_latest_e3sm_unified_chrysalis.sh",
# "cfgs_to_run": [
# "weekly_bundles",
# "weekly_comprehensive_v2",
# "weekly_comprehensive_v3",
# "weekly_legacy_3.0.0_bundles",
# "weekly_legacy_3.0.0_comprehensive_v2",
# "weekly_legacy_3.0.0_comprehensive_v3",
# ],
# "tasks_to_run": [
# # "e3sm_diags",
# "mpas_analysis",
# # "global_time_series",
# # "ilamb",
# # "pcmdi_diags",
# ],
# "unique_id": "test-fix-mpas-analysis-cache",
# }
git diff
# Diff looks good
lcrc_conda # Activate conda
rm -rf build
conda clean --all --y
conda env create -f conda/dev.yml -n test-fix-mpas-analysis-cache
conda activate test-fix-mpas-analysis-cache
pre-commit run --all-files
python -m pip install .
pytest tests/test_*.py
# 35 passed in 0.40s
python tests/integration/utils.py
# Everything is using Unified.
# We still need to add the mpas-analysis specific environment!
git add -A
git commit -m "Changes for testing"
# Add mpas_analysis_environment_commands
git diff
# Diff looks good
python tests/integration/utils.py
# Good, now we have:
# mpas_analysis_environment_commands=source /lcrc/soft/climate/e3sm-unified/load_hr_mpas_analysis.sh
pre-commit run --all-files # Iterate until passing
git add -A
python tests/integration/utils.py # Rerun
git status # Confirm there were no changes based on the rerun
git commit -m "Add mpas_analysis_environment_commands"
# Now, we have one more testing change to make.
# The bundles test cfgs don't test for mpas_analysis, so we can remove them:
# "cfgs_to_run": [
# # "weekly_bundles",
# "weekly_comprehensive_v2",
# "weekly_comprehensive_v3",
# # "weekly_legacy_3.0.0_bundles",
# "weekly_legacy_3.0.0_comprehensive_v2",
# "weekly_legacy_3.0.0_comprehensive_v3",
# ],
python tests/integration/utils.py
git diff
# Diff looks good
zppy -c tests/integration/generated/test_weekly_comprehensive_v2_chrysalis.cfg
zppy -c tests/integration/generated/test_weekly_comprehensive_v3_chrysalis.cfg
zppy -c tests/integration/generated/test_weekly_legacy_3.0.0_comprehensive_v2_chrysalis.cfg
zppy -c tests/integration/generated/test_weekly_legacy_3.0.0_comprehensive_v3_chrysalis.cfg
sq | wc -l # Tue 12/02 10:47 => 82 - header row = 81 jobs
# For reference:
alias sq
# alias sq='sqa -u ac.forsyth2'
alias sqa
# alias sqa='squeue -o "%8u %.7a %.4D %.9P %7i %.2t %.10r %.10M %.10l %j" --sort=P,-t,-p'
sq | wc -l # Tue 12/02 11:17 => 12 - header row = 11 jobs
sq
# All jobs have state: pending for reason: dependency.
# So, nothing more is going to run.
# Review finished runs
### v2 ###
cd /lcrc/group/e3sm/ac.forsyth2/zppy_weekly_comprehensive_v2_output/test-fix-mpas-analysis-cache/v2.LR.historical_0201/post/scripts
grep -v "OK" *status
# global_time_series_1980-1990.status:WAITING 1028901
# mpas_analysis_ts_1980-1984_climo_1980-1984.status:ERROR (1)
# mpas_analysis_ts_1980-1990_climo_1985-1990.status:WAITING 1028900
cd /lcrc/group/e3sm/ac.forsyth2/zppy_weekly_legacy_3.0.0_comprehensive_v2_output/test-fix-mpas-analysis-cache/v2.LR.historical_0201/post/scripts
grep -v "OK" *status
# global_time_series_1980-1990.status:WAITING 1028948
# mpas_analysis_ts_1980-1984_climo_1980-1984.status:ERROR (1)
# mpas_analysis_ts_1980-1990_climo_1985-1990.status:WAITING 1028947
### v3 ###
cd /lcrc/group/e3sm/ac.forsyth2/zppy_weekly_comprehensive_v3_output/test-fix-mpas-analysis-cache/v3.LR.historical_0051/post/scripts
grep -v "OK" *status
# global_time_series_classic_pdf_both_1985-1995.status:WAITING 1028927
# global_time_series_classic_pdf_original_1985-1995.status:WAITING 1028928
# global_time_series_viewer_both_1985-1995.status:WAITING 1028924
# global_time_series_viewer_original_1985-1995.status:WAITING 1028925
# mpas_analysis_ts_1985-1989_climo_1985-1989.status:ERROR (1)
# mpas_analysis_ts_1985-1995_climo_1990-1995.status:WAITING 1028923
cd /lcrc/group/e3sm/ac.forsyth2/zppy_weekly_legacy_3.0.0_comprehensive_v3_output/test-fix-mpas-analysis-cache/v3.LR.historical_0051/post/scripts
grep -v "OK" *status
# global_time_series_1985-1995.status:WAITING 1028965
# mpas_analysis_ts_1985-1989_climo_1985-1989.status:ERROR (1)
# mpas_analysis_ts_1985-1995_climo_1990-1995.status:WAITING 1028964
```

@cbegeman, @xylar -- I encountered some errors while testing. Output paths below:

```shell
# Let's review the errors
cd /lcrc/group/e3sm/ac.forsyth2/zppy_weekly_comprehensive_v2_output/test-fix-mpas-analysis-cache/v2.LR.historical_0201/post/scripts
emacs mpas_analysis_ts_1980-1984_climo_1980-1984.o1028899
# TypeError: cannot pickle 'lxml.etree._ElementTree' object
cd /lcrc/group/e3sm/ac.forsyth2/zppy_weekly_legacy_3.0.0_comprehensive_v2_output/test-fix-mpas-analysis-cache/v2.LR.historical_0201/post/scripts
emacs mpas_analysis_ts_1980-1984_climo_1980-1984.o1028946
# TypeError: cannot pickle 'lxml.etree._ElementTree' object
cd /lcrc/group/e3sm/ac.forsyth2/zppy_weekly_comprehensive_v3_output/test-fix-mpas-analysis-cache/v3.LR.historical_0051/post/scripts
emacs mpas_analysis_ts_1985-1989_climo_1985-1989.o1028922
# TypeError: cannot pickle 'lxml.etree._ElementTree' object
cd /lcrc/group/e3sm/ac.forsyth2/zppy_weekly_legacy_3.0.0_comprehensive_v3_output/test-fix-mpas-analysis-cache/v3.LR.historical_0051/post/scripts
emacs mpas_analysis_ts_1985-1989_climo_1985-1989.o1028963
# TypeError: cannot pickle 'lxml.etree._ElementTree' object
```

Common theme seems to be the `TypeError: cannot pickle 'lxml.etree._ElementTree' object` error.
@forsyth2 Thanks for testing! Is it clear to any of you whether the cache files being absent at the first call to mpas_analysis is responsible for the `TypeError`?
Note: I'll be merging #763 to update the test suite accordingly (tests needed to be updated to use a custom environment for `mpas_analysis`).
Here's what Claude says when I give it the bash file:

The error occurs because the MPAS analysis code is trying to pickle (serialize) an `lxml.etree._ElementTree` object, which cannot be pickled. Looking at the stack trace, the problem happens when the code tries to start parallel tasks.

Potential solutions:

Quick fix: Modify the script to set

```ini
[execute]
parallelTaskCount = 1
```

This will make the analysis run serially, avoiding the pickling issue, though it will be slower.
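For reference, `parallelTaskCount` is an option in the `[execute]` section of the MPAS-Analysis configuration, and a local override config can be passed on the command line (the file name below is hypothetical):

```shell
# Run MPAS-Analysis with a local config that sets parallelTaskCount = 1
mpas_analysis my_overrides.cfg
```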
Further commentary from Claude: No, the missing cache files are not responsible for the `TypeError`.

1. Missing Cache Files (Warning, not an error)

```
cp: cannot stat 'cache/mpasIndexOcean.nc': No such file or directory
cp: cannot stat 'cache/mpasTimeSeriesOcean.nc': No such file or directory
cp: cannot stat 'cache/seaIceAreaVolNH.nc': No such file or directory
cp: cannot stat 'cache/seaIceAreaVolSH.nc': No such file or directory
```

These warnings occur because this appears to be the first run of the analysis, so the cache directory doesn't have these files yet. The script handles this gracefully: it tries to copy cached files if they exist (to speed up processing), but continues normally if they don't. The files will be created during this run and cached for future runs (as shown at the end of the script with the "CACHE OUTPUT FILES" section).

2. TypeError (The actual crash)

The `TypeError: cannot pickle 'lxml.etree._ElementTree' object` error occurs when Python's multiprocessing tries to serialize the `lxml.etree._ElementTree` object. The missing cache files have no connection to the pickling error. The pickling error is a code/compatibility issue with how MPAS-Analysis handles XML parsing in parallel contexts.
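For what it's worth, the unpicklable type can be reproduced standalone, independent of MPAS-Analysis (a minimal sketch; requires `lxml` in the active environment):

```shell
# Minimal repro: lxml element trees cannot be serialized by pickle.
python -c "
import pickle
from lxml import etree

tree = etree.ElementTree(etree.fromstring('<root/>'))
pickle.dumps(tree)  # TypeError: cannot pickle 'lxml.etree._ElementTree' object
"
```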
@forsyth2 Regardless, I can push a commit to only copy those files if they exist.
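That guard might look like this minimal sketch (hypothetical; shown here as the restore step, reading from `cache/timeseries`, where the files are actually written, rather than `cache/`):

```shell
# Copy cached time-series files back only if they exist, so a first run
# with an empty cache no longer emits "cannot stat" warnings.
for file in "${files[@]}"
do
    if [ -f "cache/timeseries/${file}" ]; then
        cp "cache/timeseries/${file}" "${identifier}/timeseries/${file}"
    fi
done
```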
Yes, that's what it looks like to me at least.

Thanks, I guess it's not strictly necessary, but it would help clean up the logs.
I don't think so. I don't think MPAS-Analysis should be doing any pickling.
Then, what would be doing the pickling? Based on the error output above, it looks like the error must be occurring inside the `mpas_analysis` call itself.
Looking at the error messages that @forsyth2 pointed to, it does look like it's happening in MPAS-Analysis. However, I suspect that it is being triggered by an unsafe use of ESMF in @forsyth2's test run. When we use ESMF from conda (i.e. with any environment other than E3SM-Unified), we shouldn't be calling it in parallel.
It turns out that ...
@xylar I can share more of the testing setup. All the steps I ran are in the expandable section in #760 (comment). Each directory also has a provenance cfg. Maybe something isn't set up correctly in the cfgs?
Historically, we have tested `mpas_analysis` with the E3SM-Unified environment. That would mean this issue is independent of this PR and is in fact an issue with how we test `mpas_analysis` with a custom environment.
It looks like you all need to be setting ...

Should that be true in general for standard ...?
Yes, it's possible that the pickle error is a side effect of an ESMF failure that you wouldn't have seen before. My hope is that running ESMF in serial might cause it not to return an error, which I think may be triggering the pickling issue. But it's also possible (maybe even likely) that the pickling problem is happening after the ESMF call finishes successfully. I will have to investigate further tomorrow.
That's all fine. The ...
@forsyth2 I'm on board with everything you're saying. Sorry for the confusion.
Here I'm documenting my steps to test this PR with the updated mpas-analysis task:
chengzhuzhang left a comment
The code change looks reasonable to me; I will update after re-testing.
@xylar, it would be great if you could do a code review. Thanks.
@wlin7 and @xylar, with the new commit, running mpas-analysis through zppy created reasonable time series. Please review:

1. The testing setup: #760 (comment)
2. The results here
3. The output directory: /lcrc/group/e3sm2/ac.zhang40/E3SMv3_testings/v3.LR.historical_0051_try7

Please review whether this PR provides the expected behavior, and also note the implications for other dependencies. One thing I checked: the global-time-series plots use files from `cache/timeseries`, so as long as this directory structure remains, the downstream global-time-series plotting won't be impacted.

However, there is one important item to note regarding the deletion of MPAS history files. Among the three global mean time-series plots (tracking changes in ocean heat content, sea level, and MOC), the MOC plot uses a file from the mpas-analysis cache; the other two, however, are computed from MPAS history files (see the code here).
xylar left a comment
One line I think probably should be removed.
@chengzhuzhang, can your comments be resolved at this point?
```shell
# Remove one particularly large file which does not need to be cached
rm ${identifier}/timeseries/mpasTimeSeriesSeaIce.nc
```
This seems to just be deleting a time-series file from the analysis directory, and not to be related to caching at all. Is this perhaps a relic of an earlier approach to copying cache files and something no longer needed? I would think it wouldn't be a great idea to delete output from the mpas-analysis output directory, even if it is large, unless we decide on a systematic way of cleaning things up that doesn't lose important data.
Removing large files from MPAS-Analysis that are not needed was requested by HR (specifically @wlin7). Do you think that MPAS-Analysis would be a better place to do this clean-up? I'm not particularly tied to including this in this PR, but maybe some more context would be helpful:
This file seems to be somewhat of an outlier among MPAS-Analysis output files in that it has dimensions of Time (equal to the length of the requested time series) and nCells (equal to the size of the whole mesh, which seems particularly egregious given that only a portion of those cells would contain sea ice). Correct me if there are other files like this. It's not clear to me that this would be "losing important data", because it's just copying information from history files that would (ideally) be archived, as opposed to derived analysis variables.
```
ncinfo timeseries/mpasTimeSeriesSeaIce.nc
...
dimensions(sizes): Time(72), StrLen(64), nCells(465044)
variables(dimensions): |S1 xtime_startMonthly(Time, StrLen), |S1 xtime_endMonthly(Time, StrLen), float32 timeMonthly_avg_iceAreaCell(Time, nCells), float32 timeMonthly_avg_iceVolumeCell(Time, nCells), float32 timeMonthly_avg_snowVolumeCell(Time, nCells)
```
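For a rough sense of scale (my arithmetic, not from the thread): each `float32` variable here holds 72 × 465044 × 4 bytes ≈ 134 MB, so the three cell-wise variables alone come to roughly 400 MB uncompressed, and the file grows linearly as the time series is extended.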
Yes, I don't think zppy should be deleting MPAS-Analysis files. This needs to be fixed in MPAS-Analysis instead. Sorry for the extra work, though!
Yes, there is a corresponding file for the ocean as far as I know.
Let's chat about this on Slack.
After having a conversation on Slack, I've come around to this approach.
I think MPAS-Analysis would ideally delete this file itself. I think MPAS-Analysis would also ideally extract only the variables that it plots directly from mpasTimeSeriesOcean.nc into other, smaller time series files that could be cached instead. But this was not how it was designed and making these changes in a robust way will likely take more time than we have before HR needs a working solution.
So I am fine with this approach as a less ideal but working alternative.
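As an illustration of the extraction idea above, a sketch using NCO (the variable names are placeholders, not the actual plotted variables):

```shell
# Hypothetical: subset only the directly-plotted variables into a smaller,
# cacheable time-series file instead of caching the full-size original.
ncks -v plottedVar1,plottedVar2 mpasTimeSeriesOcean.nc mpasTimeSeriesOceanSubset.nc
```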
@chengzhuzhang, I didn't quite follow what the solution is to this problem. Do you think the MPAS-Ocean history files for ocean heat content and sea level need to be retained in order for global time series analysis to succeed? Or is the suggestion to skip those variables? Or something else? It seems like it will be a problem if history files have to be retained for those plots, and also if those plots have to be skipped.
xylar left a comment
I'm approving after walking back my requested change from before.
Thank you for the review, @xylar.
@xylar Thanks for explaining the reasoning for the design decisions. @cbegeman Thank you for your work on this PR! @chengzhuzhang Do we just want a final review from @wlin7 before merging this? (I'll also want to merge the testing update #763.)
Yes.
Thank you @cbegeman @xylar @chengzhuzhang for getting this enhanced caching mechanism to work. I did a verification with a regular run of mpas-analysis for ts_1850-1859_climo_1855-1859 vs. this one with caching. The IICE comparison confirms the diagnostics are identical.
@wlin7 and @chengzhuzhang Thank you both for your review and testing! I think this is good to merge when @wlin7 approves.
Thank you @wlin7 for reviewing and confirming!
Thanks @cbegeman for working on this. Thanks @chengzhuzhang, @xylar, @wlin7 for reviewing/testing. I also just merged the associated test update (#763).
Great. Thanks, @forsyth2!
@xylar For the `mpas_analysis` tests, should the HR activation script become the new default for testing?
@forsyth2, I don't think so. I think the HR branch is unofficial and shouldn't become the new default for testing.
Summary
Objectives:

- Add additional files to the mpas-analysis cache for time series analysis
- Remove one particularly large file (with Time and `nCells` dimensions) once MPAS-Analysis has completed, since it is not needed to extend time series plots

Select one: This pull request is...
Small Change