Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update mam4xx to include fix for wet scavenging #6894

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

mjschmidt271
Copy link
Contributor

This brings in changes to mam4xx's wet_dep.hpp that correct the size of a work array to fix out-of-bounds errors that have led to recent fails for the mam4_wetscav_standalone test.

Note: This PR points to a branch in mam4xx that has yet to be merged into main--I will force push when that associated PR is approved/merged.

@mjschmidt271
Copy link
Contributor Author

@bartgol @mahf708 @AaronDonahue @jgfouca

Can't add reviewers, so please take a look when you have a chance!

@singhbalwinder singhbalwinder added the MAM4xx MAM4xx related changes label Jan 13, 2025
@singhbalwinder
Copy link
Contributor

singhbalwinder commented Jan 13, 2025

Thanks, @mjschmidt271. Can you please also uncomment the wet scavenging test in the PR? it was commented out in #6882 PR

@mjschmidt271
Copy link
Contributor Author

Thanks, @mjschmidt271. Can you please also uncomment the wet scavenging test in the PR? it was commented out in #6882 PR

Ah, good call--will do!

@bartgol
Copy link
Contributor

bartgol commented Jan 14, 2025

Did you verify that with this mods we are now deterministic on GPU? If so, you could re-enable all mam4 tests that were previously disabled. The checks will fail since we don't have the up-to-date baselines, but at least we can then gen baselines right after merging this. Also, you should add some cmake code for the mam4 tests that run the np1_vs_npN tests, so we can already verify non-determinism there.

Edit: turns out we only run np1 on GPU, but I am planning to change that soon. Back when we tested on weaver, we had all builds in || on the same compute node, so we did not know if we had 1 or 2 gpus (3 builds for 4 GPUs), so we just hard-coded to 1 max rank. Now we run ONE build on a blake node, which has 2 GPUs, so we should run also np2 tests.


# Currently, this test has a runtime error. Disabling while MAM folks debug.
# add_subdirectory(mam/wet_scav)
add_subdirectory(mam/wet_scav)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are also currently disabled mam tests in the multiple-process subfolders.

@mjschmidt271
Copy link
Contributor Author

@bartgol - I don't believe this will affect the non-determinism in the mam4_aero_microphys_standalone test, but is restricted to the wet scavenging issue you'd previously mentioned with out-of-bounds access. As such, mam4_wetscav_standalone is re-enabled but not mam4_aero_microphys_standalone.

I know James is working on the microphysics non-det, though, so I'd guess he'll knock it out before I even have a chance to take a look again 🙂

And I will put those extra test cases next on my list--thanks for that info!

@bartgol
Copy link
Contributor

bartgol commented Jan 14, 2025

@bartgol - I don't believe this will affect the non-determinism in the mam4_aero_microphys_standalone test, but is restricted to the wet scavenging issue you'd previously mentioned with out-of-bounds access. As such, mam4_wetscav_standalone is re-enabled but not mam4_aero_microphys_standalone.

I know James is working on the microphysics non-det, though, so I'd guess he'll knock it out before I even have a chance to take a look again 🙂

And I will put those extra test cases next on my list--thanks for that info!

Ah I see. Thanks!

@mjs271 mjs271 force-pushed the mjs271/mam4xx/resolve_wetscav branch from f78d3d1 to dbf4f3b Compare January 14, 2025 01:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
MAM4xx MAM4xx related changes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants