add LAMMPS test for ALL and ALL+OBMD #303
|
This one proved hard to test, because it requires custom modules... I'll dump everything that's needed here, in case anyone ever needs to reproduce: For LAMMPS+OBMD For LAMMPS+ALL
For LAMMPS+ALL+OBMD:
|
|
I had to add the following to my |
|
For LAMMPS+OBMD I'm getting ok-ish runs, but occasionally I get a failure with e.g.: I also got: Both were on Genoa, 2-node scale. Edit: I was rerunning repeatedly to see if failures were consistent, but now I seem to get hangs (twice already) :( |
|
LAMMPS+ALL+OBMD: failing about 50% of the time. It's not consistent whether this is at 1-node or 2-node scale; both scales succeed every now and again. Edit: out of three runs, it seems that 2-node cases fail more often than 1-node, and genoa (192 cores) fails more often than rome (128 cores). Maybe high core counts make this unstable? Edit2: ran using |
|
LAMMPS+ALL: failing every time with sanity errors like: Edit: running with |
Which of the tests? |
Co-authored-by: Caspar van Leeuwen <33718780+casparvl@users.noreply.github.com>
These two: (on rome and genoa, so 4 test instances actually) |
|
Yeah, I'm not sure about that check. Rodrigo said that all those values should be under 1.1, but even in the example they are not, so I picked the one that seemed to be under 1.1 most of the time. |
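A minimal sketch of how a "mostly under 1.1" check could be expressed, in case it helps the discussion. The function names, the `min_fraction` parameter, and the sample values are all hypothetical and not taken from the test suite or from a real run; only the 1.1 threshold comes from the comment above.

```python
def fraction_below(values, threshold=1.1):
    """Return the fraction of values strictly below the threshold."""
    values = list(values)
    if not values:
        raise ValueError("no values to check")
    return sum(v < threshold for v in values) / len(values)


def passes_threshold_check(values, threshold=1.1, min_fraction=1.0):
    """Pass if at least min_fraction of the values are below the threshold.

    min_fraction=1.0 is the strict reading ("all values under 1.1");
    a lower value tolerates occasional outliers.
    """
    return fraction_below(values, threshold) >= min_fraction


# Hypothetical sample values, not taken from a real run.
sample = [1.02, 1.08, 1.15, 1.05]
print(fraction_below(sample))                             # 0.75
print(passes_threshold_check(sample))                     # False (strict)
print(passes_threshold_check(sample, min_fraction=0.75))  # True (relaxed)
```

Making the tolerated fraction explicit would at least document how strict the check is meant to be, rather than relying on which value happened to be picked.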
|
Should EESSI_LAMMPS_ALL_balance_staggered_global only run on 16 cores or fewer?
|
I got for the 1-core test case for OBMD: The reason is that the unit changes. In the output file, I see: But the performance pattern matches for |
|
To test this, we should test three different environments:
Before you start, make sure to have this in your config: to make sure the
1. OBMD, ALL, ALL+OBMD with EESSI/2025.06 and
   This should run 56 tests for each CPU partition you have.
2. OBMD with EESSI/2023.06 and
3. All traditional LAMMPS tests with
In a (clean) interactive environment, do: |
|
I tested what was described in #303 (comment)
Case 1:
Case 2:
Note that the failure at 16 nodes was known, and is resolved simply by running with a newer LAMMPS version based on the 2025.06 toolchain (i.e. the ALL+OBMD installation, see case 1).
Case 3:
The 12 tests were aborted because I couldn't get H100 allocations within reasonable time, nor could I run on 16 A100 GPUs in reasonable time (queue too busy). On 4 and 8 A100s, I got failures for the
I don't think those are specific to the PR I've done here; the tests probably never ran successfully at these scales, so I don't think it should block the current PR. |
|
OBMD: |
Other |
|
#303 (comment) => Yes, that's expected; see my comments under the results of Case 2 on #303 (comment)
Not sure why, but case 1 also spawned cases detecting the old LAMMPS module, which of course failed. But maybe this was because my env was not clean. Other than that, all tests passed. |
|
For Case 3: All tests passed, so no issues. |
satishskamath
left a comment
lgtm!
Tests:
Sanity check/Validation of output:
This is checking the value of atoms in the output. For lj and rhodo this value was stable, but for simmulation.staggered.global it is not. Is this a good value to check? Since it does not seem to vary a lot, I added a check that the difference has to be under 50. Or maybe we should just drop this for the simulation?
This is checking the value of neighbors in the output. For lj and rhodo this value was stable but for staggered.global this is not. Is this a good value to check?
Need to check columns 13 - 18 (f_5[*]); not sure how yet.
Need to use ndenprof.py with nden_profile.out, but have not figured out how to best do that.
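For reference, a rough sketch of the two checks discussed above: a tolerance check on the atom count, and pulling columns 13 - 18 out of an output row. It assumes the atom count is read from the `Loop time of ... with N atoms` summary line that LAMMPS prints at the end of a run, and that the rows of interest are whitespace-separated with 1-indexed columns. The function names, the reference value, and the sample strings are hypothetical. What the f_5[*] values should be compared against is still open, so the sketch only extracts them.

```python
import re

# Summary line LAMMPS prints at the end of a run, e.g.
# "Loop time of 12.5 on 16 procs for 1000 steps with 32040 atoms"
LOOP_RE = re.compile(
    r"Loop time of \S+ on \d+ procs for \d+ steps with (\d+) atoms"
)


def atoms_within_tolerance(output, reference, tolerance=50):
    """True if the reported atom count differs from reference by less than tolerance."""
    match = LOOP_RE.search(output)
    if match is None:
        return False  # no summary line found, treat as a failed check
    return abs(int(match.group(1)) - reference) < tolerance


def select_columns(line, first=13, last=18):
    """Return columns first..last (1-indexed, inclusive) of a row as floats."""
    fields = line.split()
    if len(fields) < last:
        raise ValueError(f"expected at least {last} columns, got {len(fields)}")
    return [float(x) for x in fields[first - 1:last]]


# Hypothetical sample data, not taken from a real run.
sample_output = "Loop time of 12.5 on 16 procs for 1000 steps with 32040 atoms"
print(atoms_within_tolerance(sample_output, reference=32000))  # True

sample_row = " ".join(str(i) for i in range(1, 21))  # columns hold 1..20
print(select_columns(sample_row))  # [13.0, 14.0, 15.0, 16.0, 17.0, 18.0]
```

In a ReFrame test the same logic would presumably be written with the framework's sanity functions; plain Python is used here only to keep the sketch self-contained.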