update icepack and fix 2 failing tests #163

Merged
merged 6 commits into CICE-Consortium:master from apcraig:cicetestC on Aug 6, 2018

Conversation

@apcraig (Contributor) commented Aug 2, 2018

Update icepack version to current master, fix 2 failing tests, add binary restart test, update gordon intel compiler version

  • Developer(s): tcraig

  • Please suggest code Pull Request reviewers in the column at right.

  • Are the code changes bit for bit, different at roundoff level, or more substantial? bit-for-bit

  • Is the documentation being updated with this PR? (Y/N) N
    If not, does the documentation need to be updated separately at a later time? (Y/N) N

  • Other Relevant Details:

This change is bit-for-bit; see the full test results for conrad, hash 8b44ba1, here:

https://github.com/CICE-Consortium/Test-Results/wiki/cice_by_hash_forks

There are still failing tests, but the boxrestore and medium tests now run and validate. The gx1 tests need a mod in icepack which will be coming soon. This PR also adds a binary restart test; the validation process for multiple binary restart files still needs to be updated. Finally, the gordon intel compiler version is updated to be consistent with conrad; the mismatch was causing some test failures.

@eclare108213 (Contributor) left a comment

Should the icepack submodule update be here?
Is there a way for the scripts to recognize when there aren't enough processors available for a given test, and gracefully skip it with a message to that effect?

@apcraig (Contributor, Author) commented Aug 3, 2018

My intent with this PR was mainly to update the icepack submodule, so that is correct. I also fixed a few other issues in CICE that deal with some test problems.

Regarding skipping tests when there aren't enough resources: it's probably possible, but we'd have to give the scripts some information about how many processors each machine has. It wouldn't be too hard to implement (a rough sketch is below). The machine should also automatically reject jobs that are too large. Maybe we could open an issue to discuss this feature and how we might want it to behave.
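A minimal sketch, in csh, of what such a check might look like. It assumes the machine env settings provide a maximum PE count; ICE_MACHINE_MAXPES, task, thrd, testname, and results.log are illustrative names here, not necessarily the actual CICE variables or files:

# illustrative sketch: skip a test whose PE request exceeds the machine's capacity
@ npes = $task * $thrd
if ($npes > $ICE_MACHINE_MAXPES) then
  echo "SKIP ${testname}: needs $npes PEs but machine max is $ICE_MACHINE_MAXPES"
  echo "SKIP ${testname} too_many_pes" >> results.log
else
  ./cice.submit
endif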

@eclare108213 (Contributor)

There's at least one file missing...

cice.setup: ERROR, /turquoise/usr/projects/climate/eclare/CICE.v6.0/github/CICE.cicetestC/configuration/scripts/options/set_[nml,env].iobinary not found

@apcraig (Contributor, Author) commented Aug 3, 2018

@eclare108213 thanks for catching that; I just added the missing files.

@eclare108213 (Contributor)

I ran the base_suite, and everything passed except the 40-pe run, which I expected not to work because the queue I use won't allow that many processors. It shows as pending in the test results, even though the submission actually failed.

[eclare@pi-fe1 testsuite.b01]$ results.csh | grep PEND
PEND pinto_intel_restart_gx1_40x4_droundrobin_short run
1 of 108 tests PENDING
[eclare@pi-fe1 testsuite.b01]$
[eclare@pi-fe1 testsuite.b01]$ squeue -u eclare
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)

I didn't find a record of the failure in the base_suite output, though. I had to resubmit the job for just that case in order to get this info:

[eclare@pi-fe1 pinto_intel_restart_gx1_40x4_droundrobin_short.b01]$ cice.submit
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

Maybe we could capture that information somehow? I don't know how important this is.

@apcraig (Contributor, Author) commented Aug 4, 2018

Let me think about this issue. If a case cannot run, I think it shows up as "grey", i.e. "unknown", in the results. Does it prevent report_results from running or otherwise create problems? We could try to have the submit script capture the error and report it to the test results (a rough sketch is below). We could also remove the larger jobs from the base_suite so that it consists only of smaller, shorter jobs, and then add another suite, bigjobs_suite, that builds and runs bigger cases (more pes, longer runs, larger grids) and is only run on hardware that can handle them. If the case returns an "unknown" result and that doesn't get in the way of other features, then I think that's fine (maybe even correct). If not, I'll look into fixing it.
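A minimal sketch, in csh, of how the submit step might capture a failed batch submission; casename, submit.log, and results.log are illustrative names, and the actual scripts may handle submission differently:

# illustrative sketch: record a failed submission instead of leaving the case pending
sbatch cice.run >& submit.log
if ($status != 0) then
  echo "FAIL ${casename} submit" >> ../results.log
  cat submit.log
endif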

@apcraig (Contributor, Author) commented Aug 4, 2018

I think fixing the reporting/behavior of the unrunnable job should be a separate issue, maybe addressed in a future PR.

@eclare108213 (Contributor)

That's fine, we can address this issue separately. The pending test shows up as 'fail' on the test-results wiki:
https://github.com/CICE-Consortium/Test-Results/wiki/3153aeaebb.pinto.intel.180803.195744

@eclare108213 eclare108213 merged commit 0b910c8 into CICE-Consortium:master Aug 6, 2018
@apcraig apcraig deleted the cicetestC branch August 17, 2022 20:56