update icepack and fix 2 failing tests #163

Merged
merged 6 commits into CICE-Consortium:master from apcraig:cicetestC on Aug 6, 2018

Conversation

@apcraig (Contributor) commented Aug 2, 2018

Update icepack version to current master, fix 2 failing tests, add binary restart test, update gordon intel compiler version

  • Developer(s): tcraig

  • Please suggest code Pull Request reviewers in the column at right.

  • Are the code changes bit for bit, different at roundoff level, or more substantial? bit-for-bit

  • Is the documentation being updated with this PR? (Y/N) N
    If not, does the documentation need to be updated separately at a later time? (Y/N) N

  • Other Relevant Details:

This change is bit-for-bit; see the full test results for conrad, hash 8b44ba1, here:

https://github.com/CICE-Consortium/Test-Results/wiki/cice_by_hash_forks

There are still failing tests, but the boxrestore and medium tests now run and validate. The gx1 tests need a mod in icepack which will be coming soon. This PR also adds a binary restart test; the validation process for multiple binary restart files still needs to be updated. Finally, the gordon intel compiler version is updated to be consistent with conrad; the mismatch was causing some test failures.

@eclare108213 (Contributor) left a comment

Should the icepack submodule update be here?
Is there a way for the scripts to recognize when there aren't enough processors available for a given test, and gracefully skip it with a message to that effect?

@apcraig (Contributor, Author) commented Aug 3, 2018

My intent with this PR was mainly to update the icepack submodule, so that is correct. I also fixed a few other issues in CICE that deal with some test problems.

Regarding skipping tests when there aren't enough resources: it's probably possible, but we'd have to give the scripts some information about how many processors each machine has. It wouldn't be too hard to implement (a rough sketch is below). The machine should also automatically reject jobs that are too large. Maybe we could open an issue to discuss this feature and how we might want it to behave.
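A minimal sketch, in csh, of what such a check might look like. It assumes the machine env settings provide a maximum PE count; ICE_MACHINE_MAXPES, task, thrd, testname, and results.log are illustrative names here, not necessarily the actual CICE variables or files:

# illustrative sketch: skip a test whose PE request exceeds the machine's capacity
@ npes = $task * $thrd
if ($npes > $ICE_MACHINE_MAXPES) then
  echo "SKIP ${testname}: needs $npes PEs but machine max is $ICE_MACHINE_MAXPES"
  echo "SKIP ${testname} too_many_pes" >> results.log
else
  ./cice.submit
endif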

@eclare108213 (Contributor)

There's at least one file missing...

cice.setup: ERROR, /turquoise/usr/projects/climate/eclare/CICE.v6.0/github/CICE.cicetestC/configuration/scripts/options/set_[nml,env].iobinary not found

@apcraig (Contributor, Author) commented Aug 3, 2018

@eclare108213 thanks for catching that; I just added the missing files.

@eclare108213 (Contributor)

I ran the base_suite, and everything passed except the 40-pe run, which I expected not to work because the queue I use won't allow that many processors. It shows as pending in the test results, even though the submission actually failed.

[eclare@pi-fe1 testsuite.b01]$ results.csh | grep PEND
PEND pinto_intel_restart_gx1_40x4_droundrobin_short run
1 of 108 tests PENDING
[eclare@pi-fe1 testsuite.b01]$
[eclare@pi-fe1 testsuite.b01]$ squeue -u eclare
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)

I didn't find a record of the failure in the base_suite output, though. I had to resubmit the job for just that case in order to get this info:

[eclare@pi-fe1 pinto_intel_restart_gx1_40x4_droundrobin_short.b01]$ cice.submit
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

Maybe we could capture that information somehow? I don't know how important this is.

@apcraig (Contributor, Author) commented Aug 4, 2018

Let me think about this issue. If a case cannot run, I think it shows up as "grey", i.e. "unknown", in the results. Does it prevent report_results from running or otherwise create problems? We could try to have the submit script capture the error and report it to the test results (a rough sketch is below). We could also remove the larger jobs from the base_suite so that it consists only of smaller, shorter jobs, and then add another suite, bigjobs_suite, that builds and runs bigger cases (more pes, longer runs, larger grids) and is only run on hardware that can handle them. If the case returns an "unknown" result and that doesn't get in the way of other features, then I think that's fine (maybe even correct). If not, I'll look into fixing it.
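A minimal sketch, in csh, of how the submit step might capture a failed batch submission; casename, submit.log, and results.log are illustrative names, and the actual scripts may handle submission differently:

# illustrative sketch: record a failed submission instead of leaving the case pending
sbatch cice.run >& submit.log
if ($status != 0) then
  echo "FAIL ${casename} submit" >> ../results.log
  cat submit.log
endif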

@apcraig (Contributor, Author) commented Aug 4, 2018

I think fixing the reporting/behavior of the unrunnable job should be a separate issue, maybe addressed in a future PR.

@eclare108213 (Contributor)

That's fine, we can address this issue separately. The pending test shows up as 'fail' on the test-results wiki:
https://github.com/CICE-Consortium/Test-Results/wiki/3153aeaebb.pinto.intel.180803.195744

@eclare108213 eclare108213 merged commit 0b910c8 into CICE-Consortium:master Aug 6, 2018
@apcraig apcraig deleted the cicetestC branch August 17, 2022 20:56