Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated CICE_InitMod.F90 for cice driver to set all tracer values to … #16

Merged
merged 3 commits into from
Aug 7, 2017

Conversation

apcraig
Copy link
Contributor

@apcraig apcraig commented Aug 4, 2017

…c0 over land after

initialization. This supports bit-for-bit testing on different pe counts and different
decompositions where land block elmination might be different and results in all land tracer
values to be zero on history and restart files.

Updated the test scripts and added a new -td feature that allows test results to be compared
with each other. Refactored some of the scripts to make them more flexible for the -td option.
Changed the test name convention so machine is the first item in the name. This is to make
it easier to use the -td option.

Modified tests so now there are only smoke and restart tests. Added new settings (run5day,
run10day, run1year, etc) to support different length smoke tests. Refactored the test
scripts so there is a test_smoke.script and a test_restart.script now. Moved the main
test script from under tests to the scripts directory and renamed it to cice.test.setup.csh.
Updated the README.test.

Updated the base_suite file to add new tests and make testnames conform to new convention.

Changed the restart test to be a 10 day run with restart at day 5.

Added other new options to support different batch wall times (short, medium long), support
modifications for grids (gx1, gx3), to turn on the bfbflag (bfbflagT).

Added some new error checks to make sure create.case aborts if there is a problem detected.

Modified create.case -h so it now automatically generates the list of -s options and the
list of -t tests as part of the documentation.

These results are bit-for-bit over the ocean and not bit-for-bit over land for some tracers.

…c0 over land after

initialization.  This supports bit-for-bit testing on different pe counts and different
decompositions where land block elmination might be different and results in all land tracer
values to be zero on history and restart files.

Updated the test scripts and added a new -td feature that allows test results to be compared
with each other.  Refactored some of the scripts to make them more flexible for the -td option.
Changed the test name convention so machine is the first item in the name.  This is to make
it easier to use the -td option.

Modified tests so now there are only smoke and restart tests.  Added new settings (run5day,
run10day, run1year, etc) to support different length smoke tests.  Refactored the test
scripts so there is a test_smoke.script and a test_restart.script now.  Moved the main
test script from under tests to the scripts directory and renamed it to cice.test.setup.csh.
Updated the README.test.

Updated the base_suite file to add new tests and make testnames conform to new convention.

Changed the restart test to be a 10 day run with restart at day 5.

Added other new options to support different batch wall times (short, medium long), support
modifications for grids (gx1, gx3), to turn on the bfbflag (bfbflagT).

Added some new error checks to make sure create.case aborts if there is a problem detected.

Modified create.case -h so it now automatically generates the list of -s options and the
list of -t tests as part of the documentation.

These results are bit-for-bit over the ocean and not bit-for-bit over land for some tracers.
Copy link
Contributor

@mattdturner mattdturner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks good to me. I haven't pulled and tested the changes, but will early next week.

The only thing I noticed that needs changing is in README.test, lines 161-162 and 168-169. Both of these pass "-m conrad" to create.case but then cd into "cheyenee_".

@apcraig
Copy link
Contributor Author

apcraig commented Aug 4, 2017

Thanks Matt. When you give it a try, let me know if there is anything you don't like or understand. We should start to put together a developers guide for the scripts once we feel the implementation is mature enough.

@eclare108213
Copy link
Contributor

I started running through the tests, following the new README.test instructions. The first one, setting up a baseline to compare to, worked fine. The second one fails, with no change to the code:
[eclare@wf-fe1 wolf_smoke_gx3_4x1.t01]$ cat test_output
PASS wolf_smoke_gx3_4x1 build
PASS wolf_smoke_gx3_4x1 run
FAIL wolf_smoke_gx3_4x1 compare different-data

Here is the standard out:
./cice.run:

CICE rundir is /net/scratch3/eclare/CICE_RUNS/wolf_smoke_gx3_4x1.t01
CICE log file is cice.runlog.170804-130215
CICE run started : Fri Aug 4 13:02:15 MDT 2017
CICE run finished: Fri Aug 4 13:02:28 MDT 2017

CICE COMPLETED SUCCESSFULLY
done ./cice.run

Regression Compare Mode:
Performing binary comparison between files
baseline: /net/scratch3/eclare/CICE_BASELINE/cicev6.0.0/wolf_smoke_gx3_4x1/restart/
test: /net/scratch3/eclare/CICE_RUNS/wolf_smoke_gx3_4x1.t01/restart/
Regression baseline and test dataset are different

and the standard error output:

cmp: /net/scratch3/eclare/CICE_RUNS/wolf_smoke_gx3_4x1.t01/restart/: Is a directory

So it looks like the filenames aren't included in the compare command?

@apcraig
Copy link
Contributor Author

apcraig commented Aug 4, 2017

I will test the cases in the README.test and update that document again. I was running the base suite. Maybe you can try that and see if it works. I agree that the README.test examples should ultimately be documented properly, but it's a work in progress :).

@eclare108213
Copy link
Contributor

eclare108213 commented Aug 4, 2017

Will do, and as long as you are updating README.test, there's still a mixture of conrad and cheyenne examples. Please pick one, and state somewhere near the first instance that it's a machine name, to be perfectly clear.

@eclare108213
Copy link
Contributor

Do the regression tests check or otherwise ensure that the model configuration used for the new version is the same as the old version, or provide an option for them to be different? For the Icepack restart implementation, we have been assuming that they would be (forced to be) the same and all tracers associated with a given configuration would be written for both the baseline and the comparison run.

@apcraig
Copy link
Contributor Author

apcraig commented Aug 4, 2017

So for regressions testing, the idea is that the default namelist and setup should not change. What we will do is have different settings, that can be specified via the -s command line option, to turn on different features and namelist. I imagine, we'll test 10-20 different configurations once we have things setup as part of a test suite. Each of those cases will have a different test name and each test is saved and checked separately. If the defaults change from version to version, or if a non-bit-for-bit change is introduced, the regression testing should pick that up. And those failures might be expected or not expected and the team can respond as needed.

@eclare108213
Copy link
Contributor

ok.
The suite isn't working either -- some ok, some not (see below). I checked one of the fails:
forrtl: severe (174): SIGSEGV, segmentation fault occurred

Find indices of diagnostic points

Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[30722,1],4]
Exit code: 174

At least two of the runs are failing on this line in ice_dyn_evp.F90:
276 call icepack_ice_strength (ncat, &

I have a little time this afternoon to dig deeper, just say where. Here's the full list:
[eclare@wf-fe1 base_suite.t10]$ results.csh
PASS wolf_smoke_gx3_8x2_diag1_run5day build
FAIL wolf_smoke_gx3_8x2_diag1_run5day run
PASS wolf_smoke_gx3_8x2_diag24_run1year_medium build
FAIL wolf_smoke_gx3_8x2_diag24_run1year_medium run
PASS wolf_smoke_gx3_4x1_debug_diag1_run5day build
PASS wolf_smoke_gx3_4x1_debug_diag1_run5day run
FAIL wolf_smoke_gx3_4x1_debug_diag1_run5day generate baseline-already-exists
PASS wolf_smoke_gx3_8x2_debug_diag1_run5day build
FAIL wolf_smoke_gx3_8x2_debug_diag1_run5day run
PASS wolf_smoke_gx3_4x2_diag1_run5day build
FAIL wolf_smoke_gx3_4x2_diag1_run5day run
PASS wolf_smoke_gx3_4x1_diag1_run5day_thread build
FAIL wolf_smoke_gx3_4x1_diag1_run5day_thread run
PASS wolf_smoke_gx3_4x1_diag1_run5day build
PASS wolf_smoke_gx3_4x1_diag1_run5day run
FAIL wolf_smoke_gx3_4x1_diag1_run5day generate baseline-already-exists
PASS wolf_restart_gx3_8x1 build
PASS wolf_restart_gx3_8x1 run-initial
PASS wolf_restart_gx3_8x1 run-restart
PASS wolf_restart_gx3_8x1 exact-restart
FAIL wolf_restart_gx3_8x1 generate baseline-already-exists
PASS wolf_restart_gx3_4x2_debug build
FAIL wolf_restart_gx3_4x2_debug run-initial

14 of 23 tests PASSED
9 of 23 tests FAILED

@apcraig
Copy link
Contributor Author

apcraig commented Aug 4, 2017

I just updated the scripts and README.test. That should be consistent now. All of the tests pass on cheyenne. The tests that are failing on wolf are using threads. The all MPI tests on wolf are passing. My guess is that the modules, batch settings, or compile settings on wolf are not set properly for threading. Maybe you haven't tried that yet?

@eclare108213
Copy link
Contributor

That's probably true -- I usually don't mess with threads. I'll look into it.

@mattdturner
Copy link
Contributor

Running the base_suite on Conrad, all tests pass except conrad_smoke_gx3_4x1_diag1_run5day_thread, which fails to run. The runlog shows:

Rank 0 [Mon Aug  7 15:24:19 2017] [c2-0c1s2n2] Fatal error in PMPI_Bcast: Invalid root, error stack:
PMPI_Bcast(1614): MPI_Bcast(buf=0x7fffffff4640, count=1, dtype=0x4c000829, root=-999, comm=0x84000004) failed
PMPI_Bcast(1576): Invalid root (value given was -999)

forrtl: error (76): Abort trap signal
Image              PC                Routine            Line        Source
cice               0000000000787396  Unknown               Unknown  Unknown
cice               0000000000769629  Unknown               Unknown  Unknown
cice               000000000075F03A  Unknown               Unknown  Unknown
cice               0000000000456F96  ice_broadcast_mp_          81  ice_broadcast.F90
cice               00000000004670DF  ice_diagnostics_m         759  ice_diagnostics.F90
cice               0000000000402DBD  cice_runmod_mp_ci          63  CICE_RunMod.F90
cice               000000000040064B  MAIN__                     50  CICE.F90
cice               00000000004005DE  Unknown               Unknown  Unknown
cice               0000000000DA7991  Unknown               Unknown  Unknown
cice               00000000004004B9  Unknown               Unknown  Unknown

The log also shows:

  Find indices of diagnostic points

 found point   1
   lat    lon   TLAT   TLON     i     j   block  task
  90.0    0.0 -999.0 -999.0     0     0     0  -999

 found point   2
   lat    lon   TLAT   TLON     i     j   block  task
 -65.0  -45.0  -64.2  -45.0    25    10     2     1

So, for some reason when CICE_THREADED is set to true, it fails to find one of the diagnostic points

@apcraig
Copy link
Contributor Author

apcraig commented Aug 7, 2017

Matt, Are the tests with mpi/openmp working on conrad? If so, the all MPI compile with treading explicitly on in the compile should also work in my experience. I can login to conrad if you want me to have a look as well. I don't think this means the current scripts implementation is bad, I think this means our test suite is being extended and we're catching new problems in the code/compile/etc that we need to fix. That's a good thing. I think we'll want to rapidly extend the test suite in the next few weeks more which I'm sure will lead to more work. It took me a while to get bit-for-bit restarts with different pe counts and decomps due to values over "land", but now that seems to be fixed. This is all good progress.

@mattdturner
Copy link
Contributor

Tony,

Are the tests with mpi/openmp working on conrad? If so, the all MPI compile with treading explicitly on in the compile should also work in my experience.

Yes, the hybrid MPI+OpenMP tests are working on Conrad:

PASS conrad_smoke_gx3_8x2_diag1_run5day build
PASS conrad_smoke_gx3_8x2_diag1_run5day run
PASS conrad_smoke_gx3_8x2_diag24_run1year_medium build
PASS conrad_smoke_gx3_4x1_debug_diag1_run5day build
PASS conrad_smoke_gx3_4x1_debug_diag1_run5day run
PASS conrad_smoke_gx3_8x2_debug_diag1_run5day build
PASS conrad_smoke_gx3_8x2_debug_diag1_run5day run
PASS conrad_smoke_gx3_4x2_diag1_run5day build
PASS conrad_smoke_gx3_4x2_diag1_run5day run
PASS conrad_smoke_gx3_4x2_diag1_run5day bfbcomp conrad_smoke_gx3_8x2_diag1_run5day.t00

I agree that the error I am getting points to an issue in the code/compile and not in the testing scripts. Perhaps I should submit this as an issue since it shouldn't delay/prevent the pull request from being merged.

@apcraig
Copy link
Contributor Author

apcraig commented Aug 7, 2017

@mattdturner. That makes sense to me. The issue should be specific, "single threaded runs on conrad failing" or something like that. As we extend the testing, these kinds of issues are probably going to become a good portion of our efforts at getting the model more robust. thanks.

Copy link
Contributor

@eclare108213 eclare108213 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worked through the test examples in README.test, and they were all successful on wolf using the updated Macro file (https://github.com/CICE-Consortium/CICE/pull/18/commits). I like the test suite, including plots of the output! As far as I'm concerned, this is ready to merge.

@apcraig apcraig merged commit 928a98c into CICE-Consortium:master Aug 7, 2017
JFLemieux73 pushed a commit to JFLemieux73/CICE that referenced this pull request Nov 17, 2021
@apcraig apcraig deleted the cicetestA branch August 17, 2022 20:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants