
VP solver robustness issues ("bad departure points") (was: Verify b4b-ness of different MPI decompositions for the VP solver / performance evaluation of repro-vp branch) #40

Open
phil-blain opened this issue May 16, 2022 · 41 comments

@phil-blain
Owner

Tried a new suite with dynpicard, no OpenMP

# Test         Grid    PEs        Sets    BFB-compare
smoke          gx3     1x1        diag1,dynpicard
sleep 180
smoke          gx3     8x1        diag1,dynpicard   smoke_gx3_1x1_diag1_dynpicard
smoke          gx3     2x1        diag1,dynpicard   smoke_gx3_1x1_diag1_dynpicard
smoke          gx3     4x1        diag1,dynpicard   smoke_gx3_1x1_diag1_dynpicard

None of the three MPI cases is b4b with the 1x1 case.

@phil-blain
Owner Author

phil-blain commented May 16, 2022

Tried setting b4bflag=reprosum in the 2x1 case. Still not b4b with the 1x1 case.

@phil-blain
Owner Author

Tried with maxits_nonlin=500. Still not b4b with the 1x1 case.

@phil-blain
Owner Author

OK, comparing the 1x1 results against {2x1, 4x1, 8x1}, there are differences, and they are bigger for 4x1 and 8x1.

  • try with more nonlinear iterations
  • try disabling advection and thermo

@phil-blain
Owner Author

With 5000 nonlinear iterations, the differences are in the same range for 2, 4, 8 procs (vs 1 proc), i.e. [-1E-6, 1E-6].

@phil-blain
Owner Author

Note: this test runs for 1 day, and over these 24 time steps only 4 need more than 500 iterations to reach reltol_nonlin=1E-8 (the namelist default). With maxits_nonlin=5000, we still never exceed 1398 iterations.

@phil-blain
Owner Author

With reltol_nonlin=1E-12 and maxits_nonlin=5000, the difference vs 1 proc is more or less the same for 2, 4, 8 procs, and looks something like this:

image

Interestingly:

  • there are more differences in the Southern Hemisphere (likely because the concentration is very close to 1 everywhere in the Northern Hemisphere)
  • the differences are not only near the coast, and do not appear to be confined to the ice edge either

Note that we do not come close to reaching 5000 iterations with this value of reltol_nonlin.

@phil-blain
Owner Author

To get the number of iterations to reach the required tolerance:

\grep Arctic -B1 path_to_case_directory/logs/cice.runlog.220606-19* |\grep monitor

this is with diagfreq = 1 and monitor_nonlin=.true.

@phil-blain
Owner Author

phil-blain commented Jun 10, 2022

  • If we disable thermo, ridging and transport, then with the same settings as above (reltol_nonlin=1E-12), we get differences in the range [-1E-16, 1E-17] for uvel, vvel, which is much more in line with what is expected. aice and hi are b4b.
  • If we only disable thermo (same settings again), then we get slightly larger differences in the velocity components:
    -1 :       Date     Time   Level Gridsize    Miss :     Minimum        Mean     Maximum : Parameter name
     9 : 2005-01-02 00:00:00       0    11600    3594 : -3.7470e-16  8.6176e-20  3.4478e-16 : aice     
    11 : 2005-01-02 00:00:00       0    11600    3594 : -4.4409e-16  7.4066e-20  4.4409e-16 : hi   
    13 : 2005-01-02 00:00:00       0    11600    3594 : -5.2082e-17  8.6478e-17  6.9237e-13 : uvel          
    14 : 2005-01-02 00:00:00       0    11600    3594 : -5.5376e-13 -6.9204e-17  1.0734e-17 : vvel    

and now aice and hi start to differ from b4b, but the differences are still small.

@phil-blain
Owner Author

  • try the diag (and maybe ident) precond
  • do a long (5 years) gx1 run with different MPI decomp and compare the thickness field at the end (mid-Jan).

@phil-blain
Owner Author

With precond=diag, no thermo, no transport:

    -1 :       Date     Time   Level Gridsize    Miss :     Minimum        Mean     Maximum : Parameter name    
    13 : 2005-01-02 00:00:00       0    11600    3594 : -5.6379e-18  6.6466e-20  3.9514e-16 : uvel          
    14 : 2005-01-02 00:00:00       0    11600    3594 : -2.1605e-16 -5.4175e-21  2.0172e-16 : vvel 

@phil-blain
Owner Author

With precond='diag', with thermo, with transport, we get a model abort:

istep1:         2    idate:  20050101    sec:      7200
 (JRA55_data) reading forcing file 1st ts = /home/ords/cmdd/cmde/sice500//CICE_data/forcing/gx3/JRA55/8XDAILY/JRA55_gx3_03hr_forcing_2005.nc
Rank 2 [Mon Jun 13 21:08:20 2022] [c0-0c0s9n1] application called MPI_Abort(MPI_COMM_WORLD, 128) - process 2
    (icepack_warnings_setabort) T :file icepack_itd.F90 :line          900
 (cleanup_itd) aggregate ice area out of bounds
  (cleanup_itd)aice:   1.00245455360081
  (cleanup_itd)n, aicen:           1  0.676531837003209
  (cleanup_itd)n, aicen:           2  0.224493247425031
  (cleanup_itd)n, aicen:           3  4.769818624818129E-002
  (cleanup_itd)n, aicen:           4  3.766552529401467E-002
  (cleanup_itd)n, aicen:           5  1.606575763037559E-002
 (icepack_warnings_aborted) ... (icepack_step_therm2)

Weird, as I did this test two years ago (#33 (comment)), although with reltol_nonlin=1E-8...

@JFLemieux73 I'm keeping a record of my MPI experiments in this issue if you want to stay in the loop.

@phil-blain
Owner Author

phil-blain commented Jun 14, 2022

OK, it's because I forgot to also re-enable ridging; the model did not like that (is that expected?...)

EDIT after discussing with JF, yes it is expected, convergence can cause that.

@phil-blain
Owner Author

phil-blain commented Jun 14, 2022

OK, with ridging, advection, and transport, reltol_nonlin=1E-12, precond='diag':

    -1 :       Date     Time   Level Gridsize    Miss :     Minimum        Mean     Maximum : Parameter name
     9 : 2005-01-02 00:00:00       0    11600    3594 : -3.9378e-06 -4.8300e-10  4.4782e-06 : aice          
    11 : 2005-01-02 00:00:00       0    11600    3594 : -1.9175e-06 -7.3383e-11  4.1247e-06 : hi            
    13 : 2005-01-02 00:00:00       0    11600    3594 : -2.8908e-07  7.0716e-09  3.3297e-05 : uvel          
    14 : 2005-01-02 00:00:00       0    11600    3594 : -5.4352e-07  6.0686e-09  2.5828e-05 : vvel  

@phil-blain
Owner Author

Same with precond='pgmres':

phb001@xc4elogin1(daley): [17:06:02] $ cdo infov diff.nc 2>/dev/null | \grep -E 'name|aice|hi|vel'
    -1 :       Date     Time   Level Gridsize    Miss :     Minimum        Mean     Maximum : Parameter name
     9 : 2005-01-02 00:00:00       0    11600    3594 : -3.8860e-06  3.3100e-09  2.7253e-05 : aice          
    11 : 2005-01-02 00:00:00       0    11600    3594 : -9.5537e-06 -1.0926e-09  2.0109e-06 : hi            
    13 : 2005-01-02 00:00:00       0    11600    3594 : -4.9737e-07  5.9547e-10  2.8301e-06 : uvel          
    14 : 2005-01-02 00:00:00       0    11600    3594 : -2.5361e-06 -5.8569e-10  1.2014e-06 : vvel  

I don't think it's only the preconditioner, since these results are similar to those with the 'diag' precond.

@phil-blain
Owner Author

So I did side-by-side, step-by-step debugging of 1x1 vs 2x1. The values are the same on both sides until the first normalization in the FGMRES algorithm. Since we do a global sum of different numbers, in a different order (numbers that mathematically sum to the same result on all decompositions), floating-point arithmetic gives us a slightly different norm of the residual, and we then propagate that difference to the whole vector by normalizing.

So in the end it is not surprising that we get different results. We will run a QC test of different decompositions against each other to ensure we get the same climate.
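The effect is easy to reproduce outside the model. A minimal Python sketch (not CICE code), mimicking how different decompositions combine the same local contributions in a different order:

```python
# Floating-point addition is not associative: summing the same numbers
# in a different order (as different MPI decompositions do during a
# global sum) can change the last bits of the result.
a, b, c = 0.1, 0.2, 0.3

serial   = (a + b) + c   # e.g. the 1x1 summation order
parallel = a + (b + c)   # e.g. partial sums combined in another order

print(serial == parallel)   # False
print(serial, parallel)     # 0.6000000000000001 0.6
```

Once the residual norm differs in the last bit, normalizing spreads that difference over the whole solution vector, so divergence between decompositions follows.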

@phil-blain
Owner Author

phil-blain commented Jun 23, 2022

Mixed results:

80x1 vs 40x1:

INFO:__main__:Running QC test on the following directories:
INFO:__main__:  /home/phb001/data/ppp6/cice/runs/ppp6_intel_smoke_gx1_40x1_medium_qc.qc_40/
INFO:__main__:  /home/phb001/data/ppp6/cice/runs/ppp6_intel_smoke_gx1_80x1_medium_qc.qc_80/
INFO:__main__:Number of files: 1825
INFO:__main__:2 Stage Test Passed
INFO:__main__:Quadratic Skill Test Passed for Northern Hemisphere
INFO:__main__:Quadratic Skill Test Passed for Southern Hemisphere
INFO:__main__:Creating map of the data (ice_thickness_ppp6_intel_smoke_gx1_40x1_medium_qc.qc_40.png)
INFO:__main__:Creating map of the data (ice_thickness_ppp6_intel_smoke_gx1_80x1_medium_qc.qc_80.png)
INFO:__main__:Creating map of the data (ice_thickness_ppp6_intel_smoke_gx1_40x1_medium_qc.qc_40_minus_ppp6_intel_smoke_gx1_80x1_medium_qc.qc_80.png)
INFO:__main__:
INFO:__main__:Quality Control Test PASSED

40x1 vs 24x1:

INFO:__main__:Running QC test on the following directories:
INFO:__main__:  /home/phb001/data/ppp6/cice/runs/ppp6_intel_smoke_gx1_40x1_medium_qc.qc_40/
INFO:__main__:  /home/phb001/data/ppp6/cice/runs/ppp6_intel_smoke_gx1_24x1_medium_qc.qc_24/
INFO:__main__:Number of files: 1825
INFO:__main__:2 Stage Test Passed
INFO:__main__:Quadratic Skill Test Passed for Northern Hemisphere
INFO:__main__:Quadratic Skill Test Failed for Southern Hemisphere
INFO:__main__:Creating map of the data (ice_thickness_ppp6_intel_smoke_gx1_40x1_medium_qc.qc_40.png)
INFO:__main__:Creating map of the data (ice_thickness_ppp6_intel_smoke_gx1_24x1_medium_qc.qc_24.png)
INFO:__main__:Creating map of the data (ice_thickness_ppp6_intel_smoke_gx1_40x1_medium_qc.qc_40_minus_ppp6_intel_smoke_gx1_24x1_medium_qc.qc_24.png)
INFO:__main__:
ERROR:__main__:Quality Control Test FAILED

80x1 vs 24x1:

INFO:__main__:Running QC test on the following directories:
INFO:__main__:  /home/phb001/data/ppp6/cice/runs/ppp6_intel_smoke_gx1_80x1_medium_qc.qc_80/
INFO:__main__:  /home/phb001/data/ppp6/cice/runs/ppp6_intel_smoke_gx1_24x1_medium_qc.qc_24/
INFO:__main__:Number of files: 1825
INFO:__main__:2 Stage Test Passed
INFO:__main__:Quadratic Skill Test Passed for Northern Hemisphere
INFO:__main__:Quadratic Skill Test Failed for Southern Hemisphere
INFO:__main__:Creating map of the data (ice_thickness_ppp6_intel_smoke_gx1_80x1_medium_qc.qc_80.png)
INFO:__main__:Creating map of the data (ice_thickness_ppp6_intel_smoke_gx1_24x1_medium_qc.qc_24.png)
INFO:__main__:Creating map of the data (ice_thickness_ppp6_intel_smoke_gx1_80x1_medium_qc.qc_80_minus_ppp6_intel_smoke_gx1_24x1_medium_qc.qc_24.png)
INFO:__main__:
ERROR:__main__:Quality Control Test FAILED

@phil-blain
Owner Author

OK, that was a false alarm: the 24x1 run hit walltime and was killed, but the history folder contained outputs from an older run done with EVP, so the quality control script was comparing results from two different runs (and that, fortunately, failed!).

I re-ran the 24x1 case with a longer walltime and both comparisons (against 40x1 and 80x1) now pass.

@phil-blain
Owner Author

phil-blain commented Jul 5, 2022

I put some more thought into the problem of reproducibility for the global sums, after a comment by @dupontf regarding performing the global sum using quadruple precision.

It turns out we already have that capability in CICE, and also even better algorithms: https://cice-consortium-cice.readthedocs.io/en/master/developer_guide/dg_other.html?highlight=reprosum#reproducible-sums

I looked more closely at the code and realized I could leverage this capability with only slight modifications. With these modifications done, running 1x1 and 2x1 side by side, I can verify that the global sums done in the dynamics solver are the same on both sides, at least for this configuration:

  • maxits_nonlin = 4
  • maxits_fgmres = 1
  • precond = diag.
  • history_precision = 8
  • npt = 1

With these settings the restarts are b4b!

Note that I also had to add dump_last = .true. in the namelist for the code to create a restart at the end of the run; otherwise it defaults to dump_freq = 1d, and the scripts would use older restarts from a previous run, done before I changed npt to do a single time step.
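Conceptually, the reproducible sums work by accumulating with extra precision so the result no longer depends on summation order. As a simple illustration of the idea only (CICE's actual bfbflag=reprosum algorithm is a different, fixed-point scheme), compensated (Kahan) summation:

```python
def kahan_sum(values):
    """Compensated (Kahan) summation: carry a running correction for the
    low-order bits lost at each addition, so the result is far less
    sensitive to rounding than a naive left-to-right sum."""
    total = 0.0
    compensation = 0.0
    for x in values:
        y = x - compensation            # re-inject the error from last step
        t = total + y                   # low-order bits of y may be lost here...
        compensation = (t - total) - y  # ...recover them for the next step
        total = t
    return total

vals = [0.1] * 10
print(sum(vals))        # 0.9999999999999999 (naive sum)
print(kahan_sum(vals))  # 1.0
```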

@phil-blain
Owner Author

Also passes (b4b) after 1 day (24 time steps)

@phil-blain
Owner Author

And as expected, with precond=pgmres it still fails as we skip some halo updates.

@phil-blain
Owner Author

And it passes with precond=pgmres if we add back those halo updates.

@phil-blain
Owner Author

phil-blain commented Jul 14, 2022

So in preparation of a PR with these changes (CICE-Consortium/CICE@main...phil-blain:CICE:repro-vp), I'm noticing the new code is noticeably slower than the old.

EDIT: original version is https://github.com/phil-blain/CICE/commits/repro-vp@%7B2022-07-14%7D

This is a little bit surprising ...

  • Old code took 3.5 hours to do the 5 years of the QC simulation
  • New code took 4 hours to do 3.5 years.

Note that this is without bfbflag... so the differences are:

  • twice the number of global sums (X + Y)
  • more global communications, because the global sum moved inside the first CGS loop
  • global_sum_prod loops over the whole arrays, not just ice points as calc_L2norm_squared was doing (loop over icellu)
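The last point can be sketched in plain Python (names and the 10% ice fraction are made up for illustration): the old calc_L2norm_squared-style loop visits only the packed list of ice points, while a whole-array product-sum visits every grid cell, ice or not, for the same result.

```python
nx, ny = 100, 100

# velocity field: zero except at the ~10% of cells that contain ice
u = [[0.0] * ny for _ in range(nx)]
ice_points = [(i, j) for i in range(nx) for j in range(ny) if (i * ny + j) % 10 == 0]
for i, j in ice_points:
    u[i][j] = 0.1

# old style: local L2 norm over the ice points only (~1 000 multiplies)
l2_ice = sum(u[i][j] * u[i][j] for i, j in ice_points)

# new style: product-sum over the whole array (~10 000 multiplies);
# the zeros contribute nothing, but every element is still touched
l2_full = sum(u[i][j] * u[i][j] for i in range(nx) for j in range(ny))

print(l2_full == l2_ice)  # True: same answer, roughly 10x the local work here
```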

@phil-blain
Owner Author

OK, so I played with Intel Trace Analyzer and Collector (ITAC) and Intel VTune, for both versions (old and new) of the code, following this tutorial:
https://www.intel.com/content/www/us/en/develop/documentation/itac-vtune-mpi-openmp-tutorial-lin/top.html

First, running Application Performance Snapshot reveals both versions are MPI bound, and have very poor vectorization (note that both runs are 40x1):

old
image

new

image

This reveals, however, that it's not only the added communications that slow down the new code, since "MPI time" is 31%, vs. 43% for the old code.

I then ran the VTune "HPC Performance Characterization" analysis for both versions and used the "Compare" feature. This is a ranking of the hotspots by time difference between the new and old versions (right column, CPU Time: Difference), with the corresponding timings for those functions in the new code (CPU Time: mod-vtune-g):

image

  • I confirmed by running under GDB (mpirun -gdb) that MPIDI_SHMGR_release_generic is called (amongst other MPI subroutines) by MPI_ALLREDUCE. So in the VP solver, it is only called by global_sum_prod to actually perform the MPI reduction. Notice that the time difference for that function is almost half of the number for the new code, which makes sense since the new code has approx. twice the number of calls to MPI_ALLREDUCE of the old code (the new code does one global sum for the X components and another for the Y components). I write "approx." because there are also more calls due to the modified CGS loop. My analysis of these timings is that this modification to the CGS algorithm does not play a big part in the additional time (since the new code spends almost twice as much time in this function, but not much more than twice).

  • The new code spends a lot of time actually computing the local reductions (functions global_sum_prod_dbl and compute_sums_dbl).

@phil-blain
Owner Author

phil-blain commented Aug 25, 2022

Getting back to the performance regression after finally getting rid of all the bugs (famous last words) in my new code (see #39 (comment) and following comments).

I re-ran the QC test cases on main (007fbff) and the current tip of my repro-vp branch (579e19f), both 80x1 so using all cores of a single node (i.e. twice the number in my previous test).

  • Old code took 2:25 to simulate the 5 years
  • New code took 2:40 to simulate 3.8 years .... and then died with "bad departure points" on 2008-11-14 :'(

The listing shows that at least uvel is unrealistically large:

 Warning: Departure points out of bounds in remap
 my_task, i, j =          43           8          17
 dpx, dpy =  -45563.7247538909        12271.9759813932
 HTN(i,j), HTN(i+1,j) =   33338.1913820475        33168.9296994831
 HTE(i,j), HTE(i,j+1) =   47781.5593319368        47977.5239199294
 (print_state) bad departure points
 (print_state) istep1, my_task, i, j, iblk:       33867          43           8          17          11
 (print_state) Global block:         884
 (print_state) Global i and j:          31         368
 (print_state) Lat, Lon (degrees):   67.5273801992820       -16.4194498698244

 aice   9.070213892498866E-006
 aice0  0.999990929786107
...
 uvel(i,j)   12.6565902094141
 vvel(i,j)  -3.40888221705366

 atm states and fluxes
             uatm    =  -0.848320343386380
             vatm    =    2.65704819499085
             potT    =    271.607269287109
             Tair    =    271.607269287109
             Qa      =   2.464670687913895E-003
             rhoa    =    1.30000000000000
             swvdr   =   0.000000000000000E+000
             swvdf   =   0.000000000000000E+000
             swidr   =   0.000000000000000E+000
             swidf   =   0.000000000000000E+000
             flw     =    258.931945800781
             frain   =   0.000000000000000E+000
             fsnow   =   4.522630479186773E-005

 ocn states and fluxes
             frzmlt  =   -1000.00000000000
             sst     =    1.18273509903225
             sss     =    34.0000000000000
             Tf      =   -1.90458264992426
             uocn    =   0.000000000000000E+000
             vocn    =   0.000000000000000E+000
             strtltxU=   0.000000000000000E+000
             strtltyU=   0.000000000000000E+000

 srf states and fluxes
             Tref    =   2.460446333168104E-003
             Qref    =   2.265709251383878E-008
             Uref    =   1.428371434883044E-005
             fsens   =   6.097383853908757E-005
             flat    =  -1.995661045829828E-005
             evap    =  -7.027261700667970E-012
             flwout  =  -2.690714878249707E-003


 (abort_ice)ABORTED:
 (abort_ice) error = (diagnostic_abort)ERROR: bad departure points
Abort(128) on node 43 (rank 43 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 43

EDIT I re-ran this from the restart of 2008-01-01, tweaking the namelist so restarts are written every month. It failed at the same date in the same way. I re-ran it from the restart of 2008-11-01; it again failed at the same date in the same way. I set diagfreq to 1 to check at which time step it aborts: it is the 3rd time step of the day.

I changed maxits_nonlin to 6, and this allowed the run to continue without aborting...

@phil-blain
Owner Author

phil-blain commented Aug 26, 2022

I checked back in my case directory for my earlier long run (ppp6_intel_smoke_gx1_40x1_dynpicard_medium_qc.40_repro, #40 (comment)) and it turns out I had re-run it with a longer walltime after it hit walltime the first time I ran it.

This second time, I also got "bad departure points", at the exact same location (iglob, jglob = 31, 368), but on 2006-11-13 instead of 2008-11-13 (!!!) Soooo weird.

 Finished writing ./history/iceh_inst.2006-11-13-00000.nc

 Warning: Departure points out of bounds in remap
 my_task, i, j =          21          16           9
 dpx, dpy =  -34861.6880018654       -10107.4117000807
 HTN(i,j), HTN(i+1,j) =   33338.1913820475        33168.9296994831
 HTE(i,j), HTE(i,j+1) =   47781.5593319368        47977.5239199294
 istep1, my_task, iblk =       16347          21           8
 Global block:         302
 Global i and j:          31         368

 (abort_ice)ABORTED:
 (abort_ice) error = (horizontal_remap)ERROR: bad departure points
Abort(128) on node 21 (rank 21 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 21

note that the local indices are different since that run was 40x1...

EDIT this is reassuring in a way, because it means:

  1. my (buggy) bugfixes to the new code since then have not influenced this abort
  2. my recent rebase of my repro-vp branch onto the latest main did not either

@phil-blain
Owner Author

This second time, I also got "bad departure points", at the exact same location (iglob, jglob = 31, 368), but on 2006-11-13 instead of 2008-11-13 (!!!) Soooo weird.

Thinking about it, this probably means that it's something in the forcing, as the QC test cycles the 2005 forcing:

$ \grep -E 'fyear_init|ycycle' configuration/scripts/options/set_nml.qc
fyear_init     = 2005
ycycle         = 1
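The cycling can be illustrated with a small Python helper (not CICE code; the name forcing_year and the modulo formula are my sketch of the standard cycling semantics): with fyear_init=2005 and ycycle=1, every simulated year maps back to the 2005 forcing, so 2006-11-13 and 2008-11-13 read the same forcing records.

```python
def forcing_year(sim_year, fyear_init=2005, ycycle=1):
    """Map a simulated year onto the cycled forcing years
    [fyear_init, fyear_init + ycycle - 1]."""
    return fyear_init + (sim_year - fyear_init) % ycycle

print(forcing_year(2006), forcing_year(2008))  # 2005 2005
```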

@phil-blain
Owner Author

In DDT, the range of uvel and vvel (min/max as found by the "Statistics" tab of the Multidimensional array viewer) at the start of the nonlinear iterations is very reasonable:

  • [-0.41, 0.37] for uvel
  • [-0.32, 0.32] for vvel

But the ranges of bx and by (the RHS) are quite different:

  • [-13.17, 12.22] for bx
  • [-1.10, 5.71] for by

@phil-blain
Owner Author

phil-blain commented Sep 28, 2022

Putting that problem aside for now, I re-ran the HPC Performance Characterization analysis after refactoring the new code to use a single call to MPI_ALLREDUCE instead of two each time (stacking the X and Y components before doing the global sum).

Unfortunately:

  1. it does not seem to have reduced the time spent in MPIDI_SHMGR_release_generic by much
  2. stacking the fields takes so much time that the total running time is longer with this new version...
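The refactor can be sketched without real MPI (allreduce_sum below is a stand-in for MPI_ALLREDUCE with MPI_SUM; all names are made up): instead of one reduction per velocity component, pack the X and Y partial sums into one buffer and reduce once.

```python
def allreduce_sum(local_buffers):
    """Stand-in for MPI_ALLREDUCE(..., MPI_SUM): element-wise sum of each
    rank's local buffer."""
    return [sum(vals) for vals in zip(*local_buffers)]

# per-rank local partial sums for the X and Y components (3 "ranks")
local_x = [1.5, 2.5, 3.0]
local_y = [0.5, 1.0, 1.5]

# two reductions, one per component ...
two_calls = (allreduce_sum([[x] for x in local_x])[0],
             allreduce_sum([[y] for y in local_y])[0])

# ... versus one reduction on the stacked [x, y] pairs
one_call = allreduce_sum([[x, y] for x, y in zip(local_x, local_y)])

print(two_calls, one_call)  # (7.0, 3.0) [7.0, 3.0]
```

The trade-off observed above: half the reduction calls (less latency), but extra time spent packing the fields into the stacked buffer.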

I'm not able to upload a screenshot to GitHub right now for some reason, I'll try again tomorrow.

EDIT here is what I wanted to show:

sorted by CPU time for the newest code ("repro-pairs", difference on the right):
vtune-280922
sorted by CPU time for the new code ("repro-no-pairs"):
image

@phil-blain
Owner Author

OK, I refactored again following comments in CICE-Consortium#763. New timings are very encouraging: no change in non-bfbflag mode, and a small slowdown with bfbflag.

I've re-run the QC test with this new version of the code (CICE-Consortium/CICE@main...phil-blain:CICE:repro-vp), in two modes:

  • one with bfbflag=off (the default), so the computations are the same as before (global_sum of local scalars): vp-repro-v3/ppp6_intel_smoke_gx1_80x1_dynpicard_medium_qc.test.221006-115019/
  • one with bfbflag=lsum8, so computations go through the new code path (global_sum of an array), but with the default way of computing the local reduction in compute_sums_dbl: vp-repro-v3/ppp6_intel_smoke_gx1_80x1_dynpicard_medium_qc_reprosum.test.lsum8.221006-115329.

ppp6_intel_smoke_gx1_80x1_dynpicard_medium_qc.test.221006-115019

"bad departure points" on 2006-04-15:

 Finished writing ./history/iceh_inst.2006-04-14-00000.nc
  
istep1:     11256    idate:  20060415    sec:         0
  
 Warning: Departure points out of bounds in remap
 my_task, i, j =          33           2           7
 dpx, dpy =  -89431.4326348548        47176.1296418077     
 HTN(i,j), HTN(i+1,j) =   47391.9657405840        47391.9657405840     
 HTE(i,j), HTE(i,j+1) =   59395.4550164216        59395.4550164216     
 (print_state) bad departure points
 (print_state) istep1, my_task, i, j, iblk:       11256          33           2           7           2
 (print_state) Global block:          74
 (print_state) Global i and j:         265          22
 (print_state) Lat, Lon (degrees):  -68.0019337826775       -102.437492832671     
  
 aice   1.863136163916390E-003
 aice0  0.998136863836084     
  
 n =           1
 aicen  4.050043477947559E-004
 vicen  1.074026607958722E-004
 vsnon  2.152998460453039E-005
 hin  0.265188908170192     
 hsn  5.315988512656941E-002
 Tsfcn  -7.05308600110335     
  
  
 n =           2
 aicen  3.907395258195736E-004
 vicen  3.942037081099123E-004
 vsnon  1.085370481838984E-004
 hin   1.00886570736112     
 hsn  0.277773404050288     
 Tsfcn  -7.46888936706571     
  
  
 n =           3
 aicen  3.753718609340793E-004
 vicen  7.077149155167459E-004
 vsnon  1.706338480023648E-004
 hin   1.88537018666146     
 hsn  0.454572827003489     
 Tsfcn  -7.52744021340778     
  
  
 n =           4
 aicen  5.148500074552784E-004
 vicen  1.765957115592584E-003
 vsnon  2.497219361276893E-004
 hin   3.43004193458418     
 hsn  0.485038229603951     
 Tsfcn  -7.53895640784653     
  
  
 n =           5
 aicen  1.771704219127033E-004
 vicen  9.175253715555438E-004
 vsnon  9.635301563084063E-005
 hin   5.17877285412592     
 hsn  0.543843687849411     
 Tsfcn  -7.55704861652791     
  
 qice, cat            1  layer            1  -144191293.640763     
 qi/rhoi  -157242.414003013     
 qice, cat            1  layer            2  -159129292.410324     
 qi/rhoi  -173532.488997082     
 qice, cat            1  layer            3  -171829331.086202     
 qi/rhoi  -187382.040442969     
 qice, cat            1  layer            4  -182841842.222701     
 qi/rhoi  -199391.321944058     
 qice, cat            1  layer            5  -192530961.586381     
 qi/rhoi  -209957.428120372     
 qice, cat            1  layer            6  -202798490.095634     
 qi/rhoi  -221154.296723701     
 qice, cat            1  layer            7  -213716787.463419     
 qi/rhoi  -233060.836928483     
  
 qice, cat            2  layer            1  -199779132.758855     
 qi/rhoi  -217861.649682503     
 qice, cat            2  layer            2  -231920434.545279     
 qi/rhoi  -252912.142361264     
 qice, cat            2  layer            3  -245129831.029287     
 qi/rhoi  -267317.154884718     
 qice, cat            2  layer            4  -251034694.054212     
 qi/rhoi  -273756.482065662     
 qice, cat            2  layer            5  -255262158.260578     
 qi/rhoi  -278366.584798886     
 qice, cat            2  layer            6  -261939135.334309     
 qi/rhoi  -285647.912033052     
 qice, cat            2  layer            7  -280055795.066004     
 qi/rhoi  -305404.356669579     
  
 qice, cat            3  layer            1  -277973318.988761     
 qi/rhoi  -303133.390391233     
 qice, cat            3  layer            2  -269659547.844703     
 qi/rhoi  -294067.118696514     
 qice, cat            3  layer            3  -265950400.949903     
 qi/rhoi  -290022.247491715     
 qice, cat            3  layer            4  -264134014.633383     
 qi/rhoi  -288041.455434442     
 qice, cat            3  layer            5  -263407467.771512     
 qi/rhoi  -287249.146970024     
 qice, cat            3  layer            6  -264616132.339291     
 qi/rhoi  -288567.210838922     
 qice, cat            3  layer            7  -281653330.628534     
 qi/rhoi  -307146.489235042     
  
 qice, cat            4  layer            1  -274022546.007721     
 qi/rhoi  -298825.022909184     
 qice, cat            4  layer            2  -265900352.935201     
 qi/rhoi  -289967.669504036     
 qice, cat            4  layer            3  -263428092.060393     
 qi/rhoi  -287271.638015696     
 qice, cat            4  layer            4  -262729556.198319     
 qi/rhoi  -286509.875897840     
 qice, cat            4  layer            5  -262776163.281876     
 qi/rhoi  -286560.701506953     
 qice, cat            4  layer            6  -263342147.175282     
 qi/rhoi  -287177.914040656     
 qice, cat            4  layer            7  -274398781.855302     
 qi/rhoi  -299235.312819304     
  
 qice, cat            5  layer            1  -270752145.082871     
 qi/rhoi  -295258.609686883     
 qice, cat            5  layer            2  -263974477.967447     
 qi/rhoi  -287867.478699506     
 qice, cat            5  layer            3  -262638225.967348     
 qi/rhoi  -286410.279135604     
 qice, cat            5  layer            4  -263107334.422465     
 qi/rhoi  -286921.847788947     
 qice, cat            5  layer            5  -264012714.478509     
 qi/rhoi  -287909.176094339     
 qice, cat            5  layer            6  -264931147.943505     
 qi/rhoi  -288910.739305895     
 qice, cat            5  layer            7  -272736325.005804     
 qi/rhoi  -297422.382776231     
  
 qice(i,j)  -8608303403.09208     
  
 qsnow, cat            1  layer            1  -113513562.676045     
 qs/rhos  -343980.492957714     
 Tsnow  -4.73907547849646     
  
 qsnow, cat            2  layer            1  -112274135.626779     
 qs/rhos  -340224.653414481     
 Tsnow  -2.95567588531885     
  
 qsnow, cat            3  layer            1  -111471885.336910     
 qs/rhos  -337793.591930030     
 Tsnow  -1.80132570276835     
  
 qsnow, cat            4  layer            1  -111451023.076751     
 qs/rhos  -337730.372959851     
 Tsnow  -1.77130719840959     
  
 qsnow, cat            5  layer            1  -111398409.078330     
 qs/rhos  -337570.936600999     
 Tsnow  -1.69560142497558     
  
 qsnow(i,j)  -560109015.794814     
  
 sice, cat            1  layer            1   18.5497370693517     
 sice, cat            1  layer            2   15.9557345221535     
 sice, cat            1  layer            3   13.9033037539197     
 sice, cat            1  layer            4   12.2738437262536     
 sice, cat            1  layer            5   10.9612714615272     
 sice, cat            1  layer            6   9.75189907460159     
 sice, cat            1  layer            7   8.82710021862978     
 sice, cat            2  layer            1   9.69380979534134     
 sice, cat            2  layer            2   5.20949171453501     
 sice, cat            2  layer            3   3.28161271801342     
 sice, cat            2  layer            4   2.32638366968108     
 sice, cat            2  layer            5   1.77068917491420     
 sice, cat            2  layer            6   1.40462704605691     
 sice, cat            2  layer            7   1.15582840975086     
 sice, cat            3  layer            1  3.441128858396138E-002
 sice, cat            3  layer            2  5.583334566015622E-002
 sice, cat            3  layer            3  6.868278425595450E-002
 sice, cat            3  layer            4  8.082159565937108E-002
 sice, cat            3  layer            5  9.504227906209066E-002
 sice, cat            3  layer            6  0.112032282626814     
 sice, cat            3  layer            7  0.122182221196500     
 sice, cat            4  layer            1  3.430162176600550E-002
 sice, cat            4  layer            2  5.598699231879993E-002
 sice, cat            4  layer            3  7.265396146022764E-002
 sice, cat            4  layer            4  9.206751985323419E-002
 sice, cat            4  layer            5  0.116962492121375     
 sice, cat            4  layer            6  0.148280941700695     
 sice, cat            4  layer            7  0.174687166678514     
 sice, cat            5  layer            1  4.437160891411421E-002
 sice, cat            5  layer            2  6.754356854468051E-002
 sice, cat            5  layer            3  9.306091059025316E-002
 sice, cat            5  layer            4  0.127612334716539     
 sice, cat            5  layer            5  0.173411405898090     
 sice, cat            5  layer            6  0.232649443743668     
 sice, cat            5  layer            7  0.293799832484727     
  
 uvel(i,j)   24.8420646207930     
 vvel(i,j)  -13.1044804560577     
  
 atm states and fluxes
             uatm    =    7.83973264694214     
             vatm    =    7.14443778991699     
             potT    =    266.660705566406     
             Tair    =    266.660705566406     
             Qa      =   1.939329667948186E-003
             rhoa    =    1.30000000000000     
             swvdr   =   0.000000000000000E+000
             swvdf   =   0.000000000000000E+000
             swidr   =   0.000000000000000E+000
             swidf   =   0.000000000000000E+000
             flw     =    256.630615234375     
             frain   =   0.000000000000000E+000
             fsnow   =   1.231965143233538E-005
  
 ocn states and fluxes
             frzmlt  =   -1000.00000000000     
             sst     =  -0.563221058256973     
             sss     =    34.0000000000000     
             Tf      =   -1.90458264992426     
             uocn    =   0.000000000000000E+000
             vocn    =   0.000000000000000E+000
             strtltxU=   0.000000000000000E+000
             strtltyU=   0.000000000000000E+000
  
 srf states and fluxes
             Tref    =   0.497702610322736     
             Qref    =   3.649059471071920E-006
             Uref    =   1.948956371724535E-002
             fsens   =   3.964439903809392E-002
             flat    =  -1.349099926321760E-002
             evap    =  -4.748024256393200E-009
             flwout  =  -0.527381737217707     
  
  
 (abort_ice)ABORTED: 
 (abort_ice) error = (diagnostic_abort)ERROR: bad departure points
Abort(128) on node 33 (rank 33 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 33

ppp6_intel_smoke_gx1_80x1_dynpicard_medium_qc_reprosum.test.lsum8.221006-115329

"bad departure points" on 2008-11-13 (same date as above):

 Finished writing ./history/iceh_inst.2008-11-13-00000.nc
  
 Warning: Departure points out of bounds in remap
 my_task, i, j =          43           8          17
 dpx, dpy =  -38954.6534512499       -25171.7289148755     
 HTN(i,j), HTN(i+1,j) =   33338.1913820475        33168.9296994831     
 HTE(i,j), HTE(i,j+1) =   47781.5593319368        47977.5239199294     
 (print_state) bad departure points
 (print_state) istep1, my_task, i, j, iblk:       33867          43           8          17          11
 (print_state) Global block:         884
 (print_state) Global i and j:          31         368
 (print_state) Lat, Lon (degrees):   67.5273801992820       -16.4194498698244     
  
 aice   8.998129733063038E-006
 aice0  0.999991001870267     
  
 n =           1
 aicen  6.166408153407531E-006
 vicen  1.371671230758920E-007
 vsnon  2.051894152446559E-008
 hin  2.224424975828012E-002
 hsn  3.327535416728280E-003
 Tsfcn  -2.94978949808477     
  
  
 n =           2
 aicen  7.099941826167991E-007
 vicen  7.571732102563899E-007
 vsnon  3.729412402735414E-008
 hin   1.06644987916057     
 hsn  5.252736563263177E-002
 Tsfcn  -6.87457430330818     
  
  
 n =           3
 aicen  1.179783885046477E-006
 vicen  2.173329905687473E-006
 vsnon  7.941363015055925E-008
 hin   1.84214239000379     
 hsn  6.731201464701378E-002
 Tsfcn  -7.43936784722567     
  
  
 n =           4
 aicen  7.383969670736533E-007
 vicen  2.345299021999684E-006
 vsnon  7.287445154549425E-008
 hin   3.17620348752834     
 hsn  9.869278287301686E-002
 Tsfcn  -7.98924194641191     
  
  
 n =           5
 aicen  2.035465449185765E-007
 vicen  1.065078906998072E-006
 vsnon  3.339674217574401E-008
 hin   5.23260617085949     
 hsn  0.164074227784625     
 Tsfcn  -8.05718393169389     
  
 qice, cat            1  layer            1  -134114512.690995     
 qi/rhoi  -146253.558005447     
 qice, cat            1  layer            2  -137245811.209414     
 qi/rhoi  -149668.278309067     
 qice, cat            1  layer            3  -138074685.651941     
 qi/rhoi  -150572.176283469     
 qice, cat            1  layer            4  -138615718.832379     
 qi/rhoi  -151162.179751777     
 qice, cat            1  layer            5  -139647806.220230     
 qi/rhoi  -152287.683991527     
 qice, cat            1  layer            6  -137134604.812257     
 qi/rhoi  -149547.006338339     
 qice, cat            1  layer            7  -119866283.234378     
 qi/rhoi  -130715.685097468     
  
 qice, cat            2  layer            1  -264082182.538482     
 qi/rhoi  -287984.931884931     
 qice, cat            2  layer            2  -258605851.995290     
 qi/rhoi  -282012.924749498     
 qice, cat            2  layer            3  -257159375.800275     
 qi/rhoi  -280435.524318729     
 qice, cat            2  layer            4  -256213323.905183     
 qi/rhoi  -279403.842862795     
 qice, cat            2  layer            5  -254917292.887028     
 qi/rhoi  -277990.504784108     
 qice, cat            2  layer            6  -251719223.950059     
 qi/rhoi  -274502.970501700     
 qice, cat            2  layer            7  -241844975.175298     
 qi/rhoi  -263734.978380914     
  
 qice, cat            3  layer            1  -262910969.087461     
 qi/rhoi  -286707.708928529     
 qice, cat            3  layer            2  -259993594.924048     
 qi/rhoi  -283526.275816847     
 qice, cat            3  layer            3  -258721761.256272     
 qi/rhoi  -282139.325252205     
 qice, cat            3  layer            4  -256998556.017034     
 qi/rhoi  -280260.148328282     
 qice, cat            3  layer            5  -254228366.082832     
 qi/rhoi  -277239.221464375     
 qice, cat            3  layer            6  -248873140.956305     
 qi/rhoi  -271399.281304585     
 qice, cat            3  layer            7  -236903948.318805     
 qi/rhoi  -258346.726629013     
  
 qice, cat            4  layer            1  -260723419.435865     
 qi/rhoi  -284322.158599634     
 qice, cat            4  layer            2  -261188818.447261     
 qi/rhoi  -284829.682058082     
 qice, cat            4  layer            3  -259703731.768740     
 qi/rhoi  -283210.176410839     
 qice, cat            4  layer            4  -256438998.059903     
 qi/rhoi  -279649.943358673     
 qice, cat            4  layer            5  -250779761.834650     
 qi/rhoi  -273478.475283151     
 qice, cat            4  layer            6  -240812281.018365     
 qi/rhoi  -262608.812451870     
 qice, cat            4  layer            7  -221381004.458880     
 qi/rhoi  -241418.761678168     
  
 qice, cat            5  layer            1  -263180165.881072     
 qi/rhoi  -287001.271407931     
 qice, cat            5  layer            2  -262057641.851250     
 qi/rhoi  -285777.144875954     
 qice, cat            5  layer            3  -259642261.670439     
 qi/rhoi  -283143.142497752     
 qice, cat            5  layer            4  -255142469.069725     
 qi/rhoi  -278236.062235250     
 qice, cat            5  layer            5  -246468029.930808     
 qi/rhoi  -268776.477569038     
 qice, cat            5  layer            6  -229342590.286182     
 qi/rhoi  -250100.970868247     
 qice, cat            5  layer            7  -199301944.726038     
 qi/rhoi  -217341.270148351     
  
 qice(i,j)  -7974035103.98514     
  
 qsnow, cat            1  layer            1  -111850289.231331     
 qs/rhos  -338940.270397974     
 Tsnow  -2.34580740644548     
  
 qsnow, cat            2  layer            1  -114039075.191528     
 qs/rhos  -345572.955125843     
 Tsnow  -5.49523035415153     
  
 qsnow, cat            3  layer            1  -114430599.292543     
 qs/rhos  -346759.391795586     
 Tsnow  -6.05859059619481     
  
 qsnow, cat            4  layer            1  -114779537.984019     
 qs/rhos  -347816.781769754     
 Tsnow  -6.56067510434653     
  
 qsnow, cat            5  layer            1  -114272673.891737     
 qs/rhos  -346280.829974960     
 Tsnow  -5.83135326446358     
  
 qsnow(i,j)  -569372175.591159     
  
 sice, cat            1  layer            1   20.2199089222525     
 sice, cat            1  layer            2   19.7796379588405     
 sice, cat            1  layer            3   19.5425664182877     
 sice, cat            1  layer            4   19.3695520400775     
 sice, cat            1  layer            5   19.1220019983790     
 sice, cat            1  layer            6   19.2504047186910     
 sice, cat            1  layer            7   21.0537447311902     
 sice, cat            2  layer            1   7.30450537072847     
 sice, cat            2  layer            2   8.34503156955870     
 sice, cat            2  layer            3   9.01434221320233     
 sice, cat            2  layer            4   9.42490589431130     
 sice, cat            2  layer            5   9.68872759926925     
 sice, cat            2  layer            6   10.0215411680662     
 sice, cat            2  layer            7   10.6293880430649     
 sice, cat            3  layer            1   7.89036051946482     
 sice, cat            3  layer            2   8.77482868295520     
 sice, cat            3  layer            3   9.32735705319974     
 sice, cat            3  layer            4   9.68496440713753     
 sice, cat            3  layer            5   10.0141348422266     
 sice, cat            3  layer            6   10.5840616583562     
 sice, cat            3  layer            7   11.4582699427706     
 sice, cat            4  layer            1   10.0090552217314     
 sice, cat            4  layer            2   9.97659382521376     
 sice, cat            4  layer            3   9.98121624821825     
 sice, cat            4  layer            4   10.1028251455134     
 sice, cat            4  layer            5   10.4543489692913     
 sice, cat            4  layer            6   11.3768727784806     
 sice, cat            4  layer            7   12.9025059949524     
 sice, cat            5  layer            1   9.95929460551644     
 sice, cat            5  layer            2   9.93124990759990     
 sice, cat            5  layer            3   9.95343881720853     
 sice, cat            5  layer            4   10.1928707849040     
 sice, cat            5  layer            5   10.8583013186153     
 sice, cat            5  layer            6   12.3650416588220     
 sice, cat            5  layer            7   14.8938537499649     
  
 uvel(i,j)   10.8207370697916     
 vvel(i,j)   6.99214692079876     
  
 atm states and fluxes
             uatm    =  -0.848320343386380     
             vatm    =    2.65704819499085     
             potT    =    271.607269287109     
             Tair    =    271.607269287109     
             Qa      =   2.464670687913895E-003
             rhoa    =    1.30000000000000     
             swvdr   =   0.000000000000000E+000
             swvdf   =   0.000000000000000E+000
             swidr   =   0.000000000000000E+000
             swidf   =   0.000000000000000E+000
             flw     =    258.931945800781     
             frain   =   0.000000000000000E+000
             fsnow   =   4.522630479186773E-005
  
 ocn states and fluxes
             frzmlt  =   -1000.00000000000     
             sst     =    1.18235130585605     
             sss     =    34.0000000000000     
             Tf      =   -1.90458264992426     
             uocn    =   0.000000000000000E+000
             vocn    =   0.000000000000000E+000
             strtltxU=   0.000000000000000E+000
             strtltyU=   0.000000000000000E+000
  
 srf states and fluxes
             Tref    =   2.440707262935295E-003
             Qref    =   2.245595718767618E-008
             Uref    =   1.412707772499829E-005
             fsens   =   6.076961697553104E-005
             flat    =  -1.955840982400989E-005
             evap    =  -6.887109315733744E-012
             flwout  =  -2.668255669750822E-003
  
  
 (abort_ice)ABORTED: 
 (abort_ice) error = (diagnostic_abort)ERROR: bad departure points
Abort(128) on node 43 (rank 43 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 43

For both cases, bumping maxits_nonlin from 4 to 5 allows the run to continue, and QC then passes both against the main simulation (done with maxits_nonlin=4) and against a new run with maxits_nonlin=5 (ppp6_intel_smoke_gx1_80x1_dynpicard_medium_nonlin5_qc.221006-154627, ppp6_intel_smoke_gx1_80x1_dynpicard_medium_nonlin5_qc.221006-154716/).
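In namelist terms this workaround is a one-line change (a sketch of the relevant `ice_in` fragment; `maxits_nonlin` is the `dynamics_nml` setting capping the number of nonlinear (Picard) iterations):

```fortran
&dynamics_nml
  kdyn          = 3    ! implicit VP solver (dynpicard)
  maxits_nonlin = 5    ! was 4; one extra Picard iteration
                       ! lets both cases get past the abort
/
```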

@phil-blain

In both cases, restarting from the time step before the abort and setting coriolis = 'zero' allows the run to continue.

@phil-blain

In both cases, the cell where it fails is right on the ice edge.

@phil-blain

In both cases, bumping dim_pgmres (the number of inner iterations of the PGMRES preconditioner) from 5 to 10 allows the run to continue, while keeping maxits_nonlin=4.

@phil-blain

phil-blain commented Oct 12, 2022

In both cases, loosening the linear tolerance (reltol_fgmres) from 1E-2 to 1E-1 allows the run to continue (!)
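So either of the two linear-solver knobs, with maxits_nonlin left at 4, is enough on its own (again a sketched `dynamics_nml` fragment; only one of the two changed lines is needed at a time):

```fortran
&dynamics_nml
  maxits_nonlin = 4      ! unchanged
  dim_pgmres    = 10     ! default 5: more PGMRES (preconditioner) iterations
  reltol_fgmres = 1e-1   ! default 1e-2: looser FGMRES linear tolerance
/
```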

@phil-blain phil-blain changed the title Verify b4b-ness of different MPI decompositions for the VP solver VP solver robustness issues ("bad departure points") (was: Verify b4b-ness of different MPI decompositions for the VP solver / performance evaluation of repro-vp branch) Oct 18, 2022
@phil-blain phil-blain added this to the Picard solver milestone Oct 18, 2022
@phil-blain

phil-blain commented Oct 21, 2022

The change of default parameters was implemented in CICE-Consortium#774. I'm keeping this open since the underlying robustness issue is not solved.

After discussing with JF, it seems the preconditioner is probably not doing a good enough job, which leads to the FGMRES solver having trouble converging...
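The mechanism can be illustrated outside CICE with a toy preconditioned iteration (a minimal numpy sketch, not CICE's actual FGMRES/PGMRES implementation): when the preconditioner M is a poor approximation of the operator A, reaching the same tolerance takes far more iterations.

```python
import numpy as np

def richardson(A, b, M_inv_diag, tol=1e-8, maxiter=5000):
    """Preconditioned Richardson iteration x += omega * M^{-1} (b - A x).

    M_inv_diag: diagonal of the inverse of a diagonal preconditioner M.
    Returns (x, number of iterations performed)."""
    n = len(b)
    x = np.zeros(n)
    # omega chosen from the extreme diagonal entries of M^{-1} A
    # (a crude eigenvalue estimate, good enough for this toy problem)
    d = M_inv_diag * np.diag(A)
    omega = 2.0 / (d.min() + d.max())
    bnorm = np.linalg.norm(b)
    for k in range(1, maxiter + 1):
        r = b - A @ x
        if np.linalg.norm(r) < tol * bnorm:
            return x, k - 1
        x = x + omega * M_inv_diag * r
    return x, maxiter

# Symmetric test operator with a widely varying diagonal and weak coupling
n = 50
diag = np.linspace(1.0, 100.0, n)
A = np.diag(diag)
A += np.diag(-0.05 * np.ones(n - 1), 1)
A += np.diag(-0.05 * np.ones(n - 1), -1)
b = np.ones(n)

_, it_plain = richardson(A, b, np.ones(n))    # M = I: no preconditioning
_, it_jacobi = richardson(A, b, 1.0 / diag)   # M = diag(A): Jacobi
print(f"no preconditioner: {it_plain} iterations, Jacobi: {it_jacobi}")
```

With the identity "preconditioner" the iteration count scales with the spread of the spectrum of A, while the Jacobi-preconditioned version converges in a handful of iterations; a preconditioner that approximates A poorly behaves like the first case.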


phil-blain added a commit that referenced this issue Mar 7, 2023
* merge latest master (#4)

* Isotopes for CICE (CICE-Consortium#423)

Co-authored-by: apcraig <anthony.p.craig@gmail.com>
Co-authored-by: David Bailey <dbailey@ucar.edu>
Co-authored-by: Elizabeth Hunke <eclare@lanl.gov>

* updated orbital calculations needed for cesm

* fixed problems in updated orbital calculations needed for cesm

* update CICE6 to support coupling with UFS

* put in changes so that both ufsatm and cesm requirements for potential temperature and density are satisfied

* Convergence on ustar for CICE. (CICE-Consortium#452) (#5)

* Add atmiter_conv to CICE

* Add documentation

* trigger build the docs

Co-authored-by: David A. Bailey <dbailey@ucar.edu>

* update icepack submodule

* Revert "update icepack submodule"

This reverts commit e70d1ab.

* update comp_ice.backend with temporary ice_timers fix

* Fix threading problem in init_bgc

* Fix additional OMP problems

* changes for coldstart running

* Move the forapps directory

* remove cesmcoupled ifdefs

* Fix logging issues for NUOPC

* removal of many cpp-ifdefs

* fix compile errors

* fixes to get cesm working

* fixed white space issue

* Add restart_coszen namelist option

* update icepack submodule

* change Orion to orion in backend

remove duplicate print lines from ice_transport_driver

* add -link_mpi=dbg to debug flags (#8)

* cice6 compile (#6)

* enable debug build. fix to remove errors

* fix an error in comp_ice.backend.libcice

* change Orion to orion for machine identification

* changes for consistency w/ current emc-cice5 (#13)

Update to emc/develop fork to current CICE consortium 

Co-authored-by: David A. Bailey <dbailey@ucar.edu>
Co-authored-by: Tony Craig <apcraig@users.noreply.github.com>
Co-authored-by: Elizabeth Hunke <eclare@lanl.gov>
Co-authored-by: Mariana Vertenstein <mvertens@ucar.edu>
Co-authored-by: apcraig <anthony.p.craig@gmail.com>
Co-authored-by: Philippe Blain <levraiphilippeblain@gmail.com>

* Fixcommit (#14)

Align commit history between emc/develop and cice-consortium/master

* Update CICE6 for integration to S2S


* add wcoss_dell_p3 compiler macro

* update to icepack w/ debug fix

* replace SITE with MACHINE_ID

* update compile scripts

* Support TACC stampede (#19)

* update icepack

* add ice_dyn_vp module to CICE_InitMod

* update gitmodules, update icepack

* Update CICE to consortium master (#23)

updates include:

* deprecate upwind advection (CICE-Consortium#508)
* add implicit VP solver (CICE-Consortium#491)

* update icepack

* switch icepack branches

* update to icepack master but set abort flag in ITD routine
to false

* update icepack

* Update CICE to latest Consortium master (#26)


update CICE and Icepack

* changes the criteria for aborting ice for thermo-conservation errors
* updates the time manager
* fixes two bugs in ice_therm_mushy
* updates Icepack to Consortium master w/ flip of abort flag for troublesome IC cases

* add cice changes for zlvs (#29)

* update icepack and pointer

* update icepack and revert gitmodules

* Fix history features

- Fix bug in history time axis when sec_init is not zero.
- Fix issue with time_beg and time_end uninitialized values.
- Add support for averaging with histfreq='1' by allowing histfreq_n to be any value
  in that case.  Extend and clean up construct_filename for history files.  More could
  be done, but wanted to preserve backwards compatibility.
- Add new calendar_sec2hms to converts daily seconds to hh:mm:ss.  Update the
  calchk calendar unit tester to check this method
- Remove abort test in bcstchk, this was just causing problems in regression testing
- Remove known problems documentation about problems writing when istep=1.  This issue
  does not exist anymore with the updated time manager.
- Add new tests with hist_avg = false.  Add set_nml.histinst.

* revert set_nml.histall

* fix implementation error

* update model log output in ice_init

* Fix QC issues

- Add netcdf ststus checks and aborts in ice_read_write.F90
- Check for end of file when reading records in ice_read_write.F90 for
  ice_read_nc methods
- Update set_nml.qc to better specify the test, turn off leap years since we're cycling
  2005 data
- Add check in c ice.t-test.py to make sure there is at least 1825 files, 5 years of data
- Add QC run to base_suite.ts to verify qc runs to completion and possibility to use
  those results directly for QC validation
- Clean up error messages and some indentation in ice_read_write.F90

* Update testing

- Add prod suite including 10 year gx1prod and qc test
- Update unit test compare scripts

* update documentation

* reset calchk to 100000 years

* update evp1d test

* update icepack

* update icepack

* add memory profiling (#36)


* add profile_memory calls to CICE cap

* update icepack

* fix rhoa when lowest_temp is 0.0

* provide default value for rhoa when imported temp_height_lowest
(Tair) is 0.0
* resolves seg fault when frac_grid=false and do_ca=true

* update icepack submodule

* Update CICE for latest Consortium master (#38)


    * Implement advanced snow physics in icepack and CICE
    * Fix time-stamping of CICE history files
    * Fix CICE history file precision

* Use CICE-Consortium/Icepack master (#40)

* switch to icepack master at consortium

* recreate cap update branch (#42)


* add debug_model feature
* add required variables and calls for tr_snow

* remove 2 extraneous lines

* remove two log print lines that were removed prior to
merge of driver updates to consortium

* duplicate gitmodule style for icepack

* Update CICE to latest Consortium/main (#45)

* Update CICE to Consortium/main (CICE-Consortium#48)


Update OpenMP directives as needed including validation via new omp_suite. Fixed OpenMP in dynamics.
Refactored eap puny/pi lookups to improve scalar performance
Update Tsfc implementation to make sure land blocks don't set Tsfc to freezing temp
Update for sea bed stress calculations

* fix comment, fix env for orion and hera

* replace save_init with step_prep in CICE_RunMod

* fixes for cgrid repro

* remove added haloupdates

* baselines pass with these extra halo updates removed

* change F->S for ocean velocities and tilts

* fix debug failure when grid_ice=C

* compiling in debug mode using -init=snan,arrays requires
initialization of variables

* respond to review comments

* remove inserted whitespace for uvelE,N and vvelE,N

* Add wave-cice coupling; update to Consortium main (CICE-Consortium#51)


* add wave-ice fields
* initialize aicen_init, which turns up as NaN in calc of floediam
export
* add call to icepack_init_wave to initialize wavefreq and dwavefreq
* update to latest consortium main (PR 752)

* add initializationsin ice_state

* initialize vsnon/vsnon_init and vicen/vicen_init

Co-authored-by: apcraig <anthony.p.craig@gmail.com>
Co-authored-by: David Bailey <dbailey@ucar.edu>
Co-authored-by: Elizabeth Hunke <eclare@lanl.gov>
Co-authored-by: Mariana Vertenstein <mvertens@ucar.edu>
Co-authored-by: Minsuk Ji <57227195+MinsukJi-NOAA@users.noreply.github.com>
Co-authored-by: Tony Craig <apcraig@users.noreply.github.com>
Co-authored-by: Philippe Blain <levraiphilippeblain@gmail.com>
phil-blain added a commit that referenced this issue Sep 29, 2023
…ICE-Consortium#856)

[commit message body up to "initialize vsnon/vsnon_init and vicen/vicen_init" is identical to the Mar 7, 2023 commit message above; continuing:]

* Update CICE (CICE-Consortium#54)


* update to include recent PRs to Consortium/main

* fix for nudiag_set

allow nudiag_set to be available outside of cesm; may prefer
to fix in coupling interface

* Update CICE for latest Consortium/main (CICE-Consortium#56)

* add run time info

* change real(8) to real(dbl)kind)

* fix syntax

* fix write unit

* use cice_wrapper for ufs timer functionality

* add elapsed model time for logtime

* tidy up the wrapper

* fix case for 'time since' at the first advance

* add timer and forecast log

* write timer values to timer log, not nu_diag
* write log.ice.fXXX

* only one time is needed

* modify message written for log.ice.fXXX

* change info in fXXX log file

* Update CICE from Consortium/main (CICE-Consortium#62)


* Fix CESMCOUPLED compile issue in icepack. (CICE-Consortium#823)
* Update global reduction implementation to improve performance, fix VP bug (CICE-Consortium#824)
* Update VP global sum to exclude local implementation with tripole grids
* Add functionality to change hist_avg for each stream (CICE-Consortium#827)
* Update Icepack to #6703bc533c968 May 22, 2023 (CICE-Consortium#829)
* Fix for mesh check in CESM driver (CICE-Consortium#830)
* Namelist option for time axis position. (CICE-Consortium#839)

* reset timer after Advance to retrieve "wait time"

* add logical control for enabling runtime info

* remove zsal items from cap

* fix typo

---------

Co-authored-by: apcraig <anthony.p.craig@gmail.com>
Co-authored-by: David Bailey <dbailey@ucar.edu>
Co-authored-by: Elizabeth Hunke <eclare@lanl.gov>
Co-authored-by: Mariana Vertenstein <mvertens@ucar.edu>
Co-authored-by: Minsuk Ji <57227195+MinsukJi-NOAA@users.noreply.github.com>
Co-authored-by: Tony Craig <apcraig@users.noreply.github.com>
Co-authored-by: Philippe Blain <levraiphilippeblain@gmail.com>
Co-authored-by: Jun.Wang <Jun.Wang@noaa.gov>