slow regional runs with nuopc vs mct #1907

Closed
jkshuman opened this issue Nov 17, 2022 · 26 comments

@jkshuman
Contributor

jkshuman commented Nov 17, 2022

In testing #1892, the regional runs are significantly slower than similar runs with mct.

A series of 10-day runs will allow us to investigate changes in setup and timing; a rough sketch of the corresponding xmlchange commands follows the list below.

  • original setup (similar to mct) for South America FATES runs
  • NTASKS=-1
  • NTASKS_PER_INST (skip per Bill)
  • PIO_TYPENAME=pnetcdf
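
For reference, the modified settings above correspond roughly to the following xmlchange commands (a sketch only; NTASKS_PER_INST is skipped, per Bill):

./xmlchange NTASKS=-1
./xmlchange PIO_TYPENAME=pnetcdf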

case directories:

  • "original setup ntasks=8, SROF nuopc": /glade/scratch/jkshuman/archive/setup_orig_SROF_SAmer_543e4243a_d63b8d21
  • "modify ntasks=-1, SROF nuopc": /glade/work/jkshuman/FATES_cases/test/setup_ntasks_SROF_SAmer_543e4243a_d63b8d21
  • "orig ntasks= 8, piotype=pnetcdf, SROF nuopc": /glade/work/jkshuman/FATES_cases/test/setup_origntasks_piotype_SROF_SAmer_543e4243a_d63b8d21
  • "modify ntasks= -1, piotype=pnetcdf, SROF nuopc": /glade/work/jkshuman/FATES_cases/test/setup_ntasks_piotype_SROF_SAmer_543e4243a_d63b8d21
  • "BSacks rec for ntasks=288,ntasks_atm=36, rootpe=36,rootpe_atm=0, piotype=pnetcdf": /glade/work/jkshuman/FATES_cases/test/setup_bsacks_ntasks_rootpe_piotype_SROF_SAmer_543e4243a_d63b8d21
  • "original mct SROF": /glade/work/jkshuman/FATES_cases/test/setup_mct_orig_SROF_SA_616905bbb_d63b8d21

MOSART compset (a nullMosart setting that did not actually set it to NULL...)

  • "original setup ntasks:8 nuopc": /glade/work/jkshuman/FATES_cases/test/setup_orig_nullMosart_SAmer_543e4243a_d63b8d21
  • "modify ntasks:-1 nuopc": /glade/work/jkshuman/FATES_cases/test/setup_ntasks_nullMosart_SAmer_543e4243a_d63b8d21
  • "modify ntasks:-1, piotype:pnetcdf nuopc, nullMosart": /glade/work/jkshuman/FATES_cases/test/setup_ntasks_piotype_nullMosart_SAmer_543e4243a_d63b8d21

tagging @billsacks @ekluzek @slevisconsulting for discussion

definition of done: recommendations for regional case setup

@billsacks
Member

Thanks for starting this, @jkshuman . I would skip the NTASKS_PER_INST setting... that may be relevant for a multi-instance / multi-driver case, but otherwise I think that just will confuse this investigation. (Unless others know something that I'm missing.)

@jkshuman
Contributor Author

jkshuman commented Nov 17, 2022

@billsacks I added a run that removes the change to ntasks_per_inst.

  • path "modify ntasks & piotype nuopc": /glade/work/jkshuman/FATES_cases/test/setup_ntasks_piotype_removeinst_SA_543e4243a_d63b8d21

@jkshuman
Contributor Author

Adding @billsacks's comments on the potential culprit here:
Following up from the discussion at this morning's ctsm-software meeting, I did a bit of digging as to how PIO_TYPENAME is set. It looks like what's happening is:

(1) Jackie's runs use a resolution of USRDAT

(2) The USRDAT resolution sets the default PE layout to use a single task

(3) This block of code sets PIO_TYPENAME to netcdf when using a single task:

https://github.com/ESMCI/cime/blob/28b7431c2424a345785f4aa2391c5cf9bd9f837b/CIME/case/case.py#L1589-1595

One workaround for this, besides explicitly setting PIO_TYPENAME via an xmlchange, is to specify --pecount on the create_newcase line so that the number of tasks is set to something greater than 1 from the start, which in turn keeps PIO_TYPENAME as pnetcdf. Off-hand, I'm struggling a bit to come up with a robust way to get the correct settings out of the box for various situations. Would it make sense to have separate USRDAT_1pt and USRDAT_regional (or something like that) just for the sake of setting the default PE layouts differently??? I imagine that might be messy, though. Another option might be to move the above CIME code to case.setup time, so as long as you change the PE layout before your first call to case.setup, it wouldn't be invoked – but changing that setting at that point might come with its own issues (e.g., confusion if the user has already tried to manually set it), so I'm not sure whether that's a good idea. This might take some more brainstorming.
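
For concreteness, the two workarounds mentioned above would look roughly like this (a sketch; the task count passed to --pecount is illustrative and should match the layout you actually want):

# workaround 1: after the case exists, explicitly override the single-task default
./xmlchange PIO_TYPENAME=pnetcdf

# workaround 2: request more than one task at case creation, so the single-task
# default (and therefore the netcdf PIO default) is never applied
./create_newcase ... --pecount 36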

@billsacks
Member

@slevisconsulting asked:

If the default for USRDAT were opposite, i.e. pnetcdf for single task cases, would that cause trouble in some other way? If not, then that could be a reasonable solution. Otherwise, I like the idea of making --pecount a required input when selecting --res USRDAT.

That's a good question. It occurred to me too; I assumed it would be a problem for whatever reason motivated the change in CIME, but I'll check with Jim Edwards to see.

@billsacks
Member

I found this note from Jim from 6 years ago:

at low processor counts, your performance is better using netcdf, but at high processor counts you're better using pnetcdf

which is probably what inspired the use of netcdf for single-processor cases.

@billsacks
Member

@jkshuman - I think there was something wrong with your case setup scripts. The original case has NTASKS=8 for all components - which maybe was what you intended, though I seem to remember that your real original case had NTASKS_ATM=36 (i.e., 1 full node). But the other cases all have NTASKS=1... maybe you didn't change that setting?

If you want to avoid redoing multiple tests, you could just do one additional test that is exactly like the original (i.e., NTASKS=8) but with PIO_TYPENAME=pnetcdf. Or you could do two additional tests: one with NTASKS=36 (or equivalently NTASKS=-1) and one with that setting plus PIO_TYPENAME=pnetcdf.
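
A minimal sketch of the second option, with each test in its own clone of the original case:

# clone 1:
./xmlchange NTASKS=-1            # equivalently NTASKS=36 on cheyenne
# clone 2:
./xmlchange NTASKS=-1
./xmlchange PIO_TYPENAME=pnetcdf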

@billsacks
Member

I did take a quick look through your timing file from the original case, though, and I think this might just be a matter of needing to throw more processors at it: almost all of the time is spent in CTSM (not DATM or the coupler), and of the lnd run time, 58% is spent in canflux, which I don't think involves any I/O. So I don't think that tweaking I/O settings will make much difference, and the main thing you can do to speed it up is to throw more processors at it – at least one full node (36 processors), and maybe more given the size of this region.

If you do have an apples-to-apples comparison of a nuopc and mct case that shows that the nuopc case is running significantly slower, that would be interesting to see. I'd be surprised based on what I can see in the timing file, but sometimes surprises can be fun :-)

@jkshuman
Contributor Author

@billsacks I was using a MOSART compset, so that may have contributed. The mct case is getting underway; I will update the paths when the apples-to-apples comparisons are done.
But it's possible this was an errant setup doomed to be slow.

@jkshuman
Contributor Author

@billsacks I updated the nuopc runs, but I am having a hard time getting a successful mct run. I am going to abandon mct for the moment, but am happy to try again if there are any ideas. I tried with CLM5.1 and CLM5.0 for mct but keep getting a variation of this error:

256:MPT ERROR: Rank 256(g:256) is aborting with error code 2.
256: Process ID: 37639, Host: r10i4n31, Program: /glade/scratch/jkshuman/test8_setup_mct_orig_nullMosart_atm1_CLM50_SA_616905bbb_d63b8d21/bld/cesm.exe
256: MPT Version: HPE MPT 2.22 03/31/20 15:59:10
256:
256:MPT: --------stack traceback-------
156:MCT::m_SparseMatrixPlus:: FATAL--length of vector y different from row count of sMat.Length of y = 2610 Number of rows in sMat = 55296
156:09C.MCT(MPEU)::die.: from MCT::m_SparseMatrixPlus::initDistributed_()
95:MPT ERROR: Rank 95(g:95) is aborting with error code 2.
95: Process ID: 45698, Host: r10i3n1, Program: /glade/scratch/jkshuman/test8_setup_mct_orig_nullMosart_atm1_CLM50_SA_616905bbb_d63b8d21/bld/cesm.exe
95: MPT Version: HPE MPT 2.22 03/31/20 15:59:10
95:
95:MPT: --------stack traceback-------
164:0A4.MCT(MPEU)::die.: from MCT::m_SparseMatrixPlus::initDistributed_()
149:MPT ERROR: Rank 149(g:149) is aborting with error code 2.
149: Process ID: 32096, Host: r10i2n32, Program: /glade/scratch/jkshuman/test8_setup_mct_orig_nullMosart_atm1_CLM50_SA_616905bbb_d63b8d21/bld/cesm.exe
149: MPT Version: HPE MPT 2.22 03/31/20 15:59:10
149:
149:MPT: --------stack traceback-------
-1:MPT ERROR: MPI_COMM_WORLD rank 323 has terminated without calling MPI_Finalize()
-1: aborting job

@billsacks
Member

That looks like a problem with mapping files being inconsistent with the domain. 55296 is the size of a 0.9x1.25 domain, I think, so you probably have one or more files (e.g., mapping files) that are from that rather than your regional domain.

If you have an old MCT case sitting around that showed the kind of speed you're expecting, you can point me to that and no need to try to reproduce this exactly apples-to-apples.

@olyson
Contributor

olyson commented Nov 18, 2022

Indeed, part of the problem may be that mosart is still on, as evidenced by an rof log file in your run directory, despite the NULL setting. I've had problems turning off mosart using NULL after creating a case. This might work (I don't remember):
./case.setup --reset
./case.build --clean-all
qcmd -- ./case.build
Or you might have to go back to using SROF in your compset longname when creating the case.

@jkshuman
Contributor Author

@olyson I had tried running with SROF and got a failure, so I just went back to MOSART. I will try again with SROF using a long-name compset.

@jkshuman
Contributor Author

Thanks for the discussion on this. (@billsacks I must have had a typo in my path for the mct domain directory, but I fixed the domain path and have a somewhat comparable mct run listed above.) To complete this set of tests, I ran these with SROF using the long-name compset. I had failures with the alias I2000Clm51FatesRs, but have not explored those further. The test paths are updated at the top of the issue.

@billsacks is this the minimum recommendation for these regional subsets?

  • use SROF compset unless MOSART active
  • PIO_TYPENAME=pnetcdf
  • JOB_QUEUE=REGULAR (any recommendation here?)
  • NTASKS=-1 (Is there a recommendation here, or will that be based on the user's case?)

@ekluzek
Collaborator

ekluzek commented Nov 18, 2022

The importance of using the regular JOB_QUEUE is to prevent the use of the shared queue, which can hurt performance: since you are sharing the node with others, what they do can affect your program's run. Technically, the only way to get no interference from others is to have a dedicated machine, which isn't really possible, but at least staying out of the shared queue is important.

As for NTASKS=-1: this will depend on the user's case, and you can go up to the number of grid cells in your regional domain. NTASKS=-1 means use a full node, which on cheyenne is 36 tasks. You do normally want to use full nodes, so the minus syntax is helpful for that.

@billsacks
Member

Sorry for my delay in getting back to this. I looked at the new cases (though not the ones where you used MOSART with NULL mode), and this really looks like it's mainly (or entirely) a matter of needing to throw more processors at the problem. The switch from netcdf to pnetcdf doesn't help for your original 8-task case and in fact makes things a little slower – though that could just be machine variability, or it could be that pnetcdf really only helps for larger processor counts. The main contributor to the run time is CTSM (not DATM or the coupler); of this, the main contributor is can_iter, but fates_wrap_update_hifrq_hist, surfalb and surfrad also take significant time. This looks to me like it's probably a computational or memory bottleneck in CTSM, not anything to do with using CMEPS or CDEPS. When you increased from 8 to 36 processors, the runtime improved considerably, though not quite linearly with the number of processors.

The MCT case you ran uses a very different processor layout:

	NTASKS: ['CPL:288', 'ATM:36', 'LND:288', 'ICE:288', 'OCN:288', 'ROF:288', 'GLC:288', 'WAV:288', 'IAC:1', 'ESP:288']

	ROOTPE: ['CPL:36', 'ATM:0', 'LND:36', 'ICE:36', 'OCN:36', 'ROF:36', 'GLC:36', 'WAV:36', 'IAC:0', 'ESP:0']

It looks like what's going on here is that, when you set up the MCT case, you started with an f09 case and then changed things from there; whereas with the NUOPC case, you started with USRDAT resolution and then changed things from there. So I guess the out-of-the-box PE layout for the MCT case was similar to that of an f09 case, though I'm confused as to why you got this particular PE layout, because our standard f09 layout on cheyenne uses a lot more processors than that. Whatever the explanation, I have a feeling that the differences you're seeing are due to the differences in processor counts more than anything else. I'm not sure if this totally explains the differences, because the MCT case gives a 13x improvement in land run time for an 8x increase in processors. But I have seen better-than-linear scaling like this before when there are memory bottlenecks, so I wouldn't be surprised if most / all of this can be explained by the different processor layout.

So can you try the following in a NUOPC case?

./xmlchange NTASKS=288
./xmlchange NTASKS_ATM=36
./xmlchange ROOTPE=36
./xmlchange ROOTPE_ATM=0
./xmlchange PIO_TYPENAME=pnetcdf

@billsacks
Member

is this the minimum recommendation for these regional subsets?

  • use SROF compset unless MOSART active
  • PIO_TYPENAME=pnetcdf
  • JOB_QUEUE=REGULAR (any recommendation here?)
  • NTASKS=-1 (Is there a recommendation here, or will that be based on the user's case?)

@ekluzek partly replied to this, but yes, I think this is right, though as @ekluzek says the ntasks recommendation would depend on the number of grid cells.
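
Pulling the recommendations in this thread together, a minimal sketch (queue name and task count are machine- and case-dependent, and the SROF choice happens at case creation):

./xmlchange PIO_TYPENAME=pnetcdf
./xmlchange JOB_QUEUE=regular    # i.e., avoid the shared queue; the exact queue name depends on the machine
./xmlchange NTASKS=-1            # one full node; increase toward the number of grid cells as needed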

@jkshuman
Contributor Author

@billsacks updated this with a nuopc case using your recommendations.

./xmlchange NTASKS=288
./xmlchange NTASKS_ATM=36
./xmlchange ROOTPE=36
./xmlchange ROOTPE_ATM=0
./xmlchange PIO_TYPENAME=pnetcdf

case: /glade/work/jkshuman/FATES_cases/test/setup_bsacks_ntasks_rootpe_piotype_SROF_SAmer_543e4243a_d63b8d21

@billsacks
Member

Thanks a lot @jkshuman . That NUOPC case gets closer to the timing of the MCT case, but is still slower – lnd run time of 0.67 sec/day instead of 0.40 sec/day. But I think I see why: It looks like your MCT case uses a domain file with nearly 1/2 of the grid cells masked out as ocean, whereas the regional mesh file used in your NUOPC case has a mask that is 1 everywhere.

This leads to the following:

For the MCT case:

  Surface Grid Characteristics
    longitude points          =           45
    latitude points           =           58
    total number of gridcells =         1322
    total number of landunits =         2618
    total number of columns   =         5986
    total number of patches   =        24494
    total number of cohorts   =      1983000

For the NUOPC case:

  Surface Grid Characteristics
    longitude points          =           45
    latitude points           =           58
    total number of gridcells =         2610
    total number of landunits =         5158
    total number of columns   =         8666
    total number of patches   =        45206
    total number of cohorts   =      3915000

I'm not sure what the recommended way (if any) is for setting the mask on the NUOPC mesh file. But for now, a quick test to confirm that this explains most / all of the timing difference would be to rerun the MCT case changing fatmlndfrc (via the xml variables LND_DOMAIN_PATH and LND_DOMAIN_FILE) to point to a modified version of /glade/work/jkshuman/sfcdata/domain.lnd.fv0.9x1.25_gx1v6.SA.nc, where you change the "mask" variable to be 1 everywhere. Then the MCT case would be using a mask consistent with the NUOPC case; I know this isn't the mask you want to be using, but it would give us more of an apples-to-apples comparison.
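
A hedged sketch of that quick test (the copy destination and file name here are illustrative; the source domain file and the xml variables are the ones named above, and the mask edit assumes NCO's ncap2 is available):

cp /glade/work/jkshuman/sfcdata/domain.lnd.fv0.9x1.25_gx1v6.SA.nc /path/to/modified/domain.lnd.SA_mask1.nc
ncap2 -O -s 'mask=mask*0+1' /path/to/modified/domain.lnd.SA_mask1.nc /path/to/modified/domain.lnd.SA_mask1.nc   # set mask to 1 everywhere
# then, from the MCT case directory:
./xmlchange LND_DOMAIN_PATH=/path/to/modified
./xmlchange LND_DOMAIN_FILE=domain.lnd.SA_mask1.nc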

@jkshuman
Contributor Author

@billsacks thanks for looking at this. It makes sense. Will update here when I get around to modifying the MCT domain file as suggested for that comparison. It seems that a longer term solution would be to modify the subset script to mask out the ocean.

@ekluzek
Collaborator

ekluzek commented Nov 30, 2022

@jkshuman Yes, the subset script could modify the mask based on the global mask from the mesh file. The other way to do this now would be to use the mesh_modifier script to get the right mask on the regional mesh file. That's something you could do with the existing tool, although there's a bit to figuring out how to use it. But @slevisconsulting or I could help if you need some guidance...

@jkshuman
Contributor Author

@ekluzek Thanks for that guidance. I will look into using the existing mesh_modifier script, and get in touch for help.

@billsacks
Member

the subset script could modify the mask based on the global mask from the mesh file

That may have been part of the original motivation for subsetting the global mask rather than creating a new mask from scratch based on the coordinates.

@niuhanlin

Is there a recommended setting if you only run CLM-FATES with MCT?

Here's my setup, but it's still slow.
./xmlchange NTASKS_CPL=3
./xmlchange NTHRDS_CPL=64
./xmlchange ROOTPE_CPL=0
./xmlchange NTASKS_ATM=5
./xmlchange NTHRDS_ATM=64
./xmlchange ROOTPE_ATM=0
./xmlchange NTASKS_LND=10
./xmlchange NTHRDS_LND=64
./xmlchange ROOTPE_LND=0

@jkshuman
Contributor Author

@niuhanlin we recommend that you switch over to NUOPC, and then optimize based on your specific needs. MCT is no longer supported.

On that note, I should close this issue, as it is likely the additional ocean tiles that cause the slower performance. Any disagreement on closing this, @ekluzek @billsacks?

@jkshuman
Contributor Author

The recommendation was to test this again with the ocean tiles masked out. I have not had time to perform that test, but based on the discussion and review by @billsacks, that is the key difference.

@billsacks
Member

@niuhanlin - in most cases, we recommend using a single thread (NTHRDS_* = 1). As @jkshuman says, this issue is essentially resolved, and I agree that it can be closed. @niuhanlin if you want further support, I recommend opening a new Discussion topic (https://github.com/ESCOMP/CTSM/discussions) or forum post (https://bb.cgd.ucar.edu/cesm/). There, please give details on your configuration so we can give you better guidance.
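
For reference, a minimal sketch of that threading recommendation, applied to all components at once:

./xmlchange NTHRDS=1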
