Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add run time info #114

Merged
merged 9 commits into from
Sep 7, 2023
Merged

add run time info #114

merged 9 commits into from
Sep 7, 2023

Conversation

junwang-noaa
Copy link

Add run time information in MOM6 NUOPC cap.

@junwang-noaa
Copy link
Author

@jiandewang @DeniseWorthen Please let me know if you have any suggestions/concerns.

@codecov-commenter
Copy link

codecov-commenter commented Jul 5, 2023

Codecov Report

Merging #114 (3a0b235) into dev/emc (55e0472) will increase coverage by 4.17%.
Report is 459 commits behind head on dev/emc.
The diff coverage is 38.34%.

❗ Current head 3a0b235 differs from pull request most recent head 328766d. Consider uploading reports for the commit 328766d to get more accurate results

@@             Coverage Diff             @@
##           dev/emc     #114      +/-   ##
===========================================
+ Coverage    33.99%   38.17%   +4.17%     
===========================================
  Files          259      269      +10     
  Lines        70267    76503    +6236     
  Branches     13020    14064    +1044     
===========================================
+ Hits         23885    29202    +5317     
- Misses       41880    42054     +174     
- Partials      4502     5247     +745     
Files Changed Coverage Δ
...g_src/drivers/solo_driver/MESO_surface_forcing.F90 0.00% <0.00%> (ø)
...g_src/drivers/solo_driver/user_surface_forcing.F90 0.00% <0.00%> (ø)
...c/external/GFDL_ocean_BGC/generic_tracer_utils.F90 0.00% <0.00%> (ø)
...src/external/database_comms/MOM_database_comms.F90 0.00% <0.00%> (ø)
...ernal/database_comms/database_client_interface.F90 0.00% <0.00%> (ø)
config_src/infra/FMS1/MOM_coms_infra.F90 59.47% <0.00%> (+24.86%) ⬆️
config_src/infra/FMS1/MOM_io_infra.F90 22.00% <0.00%> (+2.06%) ⬆️
src/ALE/MOM_hybgen_regrid.F90 0.00% <0.00%> (ø)
src/ALE/MOM_hybgen_remap.F90 0.00% <0.00%> (ø)
src/ALE/P1M_functions.F90 0.00% <0.00%> (ø)
... and 185 more

... and 36 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@jiandewang
Copy link
Collaborator

@junwang-noaa have you tried this code in S2S run to make sure you get information as you expected ?

@jiandewang
Copy link
Collaborator

@junwang-noaa I tried to use your MOM6 branch in UWM and run one case (cpld_bmark_p8_intel) but it died right away in ocean PE. Is there any setting I need to modify ?
see run directory at /scratch1/NCEPDEV/stmp2/Jiande.Wang/FV3_RT/rt_97485/cpld_bmark_p8_intel

@junwang-noaa
Copy link
Author

I will take a look, thanks

@jiandewang
Copy link
Collaborator

@junwang-noaa UWM code and script at /scratch1/NCEPDEV/climate/Jiande.Wang/working/MOM6-add-cap-Jun/ufs-weather-model

@junwang-noaa
Copy link
Author

junwang-noaa commented Aug 17, 2023

@jiandewang The issue is fixed and I ran the RT on hera and the test is running fine. Would you please review the code? Thanks

@jiandewang
Copy link
Collaborator

@junwang-noaa do you keep the run log on HERA ?

@junwang-noaa
Copy link
Author

/scratch1/NCEPDEV/nems/Jun.Wang/nems/vlab/20230705/ufs-weather-model/tests/logs/RegressionTests_hera.log

Please let me know if you see any issue, I am going to run full RT after that

@jiandewang
Copy link
Collaborator

/scratch1/NCEPDEV/nems/Jun.Wang/nems/vlab/20230705/ufs-weather-model/tests/logs/RegressionTests_hera.log

Please let me know if you see any issue, I am going to run full RT after that

@junwang-noaa I don't see any issue grom the "out" file in your run dir, you may start full RT

@DeniseWorthen
Copy link
Collaborator

DeniseWorthen commented Aug 17, 2023

Won't this be written for all users? There doesn't seem a way to do this optionally.

@jiandewang
Copy link
Collaborator

@DeniseWorthen do you mean you are worrying that NCAR side will see these information in their run log?

@DeniseWorthen
Copy link
Collaborator

DeniseWorthen commented Aug 17, 2023

Yes, this might not be something another user of the nuopc mom cap needs or wants. I had started to try to get a wrapper mod working for MOM (like we are trying to do for CICE) but it isn't working completely yet. Wrapping everything in 'if (.not.cesmcoupled) will be really messy.

@jiandewang
Copy link
Collaborator

One thing I can do is provide Jun's branch to NCAR and ask them whether they feel those run time information is a hassle for them or not

@junwang-noaa
Copy link
Author

junwang-noaa commented Aug 18, 2023

Thanks for checking, Jiande. @DeniseWorthen I am not sure if we want to isolate the code with a wrapper. I do see some print lines in the initialization/finalize in the original code. I thought these info could benefit everyone to understand/monitor computational performance, but I also understand people may have their own preference.

@jiandewang
Copy link
Collaborator

Thanks for checking, Jiande. @DeniseWorthen I am not sure if we want to isolate the code with a wrapper. I do see some print lines in the initialization/finalize in the original code. I thought these info could benefit everyone to understand computational performance, but I also understand people may have their own preference.

@junwang-noaa let me contact NCAR and see what's their feedback

@alperaltuntas
Copy link

I have two quick comments:

  • Rather than MPI_Wtime, the MOM_cpu_clock module routines should be used to measure and log elapsed time.
  • Instead of write(*,*), you should have write(stdout,*) as is done in the rest of the code. Otherwise, MOM6 logs go to the wrong log files in CESM.

@junwang-noaa
Copy link
Author

junwang-noaa commented Aug 18, 2023

@alperaltuntas Thanks for your comments.

  1. I made changes for stdout, I removed the timing in InitializeP0 subroutine as "stdout" is not defined yet in that subroutine and it is called only once and timing is small.

  2. Here we are looking for temporal run time information that allows us to configure the MPI tasks for MOM6 in coupled runs (run time in every step and waiting time between run steps). So far I only saw a statistics information for the whole integration from the MOM_cpu_clock, maybe I missed something. Any suggestions?

@DeniseWorthen
Copy link
Collaborator

@junwang-noaa Should this information be written to a MOM specific log file vs just stdout for UFS?

@junwang-noaa
Copy link
Author

I feel either way is fine as the UFS stdout already has the MOM print info.

@alperaltuntas
Copy link

@junwang-noaa, As an example of how the MOM_cpu_clock is used to measure elapsed time of a particular code segment, search for id_clock_forcing in drivers/nuopc_cap/mom_surface_forcing_nuopc.F90. Note the usage of cpu_clock_id, cpu_clock_begin and cpu_clock_end function/subroutine calls. When you utilize these routines, the timing information will automatically end up in the log file and you won't have to worry about expicitly writing out timing.

@alperaltuntas
Copy link

I am also mentioning @marshallward in case he has further comments regarding the usage of clocks.

@junwang-noaa
Copy link
Author

junwang-noaa commented Aug 21, 2023

@alperaltuntas @marshallward I did look at the MOM_cpu_clock which uses the MPP_clock. Sorry, I did not make it clear. I am trying to get timing information for the NUOPC phases including initialization (where the MOM_cpu_clock is not available at the beginning of the initialization phase yet) and run/finalize phases. For the run phase, I need to get timing information for every time step (e.g. timing variance at different IO output time including history file, restart file etc). I can add cpu_clock_begin and cpu_clock_end in the run phase, but I can only get a total run time at the end of integration when the mpp_exit is called. Also I want to catch the waiting time of MOM6 grid component during coupling time step ( the time difference between MOM leaves a NUOPC run step and the next time it enters the run step), I am not sure if the MOM_cpu_clock can be used for this purpose.

@marshallward
Copy link

@junwang-noaa As long as the FMS mpp_init() has been called, the mpp_clock_(begin|end) functions should be available. The output from these clocks will also appear alongside the MOM6 clock output, which is generated by FMS. (As you mention above, cpu_clock_(begin|end) are the MOM interfaces to the FMS clocks. You would probably only use those inside of MOM6.)

@marshallward
Copy link

I just looked at the code, I agree with @alperaltuntas that there should be no need to use MPI_Wtime inside of the cap. Times measured inside the cap should use cpu_clock_*.

I don't know how your coupler works, but it sounds like you want to log wait times outside of the cap. Is that right? In that case, I think it is a job for your coupler, not MOM, so you should use timer provided by the coupler.

If the coupler happens to have access to FMS, then you could possibly use mpp_clock_*, but I suspect you probably want a model-wide solution, or perhaps just MPI_Wtime.

@junwang-noaa
Copy link
Author

junwang-noaa commented Aug 21, 2023

@marshallward Thanks for looking into this issue. One question is that I need to have the run time information at every time step, but the mpp_clock can only output the total time for each clock_id at the end of integration. How do I get that timing information at every integration time step?

In UFS coupler runs on different MPI tasks, checking the waiting for other component requires additional communication, I think it is the simplest if we get the timing on each component itself just use mpi_wtime. Is it possible that we can implement this capability with a wrapper so that other applications that uses the NUOPC cap mom_cap will not be impacted? Thanks for any suggestions.

@marshallward
Copy link

@junwang-noaa You are right, the FMS timers are only integrated and won't provide per-time steps; if that is what you need, then you would also need an alternative timer.

If I were to do this, I would consider using the system_clock() intrinsic rather than an MPI call. I am a bit concerned that MPI_Wtime()'s awareness of MPI ranks outside of the MOM communicator could cause problems. For example, if the global timer sync were enabled, then the model could hang. Since you only report the root PE time, it should be safe to use system_clock().

I would also suggest wrapping this in some sort of debug flag, to avoid excessive output to stdout. Saving to a file might also be useful.

In fact, given that every timer check happens at the beginning and end of each function, it still feels natural for the times to reside in the functions which call them, and might be better positioned to report them with respect to the other submodels. But perhaps that is just a subjective opinion.

Having said all that, cap maintainers are the ones responsible for this code, and are free to make any changes that they deem necessary, so don't feel obliged to adopt any of these suggestions. Whatever you decide will be fine from our perspective.

@alperaltuntas
Copy link

I agree with Marshall's suggestions, and especially the suggestion for wrapping the write calls in a switchable flag (that should be false by default.)

@junwang-noaa
Copy link
Author

junwang-noaa commented Aug 23, 2023

@marshallward @alperaltuntas Thank you very much for your comments. I have updated the code to

  1. using a nems configuration variable: RunTimeLog to control if the run time information will be output.

  2. In mom_cap.F90, there is an option to have the mom6 run log going to a specific log file:
    https://github.com/NOAA-EMC/MOM6/blob/dev/emc/config_src/drivers/nuopc_cap/mom_cap.F90#L456
    So when the RunTimeLog is turned on, the timing info will go to a user specified file.

  3. I tried both MPI_Wtime and system.clock to see the timing difference. I used:

integer(8)           :: counti=0
integer(8)           :: countf=0
integer(8)           :: count_rate=1
...
  if(write_runtimelog) call system_clock(counti,count_rate)
...
  if(write_runtimelog) then
call system_clock(countf)
if(is_root_pe()) write(stdout,*) 'In ',trim(subname),' time ', REAL(countf-counti)/REAL(count_rate)

I see some differences compared to MPI_Wtime(), below - lines are from MPI_Wtime() while +lines are from system.clock.

- In (MOM_cap:InitializeRealize) time   0.208017795346677
+In (MOM_cap:InitializeRealize) time   0.208026000000000
- In (MOM_cap:DataInitialize) time   7.365932688117027E-004
+In (MOM_cap:DataInitialize) time   1.171000000000000E-003
- In (MOM_cap:ModelAdvance) time   6.539979949593544E-005
+In (MOM_cap:ModelAdvance) time   6.900000000000000E-005

To be consistent with other model components, I'd like to use MPI_Wtime(). Since these calls are within MOM6 NUOPC component, they are called on the MPI ranks within the communicator of the MOM6 grid component vm. So I think they won't be called outside MOM6 tasks. If MOM6 runs on global communicator, it will still work and the timing is what we want to get.

Please let me know if you have any further questions/suggestions. Really appreciate you reviewed the code.

@marshallward
Copy link

marshallward commented Aug 23, 2023

Thanks for posting the numbers, that's an interesting comparison. system_clock is usually a thin wrapper to clock_gettime() in Linux (GFortran does this, for example), although compilers can do whatever they choose. MPI_Wtime appears to use gettimeofday() in MPICH, which is older than clock_gettime and is considered more problematic. But again, the MPI library can use whatever they like.

Given that your other submodels use MPI_Wtime, it suppose it makes more sense for you to stay consistent. But if this affects any other potential NUOPC users, we may need to come back to this decision.

@jiandewang
Copy link
Collaborator

@alperaltuntas do you think this PR is in good shape now ? If yes can you make a test run on your system ? Thanks

@alperaltuntas
Copy link

It all looks good and our tests are passing. Thanks for checking with us!

@jiandewang
Copy link
Collaborator

It all looks good and our tests are passing. Thanks for checking with us!

@alperaltuntas thank you very much.

@jkbk2004
Copy link

jkbk2004 commented Sep 6, 2023

@junwang-noaa I wonder if draft status can be converted to PR to get approvals from @jiandewang

@jiandewang
Copy link
Collaborator

@junwang-noaa I think you can convert it to formal PR since NCAR side has no objection on it.

@junwang-noaa junwang-noaa marked this pull request as ready for review September 6, 2023 14:39
@junwang-noaa
Copy link
Author

@jiandewang Done.

@FernandoAndrade-NOAA
Copy link

Testing on #1823 has completed successfully, please continue the merge process for this PR. Thank you.

@jiandewang jiandewang merged commit be40a41 into NOAA-EMC:dev/emc Sep 7, 2023
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants