Investigate using parallel IO #34
Comments
I've been looking at the CICE PIO code. It is not as complete as the serial netCDF code; for example, it doesn't do proper error checking. The PIO code still exists and is documented in CICE6. My next step is to see whether it can be built on raijin. Another option, which may be better even if PIO works, is to take the MOM5 approach and have each PE output to its own file followed by an offline collate. The advantage of this would be that we can continue to use the existing netCDF code (with slight modifications). The downside would be that we need to write a collate program. |
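For context, the MOM5-style approach amounts to something like the per-PE filename scheme below, which an offline collation tool (e.g. mppnccombine) then stitches back together. The naming convention is only an illustration of the idea, not what a CICE collate tool would necessarily use:

```fortran
program per_pe_filename
  use mpi
  implicit none
  integer :: my_rank, ierr
  character(len=64) :: fname

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierr)

  ! Each PE writes its local tile to its own file; a collate step later
  ! merges iceh.nc.0000, iceh.nc.0001, ... into a single iceh.nc.
  write(fname, '(a,".",i4.4)') 'iceh.nc', my_rank
  ! ... open fname with the existing serial netCDF code and write the local tile ...

  call MPI_Finalize(ierr)
end program per_pe_filename
```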
@nichannah With the moves by Ed Hartnett towards implementing PIO in FMS, I think it would be best to go the PIO route to stay reasonably compatible with future FMS and CICEn. |
Steps to build CICE with PIO:
I also tried openmpi/1.10.2 but the build failed with link errors.
|
It looks like the CICE PIO code makes use of something called shr_pio_mod. Getting a compile error like:
The code can be found here: https://github.com/CESM-Development/cesm-git-experimental/tree/master/cesm/models/csm_share |
Netcdf 4.7.1 is now installed on raijin on top of hdf5/1.10.5. The parallel version, 4.7.1p (and hdf5/1.10.5p), is built with openmpi/4.0.1. |
Followed my instructions as above with new versions and the configure step hangs. This seems to be caused by:
The following works:
This is the hanging command:
For the time being I'm using old compiler versions to try to get things working. |
Current status is that PIO is building; I need to modify the CICE PIO support so that it works without CESM dependencies. The main difficulty here is that the CICE PIO code assumes that initialisation has already been done somewhere else (perhaps as part of a coupled model), so proper PIO initialisation needs to be written. |
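A minimal sketch of what a standalone PIO initialisation could look like, using the ParallelIO library's Fortran intracomm interface. The communicator name, stride, base and task counts are illustrative assumptions, not the values CICE actually uses:

```fortran
subroutine ice_pio_init_standalone(ice_comm, iosystem)
  use mpi
  use pio                                ! ParallelIO Fortran API
  implicit none
  integer,               intent(in)  :: ice_comm  ! CICE's MPI communicator (assumed)
  type(iosystem_desc_t), intent(out) :: iosystem  ! PIO IO-system handle

  integer :: my_rank, npes, ierr
  integer :: stride, niotasks, base, num_aggregator

  call MPI_Comm_rank(ice_comm, my_rank, ierr)
  call MPI_Comm_size(ice_comm, npes, ierr)

  ! Illustrative choices: every 4th PE is an IO task, starting from PE 0.
  stride         = 4
  base           = 0
  niotasks       = max(1, npes / stride)
  num_aggregator = 1                     ! number of MPI aggregators

  call PIO_init(my_rank, ice_comm, niotasks, num_aggregator, stride, &
                PIO_rearr_box, iosystem, base=base)
end subroutine ice_pio_init_standalone
```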
The PIO code is ready to be tested; however, there is a problem with netCDF, compiler and OpenMPI version compatibility between the new CICE and the rest of the model. So this issue now depends on upgrading these components. |
In ACCESS-OM2 sea ice concentration is passed to MOM via OASIS; I couldn't find a relevant diagnostic here |
I've put this in the WOMBAT version but I've been holding off on issuing a pull request until @nichannah updates the way he proposes to pass new fields. https://github.com/russfiedler/MOM5/blob/wombat/src/mom5/ocean_core/ocean_sbc.F90#L5971 |
Also, as a note to the above: netCDF on gadi should be suitable for PIO. |
Updated PIO build instructions:
Note that logging is enabled above. This will need to be changed in production. To build using CMake: CC=mpicc FC=mpif90 cmake -DWITH_PNETCDF=OFF -DNetCDF_C_LIBRARY="${NETCDF}/lib/ompi3/libnetcdf.so" -DNetCDF_C_INCLUDE_DIR="${NETCDF}/include/" -DNetCDF_Fortran_LIBRARY="${NETCDF}/lib/ompi3/Intel" -DNetCDF_Fortran_INCLUDE_DIR="${NETCDF}/include/Intel" ../ |
Preliminary results from a 10-day 0.1 deg run with daily CICE output: previously writing output was 15% of CICE runtime; it's now 6%. MOM is now spending less than half as much time waiting on ice, down from 12% of runtime to 5%. The interesting thing now is to see how this scales. Presumably the existing approach will not scale well as we increase the number of CICE CPUs. It would be good to see whether we can increase the number of CICE CPUs to further reduce the MOM wait time. Aim to get this below 1%. |
Thanks @nichannah, that's great news. Did you run your test with 799 CICE cores? And am I right in thinking CICE with PIO uses all cores (rather than a subset like MOM io_layout)? If so, I'm a little surprised it didn't speed up more, if there are 799x more cores doing the output. I guess there's some extra overhead in PIO? @marshallward's tests on Raijin showed CICE would scale well up to about 2000 cores and is still reasonable at 3000 (see table below). If so, I guess we'd need over 4000 CICE cores to get below 1% MOM wait time, which seems rather a lot. But in our standard configs (serial CICE io, monthly outputs) MOM spends just under 2% of its time waiting for CICE, so 1% is better than we're used to. |
Thanks @aekiss, that's useful. I'm now running a test to see how a run with daily output compares to one with monthly output. If that is OK then perhaps we can start to use this feature before spending more time on optimisation. |
I believe PIO allows some sort of flexibility with which PEs are used https://ncar.github.io/ParallelIO/group___p_i_o__init.html . I don't know how flexible this is in what has been written for CICE. There is an interesting point made in the FAQ that it's sometimes worth moving the IO away from the root PE/task (and I presume node) due to the heavier load there. |
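Continuing the illustrative PIO_init sketch from earlier in the thread, the base and stride arguments are what select which PEs act as IO tasks. The values below are assumptions only, showing how the IO tasks could be moved off the busy root PE as the FAQ suggests:

```fortran
! Illustration only: one IO task every 4 ranks, starting at rank 1 so that
! rank 0 (the root PE, which already carries extra load) is not an IO task.
base     = 1
stride   = 4
niotasks = max(1, (npes - base) / stride)
call PIO_init(my_rank, ice_comm, niotasks, num_aggregator, stride, &
              PIO_rearr_box, iosystem, base=base)
```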
Yes, it looks like there's some configuration optimisation that we can do with this. Presently I'm just using the simplest config which is a stride of 1 - so all procs are writing output. I have just completed two 2 month runs:
Basically 1) is writing about 8 GB per month and 2) is writing 8 GB per day. The runtime of these two runs is almost identical. Looking at ice_diag.d, the time taken for writing out history is similar, but the PIO case is about 5% slower. See
Incidentally, there seems to be something strange happening with the atm halo timers in the new PIO run. The mean time in the PIO run is 6 seconds but for the regular run it is 106 seconds. A possible explanation for this is that the PEs within CICE are better matched so collective operations don't have to wait as long on lagging PEs. So this new feature should allow daily ice output with no performance penalty over the existing configuration. I think it makes sense to merge this into master. Any objections? @aekiss? Future work will involve looking at the scaling and performance of the whole model in more detail and at that point I can look at the different configuration options of PIO if ice output is a bottleneck. |
That's great that daily output can be done with nearly the same runtime. If you're confident that the output with PIO is bitwise identical to the non-PIO version then I see no reason not to merge into master, given that it makes daily output practical. Also is compressed output still possible with PIO? |
I agree that filling land with 0 seems the better option, rather than hoping we remember this gotcha into the indefinite future... |
The solution to this is not completely satisfactory. The obvious way to get netCDF to put 0s in places where no data is written is to set _FillValue = 0. This can be a bit confusing because there is then no difference between "no data" and "data with value 0". However I think this is probably still better than the alternative, which is needing to fix up CICE restarts whenever the PE layout changes. See attached Tsfcn: the white has value 0 and the red is mostly -1.8. |
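For reference, a minimal netCDF-Fortran sketch of setting a zero _FillValue on a restart variable. The file, dimension and variable names here are made up for illustration; this is not the actual CICE restart code:

```fortran
program fill_zero_example
  use netcdf
  implicit none
  integer :: ncid, dimids(2), varid, ierr

  ! Hypothetical file/variable names, purely for illustration.
  ierr = nf90_create('restart_example.nc', NF90_NETCDF4, ncid)
  ierr = nf90_def_dim(ncid, 'ni', 360, dimids(1))
  ierr = nf90_def_dim(ncid, 'nj', 300, dimids(2))
  ierr = nf90_def_var(ncid, 'Tsfcn', NF90_DOUBLE, dimids, varid)

  ! Cells that are never written (e.g. land blocks eliminated from the PE
  ! layout) then appear as 0 rather than the default fill value.
  ierr = nf90_put_att(ncid, varid, '_FillValue', 0.0_8)

  ierr = nf90_enddef(ncid)
  ierr = nf90_close(ncid)
end program fill_zero_example
```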
For example the point (474, 2613) is land but unmasked so you could check its value for every field in every restart file in |
CF conventions allow for missing data: http://cfconventions.org/cf-conventions/cf-conventions.html#missing-data
Thanks @nichannah, I'm closing this issue now. We decided in the 14 Oct TWG meeting that this issue with restarts is not significant enough to warrant fixing, and that a fix with a change to We just need to remember to fill in the cpu-masked cells with zero values if restarting with a changed cpu layout. I've done a test run at 0.1deg with PIO (using commit 7c74942) to compare to one without PIO (using commit 26e6159). |
Sorry @nic, I'm reopening again - I've hit a bug using PIO in a 1deg configuration. For the 1deg config I'm using one core per chunk, laid out the same way as
I have repeated identical runs
and got differing output in these files and variables:
Note that this issue only appears in multi-category variables. For example, here's category 0 of one affected field; the problem occurs in different places in other fields. I've only seen this problem in category 0, but I haven't checked thoroughly. I didn't see this issue with the 0.1deg config. Maybe I need better choices for |
Oops, apologies @nichannah - this was just because I was calling When I use
in |
The OpenMPI docs say |
On Gadi it appears that romio is used by default. Also we need to specify the number of MPI aggregators explicitly to avoid the heuristic/algorithm that usually sets this. This algorithm appears to get confused with the combination of (chunksize != tile size) and deflation on. The confusion leads to a divide-by-zero. I haven't spent the time to really understand this bug/problem so you could say that |
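For illustration, one generic way to pass an explicit aggregator count down to an MPI-IO/ROMIO layer is through an MPI info object with the standard cb_nodes hint. Whether and where CICE/PIO exposes such a hint is an assumption here, not something stated above:

```fortran
program set_romio_hints
  use mpi
  implicit none
  integer :: info, ierr

  call MPI_Init(ierr)
  call MPI_Info_create(info, ierr)

  ! 'cb_nodes' is the standard ROMIO hint for the number of collective-
  ! buffering aggregators; setting it explicitly bypasses ROMIO's own
  ! heuristic for choosing the aggregator count.
  call MPI_Info_set(info, 'cb_nodes', '4', ierr)

  ! The info object would then be passed to the file open, e.g.
  ! call MPI_File_open(MPI_COMM_WORLD, 'out.nc', &
  !                    MPI_MODE_CREATE + MPI_MODE_WRONLY, info, fh, ierr)

  call MPI_Info_free(info, ierr)
  call MPI_Finalize(ierr)
end program set_romio_hints
```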
Thanks for the explanation @nichannah |
@nichannah FYI: PIO seems to slow down CICE at 1 deg but is improved at 0.25 deg (comparing the fraction of MOM runtime in oasis_recv and the max CICE I/O time in seconds). The CICE cores are spread between nodes on gadi at 1 deg, with 1+216+24 cores for yatm/mom/cice, so that might be part of the problem: COSIMA/access-om2#212 and COSIMA/access-om2#202 |
…MOM5#317; CICE uses PIO: COSIMA/cice5#34); configuration changes to support PIO in CICE
I've also tried 1 deg with 4 chunks: it is almost as fast as the 24-chunk case (though slower than without PIO) but should be faster to read in most circumstances than 24 chunks. However I'm thinking a 180x150 4-chunk layout is probably a better match to hemisphere-based access patterns, so I might try that too. This run was for 5 years, rather than 3mo as in the previous and next posts, so I haven't included Max CICE I/O time. It's a bit faster in a 3mo test - see next post.

0.25 deg with 4 chunks is now somewhat slower than without PIO, but I'm reluctant to use too many chunks in case it slows down reading. Note that this run was for 2 years, rather than 3mo as in the previous and next posts. Also I should have mentioned that these 1 deg and 0.25 deg tests all had identical ice outputs, but they differ from the ice outputs in the production 0.1deg runs I reported here, so they aren't directly comparable to those. |
Some more tests of differing chunk layouts, comparing the fraction of MOM runtime in oasis_recv and the max CICE I/O time (s). The first of these is slightly faster (presumably because it is consistent with the 15x300 core layout), but the difference is small, so I will use 180x150 for the new 1 deg configs as this is better suited to typical access patterns of reading one hemisphere or the other. The fraction of MOM runtime in oasis_recv with 90x300 is smaller in the 3 mo case compared to 5 yr: 0.067 rather than 0.085 (see prev post). So for 3 mo runs the 4-chunk cases (0.067, 0.072) are nearly as fast as the 24-chunk case (0.062) and considerably faster than 1 chunk (0.096) - see post before last. |
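As a side note, in plain netCDF-Fortran terms a 180x150 chunk layout on a 360x300 1 deg grid would be declared roughly as below. The dimension sizes and variable names are assumptions for illustration; CICE/PIO sets chunking through its own configuration rather than this direct call:

```fortran
program chunking_example
  use netcdf
  implicit none
  integer :: ncid, dimids(2), varid, ierr

  ierr = nf90_create('hist_example.nc', NF90_NETCDF4, ncid)
  ierr = nf90_def_dim(ncid, 'ni', 360, dimids(1))   ! assumed 1 deg grid size
  ierr = nf90_def_dim(ncid, 'nj', 300, dimids(2))
  ierr = nf90_def_var(ncid, 'aice', NF90_FLOAT, dimids, varid)

  ! Four 180x150 chunks per 2D field: one per grid quadrant, which matches
  ! hemisphere-style access patterns better than many small chunks.
  ierr = nf90_def_var_chunking(ncid, varid, NF90_CHUNKED, (/180, 150/))

  ierr = nf90_enddef(ncid)
  ierr = nf90_close(ncid)
end program chunking_example
```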
For future reference: the processor masking in the ice restarts can be fixed with https://github.com/COSIMA/topogtools/blob/master/fix_ice_restarts.py, allowing a change in processor layout during a run. |
This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there: https://forum.access-hive.org.au/t/payu-generated-symlinks-dont-work-with-parallelio-library/1617/3 |
It may be worth trying to compile with parallel IO using PIO (setenv IO_TYPE pio). We currently compile CICE with serial IO (setenv IO_TYPE netcdf in bld/build.sh), so one CPU does all the IO and we end up with an Amdahl's law situation that limits the scalability with large core counts. At 0.1 deg CICE is IO-bound when doing daily outputs (see Timer 12 in ice_diag.d), and the time spent in CICE IO accounts for almost all the time MOM waits for CICE (oasis_recv in access-om2.out), so the whole coupled model is waiting on one CPU. With daily CICE output at 0.1 deg this is ~19% of the model runtime (it's only ~2% without daily CICE output). Lowering the compression level to 1 (#33) has helped (MOM wait was 23% with level 5), and omitting static field output (#32) would also help. Also I understand that PIO doesn't support compression - is that correct?
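As a rough worked example of the Amdahl's law point above, using the ~19% figure quoted for daily output and treating it as a fixed serial fraction s (a simplifying assumption), with N the number of cores:

$$ S(N) = \frac{1}{s + (1-s)/N}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{s} \approx \frac{1}{0.19} \approx 5.3 $$

So however many cores are added, the model cannot speed up by more than about 5x while one CPU does all the ice IO.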
@russfiedler had these comments on Slack:
Slack discussion: https://arccss.slack.com/archives/C9Q7Y1400/p1557272377089800