Updates to validation and docs for distributed simulations #3429
Conversation
@glwagner and @simone-silvestri Continuing the discussion: it looks like MPI needs to be initialized before accessing its communication variables, e.g. before the calls in Oceananigans.jl/validation/distributed_simulations/distributed_nonhydrostatic_turbulence.jl, lines 20 to 22 (commit 2946027).

Should we initialize MPI when importing DistributedComputations?
In my opinion no, it's fine to have an explicit MPI.Init() in the script. The other option, which avoids using MPI variables, is to hardcode the number of processors.
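For reference, the explicit-initialization pattern under discussion looks roughly like this (a sketch using MPI.jl directly, not any Oceananigans-specific API):

```julia
using MPI

# MPI must be initialized before its communication variables can be queried.
MPI.Initialized() || MPI.Init()

comm   = MPI.COMM_WORLD
rank   = MPI.Comm_rank(comm)
Nranks = MPI.Comm_size(comm)
```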
I am still working on that and the PR is in draft mode, but I will let you know when it is ready for review.
I just pushed changes for the geostrophic adjustment case. The code seems alright, but it returns the following warning:

Then it returns the following error:

Any idea what it could be, @simone-silvestri?
Either 1) explicitly invoke MPI.Init() in the script, or 2) hardcode the number of ranks.
This advice seems to contradict the goal of having scripts that can be changed from non-distributed to distributed with a single line. I don't think user scripts should have to write explicit MPI calls.
Just to be clear: our goal is to develop an API for distributed simulations that requires a "minimum" of changes relative to the same script for a non-distributed simulation. The point of this goal is to make it easy to scale simulations from single-process to multi-process.
Does it make sense to initialize MPI when importing DistributedComputations?
I don't want to initialize MPI when importing the module, because you might import the module but not use MPI routines. If you want to explicitly use MPI routines in the script (as in the example above), it makes sense to initialize MPI yourself rather than hide the initialization somewhere else. If you want, you can introduce an API for calculating the rank and the size that internally initializes MPI, although I don't really deem it necessary to have our own take on MPI, since MPI is already very standardized.
Just to clarify, this is already possible, just by doing

```julia
arch = Distributed()
rank = arch.local_rank
```

The script, on the other hand, explicitly uses the MPI module before constructing the architecture, for which it makes sense to have an explicit MPI.Init().
But why do we need to calculate the rank and size? That's the root issue. I'm not suggesting that we build an API to emit the rank and size. I'm suggesting that we design the code so that kind of calculation is not necessary.

Consider: rank and size do not exist for non-distributed simulations. So no matter how it is done, if your code needs to calculate the rank and size, then it is not "agnostic" to being distributed vs serial. That is distributed-explicit code.

To make it very clear: the objective is to provide a codebase that permits distributed simulations with minimal code changes to serial scripts. If you find that users are constantly writing distributed-specific code, then a rethink is needed.

Agreed that we should not initialize when importing the module! It doesn't make sense.
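For concreteness, a minimal sketch of the kind of one-line change being aimed for (the grid size and model setup are illustrative, and the script would be launched with something like `mpiexec -n 4 julia script.jl`):

```julia
using Oceananigans
using Oceananigans.DistributedComputations

# Serial script:
# arch = CPU()

# Distributed script: ideally this is the only line that changes.
arch = Distributed(CPU())

grid  = RectilinearGrid(arch, size = (64, 64, 1), extent = (1, 1, 1))
model = NonhydrostaticModel(; grid)
```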
Maybe the best option is to hardcode the number of ranks instead of querying it from MPI.
That's related to #3374, and the warning comes from Julia v1.10 (it wasn't there with Julia v1.9).
@iuryt this works already if you don't want to hardcode the ranks or import the MPI module. I would leave the validation case like this, though.
Distributed IO is not supported yet and will require a lot of work and refactoring, so for the time being we have to make do with inserting the rank into the output writers' filenames.
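For reference, a sketch of the manual workaround being described, with the rank spliced into the filename by hand (the setup and filename are illustrative, not copied from the validation script):

```julia
using Oceananigans
using Oceananigans.DistributedComputations

arch = Distributed(CPU())
rank = arch.local_rank

grid  = RectilinearGrid(arch, size = (64, 64), extent = (2π, 2π),
                        topology = (Periodic, Periodic, Flat))
model = NonhydrostaticModel(; grid, advection = WENO())
simulation = Simulation(model, Δt = 0.01, stop_time = 1)

# Each rank writes its own file, so for now the rank must be inserted by hand.
simulation.output_writers[:fields] =
    JLD2OutputWriter(model, model.velocities;
                     filename = "distributed_turbulence_rank$(rank)",
                     schedule = TimeInterval(0.1),
                     overwrite_existing = true)
```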
We need output features that don't require manually inserting the rank into the filename.
Distributed IO is not necessary to solve this problem. We can introduce a convention for filenames whereby the rank number is automatically inserted into the filename for distributed simulations. Similarly, the convention can be reversed when reading the output back in. There are probably plenty of other possible solutions. The first step is recognizing that a problem or deficiency exists, and having the desire to fix it. Then we can do the fun part, which is to design solutions to an important problem.

Note that distributed IO is unlikely to ever be supported for JLD2. So to provide a seamless experience with JLD2, we really do require a filename convention to solve the problem. (For NetCDF, we could envision providing a choice between file splitting and distributed IO.)

Filename conventions also seem preferable to me for many cases in a world of GPUs, where each rank likely needs to hold a substantial portion of the computation for performance. Splitting files by rank also limits the file size, which makes it easier to transfer data when simulations are very large, as will probably often be the case for distributed simulations.
Sure, that is just how we are doing it now (the difference would be just inserting the rank behind the scenes). @iuryt if you want to have a go at it in this PR, that should be quite simple to implement (the rank is held in the architecture as arch.local_rank).

This API "problem" does not exhaust the IO issue, though. The problem I was referring to is having split files. I still believe that distributed IO is necessary to have a fully functioning distributed code.
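A hypothetical sketch of how such a convention could be spliced in behind the scenes; `rank_suffix` and `distributed_filename` are illustrative names, not existing Oceananigans functions:

```julia
using Oceananigans
using Oceananigans.DistributedComputations: Distributed

# Serial architectures leave the filename untouched;
# distributed architectures get a rank-dependent suffix.
rank_suffix(arch) = ""
rank_suffix(arch::Distributed) = "_rank$(arch.local_rank)"

distributed_filename(filename, arch) = filename * rank_suffix(arch)

# Usage sketch:
# arch = Distributed(CPU())
# distributed_filename("geostrophic_adjustment", arch)  # "geostrophic_adjustment_rank0" on rank 0
```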
I will work on that.

What's the problem with split files? Combining data into JLD2 files is not always possible, is it? It would require the entirety of the output field to fit into the memory of one node. It also may not be efficient. Correct me if you think otherwise, but it seems we will always want to support split output, even if we also support combined output for distributed simulations.
Sorry @glwagner
I was responding to @simone-silvestri! In terms of user scripts, I certainly do think it's better to access the rank through the architecture than to call MPI directly.

However, I think that the output writers should automatically change the filename when the simulation is distributed. This should be pretty easy and just involves adding a rank-dependent suffix to the filename.

We should also have an API for callbacks that are intended to only run on one rank (for example for printing stuff), e.g. a property that restricts a callback to a given rank.

There are probably other useful things.
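One way to get rank-restricted behavior today, without any new API, is to guard inside the callback on the rank stored in the architecture. A sketch, continuing from the output-writer example above and assuming the model's architecture field holds the Distributed architecture with its local_rank:

```julia
using Printf

# Print progress only on rank 0 so the log is not duplicated across processes.
function progress(sim)
    arch = sim.model.architecture
    if arch.local_rank == 0
        @info @sprintf("Iteration: %d, time: %s",
                       sim.model.clock.iteration, prettytime(sim.model.clock.time))
    end
    return nothing
end

simulation.callbacks[:progress] = Callback(progress, IterationInterval(100))
```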
I liked the idea of automatically adding the suffix. I can work on that too. This is becoming an interesting PR.

For the callbacks, the idea is that the callback stores the rank it should run on, and then we need to make sure the simulation checks the rank from the architecture before calling it.

I am not sure I understand the last paragraph. I understand the part about including the meta information about the processor, but not the rest.
@iuryt can we do it in a new PR? I can get it started to illustrate, and you can help me by refining the implementation and getting the tests to pass. What do you think?

Since the new PR will affect this one, I will work on the new PR first and then come back here.
Perhaps we should also update this docstring in this PR? Oceananigans.jl/src/DistributedComputations/distributed_architectures.jl, lines 174 to 209 (commit 075d42c).

The docstring's function signature does not mirror the actual signature, and some of the explanation could be expanded.
Definitely fix it if it's not accurate.
@glwagner
Up to you, but it seems like you will save yourself some effort if you first update the user interface and then update the validation scripts?

Is it so hard to update the validation scripts too? Hopefully that should be easy, and it doesn't really matter what you do first. We use the validation scripts to test the user interface, so you'll be changing them no matter what, in either case.
Sorry for taking so long to come back here.
@simone-silvestri can probably contribute a simple script (maybe something that adapts a baroclinic adjustment case) for a distributed GPU validation.
The GPU validation should be identical: you just need to change the architecture. The rest of the script does not change.
We still need actual, explicit code. That way you don't have to make this comment "it doesn't change" every time.
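In the spirit of that request, a hedged sketch of what the GPU variant might look like, assuming Distributed(GPU()) is the intended constructor; everything downstream of the architecture is the same as in the CPU script:

```julia
using Oceananigans
using Oceananigans.DistributedComputations

# CPU validation script uses: arch = Distributed(CPU())
arch = Distributed(GPU())

# The rest of the script is identical to the CPU version, e.g.
grid  = RectilinearGrid(arch, size = (64, 64, 1), extent = (2π, 2π, 1))
model = NonhydrostaticModel(; grid, advection = WENO())
```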
@iuryt can I bring this PR to completion? I think we just need to update it and bring it up to speed with the new Oceananigans version.

This PR just updates two validation tests, right?
Sorry for not giving updates. Let me know if you need any help. I can also run the examples from my side to test it.

I think this is stale but reopen if needed.
From #2788