MPI timing out waiting for child process #774

Open
darcywaller opened this issue May 22, 2024 · 13 comments

@darcywaller

Hi team, I'm encountering an issue where simulation with MPIBackend gets stuck somewhere when I try to simulate dipoles with a Network I adapted (i.e., not one of the default networks in hnn-core). MPIBackend works fine in the same environment and Jupyter notebook with the example from the documentation, and simulating with the custom network also works fine until I try to use MPIBackend. Any advice on troubleshooting? I can't upload an example notebook here but can provide the full code as needed.

Full text of the error message:

/oscar/home/ddiesbur/new_mpi_hnn_core/lib64/python3.9/site-packages/hnn_core/parallel_backends.py:195: UserWarning: Timeout exceeded while waiting for child process output. Terminating...
warn("Timeout exceeded while waiting for child process output."

RuntimeError Traceback (most recent call last)
Cell In[10], line 4
2 with MPIBackend(n_procs=2, mpi_cmd='mpiexec'):
3 print("Running simulation with loaded Failed stop parameters")
----> 4 FS_dpls_yesmpi = simulate_dipole(FS_net, tstop=300, n_trials=2)
6 for dpl in FS_dpls_yesmpi:
7 dpl.scale(125).smooth(30)

File /oscar/home/ddiesbur/new_mpi_hnn_core/lib64/python3.9/site-packages/hnn_core/dipole.py:100, in simulate_dipole(net, tstop, dt, n_trials, record_vsec, record_isec, postproc)
95 if postproc:
96 warnings.warn('The postproc-argument is deprecated and will be removed'
97 ' in a future release of hnn-core. Please define '
98 'smoothing and scaling explicitly using Dipole methods.',
99 DeprecationWarning)
--> 100 dpls = _BACKEND.simulate(net, tstop, dt, n_trials, postproc)
102 return dpls

File /oscar/home/ddiesbur/new_mpi_hnn_core/lib64/python3.9/site-packages/hnn_core/parallel_backends.py:717, in MPIBackend.simulate(self, net, tstop, dt, n_trials, postproc)
712 print(f"MPI will run {n_trials} trial(s) sequentially by "
713 f"distributing network neurons over {self.n_procs} processes.")
715 env = _get_mpi_env()
--> 717 self.proc, sim_data = run_subprocess(
718 command=self.mpi_cmd, obj=[net, tstop, dt, n_trials], timeout=30,
719 proc_queue=self.proc_queue, env=env, cwd=os.getcwd(),
720 universal_newlines=True)
722 dpls = _gather_trial_data(sim_data, net, n_trials, postproc)
723 return dpls

File /oscar/home/ddiesbur/new_mpi_hnn_core/lib64/python3.9/site-packages/hnn_core/parallel_backends.py:233, in run_subprocess(command, obj, timeout, proc_queue, *args, **kwargs)
229 warn("Could not kill python subprocess: PID %d" % proc.pid)
231 if not proc.returncode == 0:
232 # simulation failed with a numeric return code
--> 233 raise RuntimeError("MPI simulation failed. Return code: %d" %
234 proc.returncode)
236 child_data = _process_child_data(proc_data_bytes, data_len)
238 # clean up the queue

RuntimeError: MPI simulation failed. Return code: 143

@rythorpe (Contributor)

I'm guessing something about your modified Network or the simulated data it's outputting is significantly larger than for one of the default networks?

Try increasing the timeout in _get_data_from_child_err to 0.05. I've had to do this in the past after scaling up the size of the network.
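
For illustration only (this is not the actual hnn-core helper, just the general shape of it): the parent polls the child's output pipe in short windows, so if each window is too small for a large serialized network, the outer wait can give up before anything arrives. The value being suggested here is that per-poll window.

```python
# Hypothetical sketch of a per-read polling helper; the real
# _get_data_from_child_err in parallel_backends.py differs in detail.
import os
import select

def poll_child_output(fd, poll_timeout=0.05):
    """Return whatever bytes are ready on file descriptor fd within
    poll_timeout seconds, or b'' if nothing arrived in that window."""
    ready, _, _ = select.select([fd], [], [], poll_timeout)
    return os.read(fd, 4096) if ready else b''
```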

@darcywaller (Author)

Ah, thanks for the recommendation @rythorpe. It's possible that's the case because I get more spikes and over a longer time period from the frontal ERP models. I tried 0.05 and then 0.1 for that timeout setting but it's still generating the same error, albeit more slowly.

@jasmainak (Collaborator)

Just to be sure this is not a memory error, could you try on a computer with more RAM?

MPIBackend is notoriously difficult to debug ... could you add print statements to check until where the execution works and at what point it fails?

@darcywaller (Author)

darcywaller commented May 27, 2024

OK, I have an update on this - thank you both for your recommendations.
I've linked a code snippet (gist at the end of this comment) that, when run, reproduces the error I'm getting (at least in my MPI environment on OSCAR). @jasmainak, @rythorpe and I tested this a bit in person last week and determined the following:

  • There is not something inherently wrong with the modified Network I am building because it runs without MPI. (Alternatively, there's something wrong that only causes a bug when using MPI and not unthreaded simulation.)
  • This is not a problem unique to my adapted Pyr cells because it also appears to be the case if we extract, say, the function creating pyramidal_ca cells. As soon as we don't replace Network cells manually, it's not a problem. However, the calcium_model() runs fine if we call it from the hnn_core package directly and don't manually set it up in the notebook. Unclear why that is because I don't see anything blatantly wrong when I look at the number of cells and gids in the network prior to simulation.
  • Changing the timeouts doesn't work.
  • Allocating more cores or RAM doesn't work.

After adding print statements to parallel_backends.py and mpi_child.py, I've determined that:

  • The Network object is being sent to the backend processes (print statements above sent_network = True in run_subprocess() of parallel_backends.py do run), but it is apparently not being received by the child processes (print statements in the _read_net() function of mpi_child.py, between lines 92 and 93, DO NOT run).
  • Print statements in the `if not data_received ...` loop of run_subprocess() in parallel_backends.py run indefinitely without ever entering `if data_len > 0 ...`, which makes sense because the _simulate_single_trial() function in mpi_child.py is never reached, so there is never any data, even partial trial sims, to extract.

Any recommendations for determining what in the Network object file or Network-related communication in MPI is the problem here? https://gist.github.com/darcywaller/a08389cbae826144a19c87e89d1f3f2d

@jasmainak (Collaborator)

@darcywaller unfortunately I won't have time to dig into this.

But perhaps you might want to drill down further in _read_net to understand what is happening? How much of the data is received? Try with 1 core first ... just force the MPIBackend with 1 core to understand if that works ... then try with 2 cores to see if both cores get the net object. You can put if conditions with self.rank == 0, self.rank == 1 etc to test what is happening in specific cores.
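
A minimal sketch of what that could look like (assuming mpi4py is available in the child environment; FS_net is the custom network from earlier in this thread):

```python
# Step 1: rule out multi-process communication by forcing a single process.
from hnn_core import simulate_dipole, MPIBackend

with MPIBackend(n_procs=1, mpi_cmd='mpiexec'):
    dpls = simulate_dipole(FS_net, tstop=300, n_trials=1)

# Step 2: inside mpi_child.py (e.g. in _read_net), print per-rank progress.
# Hypothetical snippet using mpi4py directly; the same idea works with the
# self.rank attribute mentioned above.
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()
if rank == 0:
    print("rank 0: started reading the pickled net", flush=True)
elif rank == 1:
    print("rank 1: waiting for the net broadcast from rank 0", flush=True)
```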

Under the hood, the net object is serialized (made into a string) using the pickle protocol so it can be broadcast to the other cores and then unpickled in the receiving cores. Blake added marker strings before and after the object ("@end_of_net" at the end) to recognize where the serialized object ends ... and extract it using a regular expression. Do you get the entire string, including "@end_of_net", on the other end? Perhaps something in your new network is preventing the regular expression from working correctly ... ? You can print out the serialized object etc. ...
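
A toy round trip of that serialize/extract scheme (illustrative only; the real code in parallel_backends.py / mpi_child.py may encode and delimit the payload differently) can help check whether the custom net even survives pickling outside of MPI:

```python
import pickle
import re

END_MARKER = b'@end_of_net'

def serialize_net(obj):
    # pickle the object and append the end-of-data marker
    return pickle.dumps(obj) + END_MARKER

def extract_net(stream_bytes):
    # pull everything up to the marker back out and unpickle it
    match = re.search(rb'(.*)@end_of_net', stream_bytes, re.DOTALL)
    if match is None:
        raise RuntimeError('end-of-net marker never received')
    return pickle.loads(match.group(1))

# sanity check with a toy object; substitute the custom Network to test the
# pickle round trip without involving MPI at all
assert extract_net(serialize_net({'toy': 1})) == {'toy': 1}
```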

@jasmainak (Collaborator)

jasmainak commented May 28, 2024

Also, as a general comment, it would be helpful to have direct links to the code ... you can click on a line and then click "Copy permalink" on GitHub, e.g., _read_net.

@gtdang (Collaborator)

gtdang commented May 29, 2024

It seems like this is the same issue that's happening with our GitHub Linux runners for pytest: #780

The Ubuntu runners are stalling for about 6 hours before being canceled. The Mac runners are working fine. OSCAR uses Red Hat, so maybe there's something up with OpenMPI and Linux right now... I'll check if there have been recent updates to any of our MPI dependencies.

@rythorpe (Contributor)

I'll try to dig into this soon @darcywaller, but it might be a week or two before I can sit down to debug this properly.

@gtdang I suspect this is a different issue than the one you're referencing because the one @darcywaller encountered still times out. Happy to be wrong though....

@darcywaller (Author)

@rythorpe No problem, totally understand. I'm on vacation till 6/4 anyway, but am happy to help by starting to try some of @jasmainak's new suggestions when back.

@darcywaller (Author)

Update - some more troubleshooting determined that MPI was having trouble unpickling the network, even though the pickle and its beginning and end markers were intact. Following @ntolley and @rythorpe's suggestion that the partial functions I was using in the network cells might be causing this issue when implemented in the notebook rather than within the hnn-core code, I added the network as a default network on my own hnn-core branch. When importing that network from hnn-core instead, MPI now works, so presumably that was indeed the issue.
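
For anyone hitting this later, here is a minimal sketch (function names are made up) of why a callable defined in a notebook breaks unpickling in a separate process, which matches the behaviour above:

```python
import pickle
from functools import partial

def my_conductance(dist, gbar):  # defined in __main__, i.e. a notebook cell
    return gbar * dist

blob = pickle.dumps(partial(my_conductance, gbar=4e-5))

# In the same process this works, because pickle stores the function by
# reference (__main__.my_conductance) and that name is resolvable here:
pickle.loads(blob)

# In the MPI child, __main__ is mpi_child.py, so the same bytes raise
# AttributeError: Can't get attribute 'my_conductance' on <module '__main__'>.
# Moving the callable into an importable module (e.g. an hnn-core submodule,
# as done above) makes it resolvable from any process.
```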

@rythorpe (Contributor)

Oh nice, glad you got it working. I'm guessing there's a security feature in pickle that allows callables to be unpickled only if they originate within local source code and/or a submodule of the parent library. Perhaps the best fix for this bug would be to remove all callables from cell templates. @jasmainak @ntolley any thoughts? Maybe we can tackle this this week?

@ntolley (Contributor)

ntolley commented Jul 31, 2024

The callables are somewhat convenient because you'll have to hard-code the conductances for every section otherwise.

Since we don't really officially support adding new cells yet, is this really a restriction that we want?
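
For concreteness, a toy sketch of the two styles being weighed here (not the actual cell-template code; names and values are placeholders):

```python
from functools import partial
import math

def _exp_decay(dist_from_soma, gbar_soma, decay_const=50.0):
    """Toy distance-dependent conductance rule."""
    return gbar_soma * math.exp(-dist_from_soma / decay_const)

# Callable style: one rule reused across sections, but it only unpickles in
# another process if it lives in an importable module.
gbar_rule = partial(_exp_decay, gbar_soma=4e-5)

# Hard-coded style: plain data that always pickles, but every section must
# be written out explicitly.
gbar_per_section = {'apical_trunk': 4e-5, 'apical_1': 2.4e-5, 'apical_tuft': 1.5e-5}
```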

ntolley added this to the 0.5 milestone on Jul 31, 2024
@rythorpe (Contributor)

I don't think this is urgent, but the root issue is that this error is thrown when a Network is instantiated outside of one of the hnn_core submodules (e.g., network_models.py), regardless of whether or not new cells are added. While this is only relevant for advanced users, it should be possible for someone to instantiate a Network object, add their own connections, and then run it from an arbitrary script.
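
In other words, something like this standalone script (a hedged sketch; the customization details are omitted) should eventually just work:

```python
# standalone_run.py -- run a customized network under MPI from outside hnn_core
from hnn_core import jones_2009_model, simulate_dipole, MPIBackend

net = jones_2009_model()
# ... customize connectivity here in the script (e.g., net.add_connection(...)),
# rather than inside an hnn_core submodule ...

if __name__ == '__main__':
    with MPIBackend(n_procs=2, mpi_cmd='mpiexec'):
        dpls = simulate_dipole(net, tstop=170., n_trials=1)
```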
