MPI timing out waiting for child process #774
Comments
I'm guessing something about your modified network is the culprit. Try increasing the timeout in parallel_backends.py.
Ah, thanks for the recommendation @rythorpe. It's possible that's the case because I get more spikes and over a longer time period from the frontal ERP models. I tried 0.05 and then 0.1 for that timeout setting but it's still generating the same error, albeit more slowly.
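For context, the timeout under discussion governs how long the parent process waits for output from the MPI child before terminating it. Below is a minimal, hypothetical sketch of that pattern — not hnn-core's actual implementation (the real logic lives in run_subprocess in parallel_backends.py, where the traceback shows timeout=30 being passed):

```python
import subprocess
import sys


def run_child(code, timeout=30.0):
    """Run `code` in a child Python process under a wall-clock timeout.

    Loosely mirrors the parent/child pattern MPIBackend uses: if the
    child produces no output before `timeout` seconds, kill it and
    raise. `run_child` and its `timeout` knob are illustrative only.
    """
    proc = subprocess.Popen([sys.executable, "-c", code],
                            stdout=subprocess.PIPE, text=True)
    try:
        out, _ = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
        raise RuntimeError("Timeout exceeded while waiting for child "
                           "process output. Terminating...")
    return out
```

Note that if the child hangs for a reason unrelated to slowness (e.g. a deserialization failure), raising the timeout only delays the same error — consistent with what was observed above.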
Just to be sure this is not a memory error, could you try on a computer with more RAM? MPIBackend is notoriously difficult to debug ... could you add print statements to parallel_backends.py and mpi_child.py to narrow down where it hangs?
OK, I have an update on this - thank you both for your recommendations.
After adding print statements to parallel_backends.py and mpi_child.py, I've determined that:
Any recommendations for determining what in the Network object, or in the Network-related MPI communication, is the problem here? https://gist.github.com/darcywaller/a08389cbae826144a19c87e89d1f3f2d
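One low-tech way to narrow this down is to check, in the parent process, whether the Network survives a pickle round trip at all: MPIBackend has to serialize the Network to hand it to mpi_child.py, so an unpicklable attribute otherwise only surfaces in the child. A sketch, where check_picklable is a hypothetical helper (not part of hnn-core) and the dict stands in for the Network object:

```python
import pickle


def check_picklable(obj):
    """Sanity-check that `obj` survives a pickle round trip.

    Running this in the parent gives a direct, local error message
    instead of an opaque child-process timeout.
    """
    payload = pickle.dumps(obj)
    restored = pickle.loads(payload)
    print(f"round-tripped {len(payload)} bytes -> {type(restored).__name__}")
    return restored


# e.g. check_picklable(net) for an hnn-core Network; a dict shown here:
check_picklable({"gbar": 0.5})
```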
@darcywaller unfortunately I won't have time to dig into this. But perhaps you might want to drill down further in mpi_child.py. Under the hood, the MPIBackend pickles the Network object and sends it to a child process, which unpickles it before running the simulation.
Also, as a general comment, it would be helpful to have direct links to the code ... you can click on a line and then click "copy permalink" on GitHub, e.g., _read_net
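To make the "under the hood" description concrete: the parent serializes the Network and streams it to the child, bracketed by sentinel markers so the child can locate the payload in its input stream (the later update in this thread mentions the pickle's beginning and end markers). The sketch below is illustrative only — the sentinel values and base64 framing are assumptions, not hnn-core's exact wire format:

```python
import base64
import pickle

# Hypothetical sentinels; the real marker strings differ.
BEGIN, END = b"<BEGIN>", b"<END>"


def frame(obj):
    """Parent side: pickle, base64-encode (stream-safe), wrap in sentinels."""
    return BEGIN + base64.b64encode(pickle.dumps(obj)) + END


def unframe(stream):
    """Child side: locate the payload between the sentinels and unpickle."""
    body = stream.split(BEGIN, 1)[1].split(END, 1)[0]
    return pickle.loads(base64.b64decode(body))
```

The key point for debugging: the markers can be perfectly intact while the unpickling step inside unframe still fails, because framing and deserialization are independent steps.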
It seems like this is the same issue that's happening with our GitHub Linux runners for pytest (#780). The Ubuntu runners are stalling for about 6 hours before being canceled. The Mac runners are working fine. Oscar uses RedHat, so maybe there's something up with OpenMPI and Linux right now... I'll check if there have been recent updates to any of our MPI dependencies.
I'll try to dig into this soon @darcywaller, but it might be a week or two before I can sit down to debug this properly. @gtdang I suspect this is a different issue than the one you're referencing because the one @darcywaller encountered still times out. Happy to be wrong though.... |
@rythorpe No problem, totally understand. I'm on vacation till 6/4 anyway, but am happy to help by starting to try some of @jasmainak's new suggestions when back. |
Update - some more troubleshooting determined that MPI was having trouble unpickling the network, even though the pickle and its beginning and end markers were intact. @ntolley and @rythorpe suggested that the partial functions I was using in the network's cells might cause this when defined in the notebook rather than within the hnn-core code, so I added the network as a default network on my own hnn-core branch. When importing that network from hnn-core instead, MPI now works, so presumably that was indeed the issue.
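The pickling failure described above can be reproduced in miniature without MPI at all. A module-level function (like one shipped inside the hnn-core package) pickles by reference to its importable name, so a functools.partial over it round-trips; a function defined in a local scope — analogous to helpers built ad hoc in a notebook session — has no importable name and cannot be pickled. Names below are made up for illustration:

```python
import functools
import pickle

# Module-level function: picklable by reference.
def conductance(distance, gbar):
    return gbar * distance

weighted = functools.partial(conductance, gbar=0.5)
restored = pickle.loads(pickle.dumps(weighted))  # round-trips fine
print(restored(2.0))  # 0.5 * 2.0 = 1.0

# Function defined inside another scope: no importable qualified name,
# so pickling fails outright -- consistent with the fix of moving the
# network definition into the hnn-core package itself.
def make_local():
    def local_conductance(distance):
        return 0.5 * distance
    return local_conductance

try:
    pickle.dumps(make_local())
except (pickle.PicklingError, AttributeError) as err:
    print("unpicklable:", err)
```

Note the subtlety with notebooks: even functions defined at a notebook's top level pickle by reference to __main__, which the MPI child process cannot resolve, so they fail on the child side rather than in the parent.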
Oh nice, glad you got it working. I'm guessing there's a security feature in the unpickling step that balks at callables defined outside an importable module.
The callables are somewhat convenient because you'll have to hard-code the conductances for every section otherwise. Since we don't really officially support adding new cells yet, is this really a restriction that we want?
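To illustrate the trade-off being weighed here (names and values below are invented, not hnn-core's actual API): hard-coding gives a plain, always-picklable mapping but must enumerate every section, while a callable is compact but must live in an importable module for MPI pickling to succeed:

```python
import math

# Option 1: hard-coded conductances per section.
# Verbose, but a plain dict is trivially picklable anywhere.
gbar_by_section = {"soma": 0.06, "dend_1": 0.04, "dend_2": 0.02}


# Option 2: a rule expressed as a callable.
# Concise and generalizes to any section, but only picklable if it is
# defined in an importable module (not an ad hoc notebook cell).
def gbar_decay(distance_um, gbar_soma=0.06, length_const=100.0):
    """Exponentially decaying conductance with distance from the soma."""
    return gbar_soma * math.exp(-distance_um / length_const)
```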
I don't think this is urgent, but the root issue is that this error is thrown when instantiating ...
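One hedged sketch of the fail-fast idea implied here: validate at Network construction time that every user-supplied value (including conductance callables) can be pickled, so the problem surfaces in the parent with a clear message instead of as an MPI timeout. validate_picklable is hypothetical, not an hnn-core function:

```python
import pickle


def validate_picklable(params):
    """Fail fast on unpicklable parameters.

    Intended to run when a network/cell is instantiated, long before
    MPIBackend tries to serialize the whole Network for the child.
    """
    for name, value in params.items():
        try:
            pickle.dumps(value)
        except Exception as err:
            raise ValueError(
                f"parameter {name!r} cannot be pickled for MPI: {err}"
            ) from None


# Plain values pass; a lambda (no importable name) is rejected early.
validate_picklable({"gbar": 0.5})
```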
Hi team, I'm encountering an issue where simulation with MPIBackend gets caught up somewhere when I try to simulate dipoles with a Network I adapted (i.e., it isn't one of the default networks in hnn-core). MPIBackend works fine in the same environment and Jupyter notebook with the example from the documentation, and simulating with the custom network also works fine until I try to use MPIBackend. Any advice on troubleshooting? I can't upload an example notebook here but can provide the full code as needed.
Full text of the error message:
/oscar/home/ddiesbur/new_mpi_hnn_core/lib64/python3.9/site-packages/hnn_core/parallel_backends.py:195: UserWarning: Timeout exceeded while waiting for child process output. Terminating...
warn("Timeout exceeded while waiting for child process output."
RuntimeError Traceback (most recent call last)
Cell In[10], line 4
2 with MPIBackend(n_procs=2, mpi_cmd='mpiexec'):
3 print("Running simulation with loaded Failed stop parameters")
----> 4 FS_dpls_yesmpi = simulate_dipole(FS_net, tstop=300, n_trials=2)
6 for dpl in FS_dpls_yesmpi:
7 dpl.scale(125).smooth(30)
File /oscar/home/ddiesbur/new_mpi_hnn_core/lib64/python3.9/site-packages/hnn_core/dipole.py:100, in simulate_dipole(net, tstop, dt, n_trials, record_vsec, record_isec, postproc)
95 if postproc:
96 warnings.warn('The postproc-argument is deprecated and will be removed'
97 ' in a future release of hnn-core. Please define '
98 'smoothing and scaling explicitly using Dipole methods.',
99 DeprecationWarning)
--> 100 dpls = _BACKEND.simulate(net, tstop, dt, n_trials, postproc)
102 return dpls
File /oscar/home/ddiesbur/new_mpi_hnn_core/lib64/python3.9/site-packages/hnn_core/parallel_backends.py:717, in MPIBackend.simulate(self, net, tstop, dt, n_trials, postproc)
712 print(f"MPI will run {n_trials} trial(s) sequentially by "
713 f"distributing network neurons over {self.n_procs} processes.")
715 env = _get_mpi_env()
--> 717 self.proc, sim_data = run_subprocess(
718 command=self.mpi_cmd, obj=[net, tstop, dt, n_trials], timeout=30,
719 proc_queue=self.proc_queue, env=env, cwd=os.getcwd(),
720 universal_newlines=True)
722 dpls = _gather_trial_data(sim_data, net, n_trials, postproc)
723 return dpls
File /oscar/home/ddiesbur/new_mpi_hnn_core/lib64/python3.9/site-packages/hnn_core/parallel_backends.py:233, in run_subprocess(command, obj, timeout, proc_queue, *args, **kwargs)
229 warn("Could not kill python subprocess: PID %d" % proc.pid)
231 if not proc.returncode == 0:
232 # simulation failed with a numeric return code
--> 233 raise RuntimeError("MPI simulation failed. Return code: %d" %
234 proc.returncode)
236 child_data = _process_child_data(proc_data_bytes, data_len)
238 # clean up the queue
RuntimeError: MPI simulation failed. Return code: 143