Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Embedded Python interpreter spawns multiple forks of ngen/test_routing_pybind on macOS #505

Open
mattw-nws opened this issue Mar 28, 2023 · 6 comments

Comments

@mattw-nws
Copy link
Contributor

This commit in PR #502 was required to make the t-route integration test pass on the macOS runner. Without it, the test executable recursively forked itself. This was reproduced on a Mac (macOS 13, Python 3.11) and the behavior was the same when trying to run an ngen executable with routing with cpu_pool > 1. This was tested with parallel_compute_method of both by-subnetwork-jit-clustered and by-network. Also, an attempt was made after changing the backend in t-route from loky to threading, but the behavior persisted. This could be related to Python 3.11 or the new PyBind11 version 2.10.

Current behavior

See above. It seems as though joblib is forking the current executable which it expects to be Python, but in this case is not. Or if this is supposed to work, then the executable's main is getting executed instead of whatever other starting point is supposed to begin executing.

Expected behavior

Parallelization in t-route seemed to work previously (ngen branch) without forking multiple test_routing_pybind or ngen processes. This seems to be Mac-specific, Linux is not doing this.

Steps to replicate behavior (include URLs)

  1. Run the code in PR Changes to integrate t-route master branch #502 with contemporary t-route master branch but change the cpu_pool parameter in the test YAML file to 2 or more.

Screenshots

image

@aaraney
Copy link
Member

aaraney commented Mar 29, 2023

This smells familiar. Python 3.8 changed the semantics of new processes creation on macOS. Previously this was accomplished using fork() exec(), but now spawn() is used.

Changed in version 3.8: On macOS, the spawn start method is now the default. The fork start method should be considered unsafe as it can lead to crashes of the subprocess. See bpo-33725.

source

If possible, I think the easiest way to determine if spawn() is the issue is to try this using python 3.7.

@mattw-nws
Copy link
Contributor Author

spawn
... Available on Unix and Windows. The default on Windows and macOS.

fork ...
Available on Unix only. The default on Unix.

Uh, so is macOS a subset of Unix, or not? 😵

@christophertubbs
Copy link
Contributor

MacOS was handled like Unix in the past since it made... you know... sense (like you pointed out), but Apple changed up how processes are handled several years ago (I can't remember if they chucked pthreads out or something). As a result, you had to set the default process creation mode or else the whole thing would/could crash and burn. The devs behind Python later rolled their eyes and changed how it's handled on mac by default so... you know... programs work.

@mattw-nws
Copy link
Contributor Author

I am guessing this caveat also applies to binaries that embed the Python interpreter with pybind11 (or CPython for that matter):

Warning The 'spawn' and 'forkserver' start methods cannot currently be used with “frozen” executables (i.e., binaries produced by packages like PyInstaller and cx_Freeze) on Unix. The 'fork' start method does work.

@mattw-nws
Copy link
Contributor Author

HMM! This is maybe what will need to be done...or part of it at least.
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.set_executable

multiprocessing.set_executable(executable)
Set the path of the Python interpreter to use when starting a child process. (By default sys.executable is used). Embedders will probably need to do some thing like
set_executable(os.path.join(sys.exec_prefix, 'pythonw.exe'))
before they can create child processes.

@mattw-nws
Copy link
Contributor Author

This has cropped up again, so it appears the cpu_pool: 1 workaround has stopped working (?). Going to disable this test on macOS for now, we've already got this issue open to fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants