Chronos distributed forecaster gets core dump with ray 1.8.0 #3742

Closed
shanyu-sys opened this issue Dec 15, 2021 · 3 comments

Comments

@shanyu-sys
Contributor

With Ray 1.8.0, when the chronos/models and chronos/forecaster tests run in one pytest process, I get a segmentation fault as soon as the distributed LSTM forecaster test starts.

It works fine when I split chronos/models and chronos/forecaster into two processes.
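
The split described above can be sketched as a small runner that starts one fresh interpreter per test directory, so state left behind by one suite cannot reach the next. This is only an illustration, not the project's actual `run-pytests.sh`; the helper names `build_commands` and `run_isolated` are hypothetical.

```python
import subprocess
import sys
from typing import List, Optional


def build_commands(test_dirs: List[str], deselect: Optional[str] = None) -> List[List[str]]:
    """Return one pytest command line per test directory."""
    commands = []
    for test_dir in test_dirs:
        cmd = [sys.executable, "-m", "pytest", "-v", test_dir]
        if deselect:
            # Same meaning as the -k expression passed to pytest by hand.
            cmd += ["-k", deselect]
        commands.append(cmd)
    return commands


def run_isolated(test_dirs: List[str], deselect: Optional[str] = None) -> int:
    """Run each directory in its own process; return the first nonzero exit code, else 0."""
    for cmd in build_commands(test_dirs, deselect):
        # On POSIX, a child killed by a signal (e.g. SIGSEGV) reports a negative return code.
        returncode = subprocess.run(cmd).returncode
        if returncode != 0:
            return returncode
    return 0
```

Calling `run_isolated(["test/bigdl/chronos/model", "test/bigdl/chronos/forecaster"], deselect="not test_forecast_tcmf_distributed")` would run the same selection as the single command in the log below, but as two separate processes.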

Detailed logs:

test/bigdl/chronos/forecaster/test_arima_forecaster.py::TestChronosModelARIMAForecaster::test_arima_forecaster_fit_eval_pred PASSED                    [ 47%]
test/bigdl/chronos/forecaster/test_arima_forecaster.py::TestChronosModelARIMAForecaster::test_arima_forecaster_runtime_error PASSED                    [ 48%]
test/bigdl/chronos/forecaster/test_arima_forecaster.py::TestChronosModelARIMAForecaster::test_arima_forecaster_save_restore PASSED                     [ 50%]
test/bigdl/chronos/forecaster/test_arima_forecaster.py::TestChronosModelARIMAForecaster::test_arima_forecaster_shape_error PASSED                      [ 51%]
test/bigdl/chronos/forecaster/test_lstm_forecaster.py::TestChronosModelLSTMForecaster::test_lstm_forecaster_distributed Fatal Python error: Segmentation fault

Thread 0x00007fc4491fd700 (most recent call first):
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/selectors.py", line 415 in select
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/socketserver.py", line 232 in serve_forever
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/threading.py", line 870 in run
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x00007fc466ffd700 (most recent call first):

Thread 0x00007fc4677fe700 (most recent call first):
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/threading.py", line 300 in wait
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/threading.py", line 552 in wait
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/tqdm/_monitor.py", line 60 in run
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x00007fc467fff700 (most recent call first):
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/multiprocessing/pool.py", line 470 in _handle_results
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/threading.py", line 870 in run
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x00007fc490ff9700 (most recent call first):
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/multiprocessing/pool.py", line 422 in _handle_tasks
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/threading.py", line 870 in run
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x00007fc4917fa700 (most recent call first):
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/multiprocessing/pool.py", line 413 in _handle_workers
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/threading.py", line 870 in run
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x00007fc491ffb700 (most recent call first):
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/multiprocessing/pool.py", line 110 in worker
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/threading.py", line 870 in run
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x00007fc4927fc700 (most recent call first):
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/multiprocessing/pool.py", line 110 in worker
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/threading.py", line 870 in run
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x00007fc492ffd700 (most recent call first):
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/multiprocessing/pool.py", line 110 in worker
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/threading.py", line 870 in run
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x00007fc4937fe700 (most recent call first):
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/multiprocessing/pool.py", line 110 in worker
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/threading.py", line 870 in run
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/threading.py", line 890 in _bootstrap

Current thread 0x00007fc7848eb740 (most recent call first):
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/ray/state.py", line 84 in _really_init_global_state
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/ray/state.py", line 48 in _check_connected
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/ray/state.py", line 226 in node_table
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/ray/_private/services.py", line 272 in wait_for_node
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/ray/node.py", line 265 in __init__
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/ray/worker.py", line 892 in init
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105 in wrapper
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/bigdl/orca/ray/raycontext.py", line 537 in init
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/bigdl/orca/ray/raycontext.py", line 454 in get
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/bigdl/orca/learn/pytorch/pytorch_ray_estimator.py", line 116 in __init__
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/bigdl/orca/learn/pytorch/estimator.py", line 91 in from_torch
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/bigdl/chronos/forecaster/base_forecaster.py", line 52 in __init__
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/bigdl/chronos/forecaster/lstm_forecaster.py", line 127 in __init__
  File "/home/shan/sources/BigDL/python/chronos/test/bigdl/chronos/forecaster/test_lstm_forecaster.py", line 202 in test_lstm_forecaster_distributed
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/unittest/case.py", line 628 in run
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/unittest/case.py", line 676 in __call__
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/_pytest/unittest.py", line 321 in runtest
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/_pytest/runner.py", line 162 in pytest_runtest_call
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/_pytest/runner.py", line 255 in <lambda>
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/_pytest/runner.py", line 311 in from_call
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/_pytest/runner.py", line 255 in call_runtest_hook
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/_pytest/runner.py", line 215 in call_and_report
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/_pytest/runner.py", line 126 in runtestprotocol
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/_pytest/runner.py", line 109 in pytest_runtest_protocol
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/_pytest/main.py", line 348 in pytest_runtestloop
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/_pytest/main.py", line 323 in _main
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/_pytest/main.py", line 269 in wrap_session
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/_pytest/main.py", line 316 in pytest_cmdline_main
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/_pytest/config/__init__.py", line 163 in main
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/_pytest/config/__init__.py", line 185 in console_main
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/pytest/__main__.py", line 5 in <module>
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/runpy.py", line 85 in _run_code
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/runpy.py", line 193 in _run_module_as_main
run-pytests.sh: line 54: 32905 Segmentation fault      (core dumped) python -m pytest -v test/bigdl/chronos/model test/bigdl/chronos/forecaster -k "not test_forecast_tcmf_distributed"
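
The dump above has the layout that Python's `faulthandler` module writes: one block per thread, with the crashing thread marked `Current thread`. To pull only that thread's frames out of a long log, a small parser along these lines may help. It is a sketch that assumes this exact line layout; `current_thread_frames` is a hypothetical helper, not part of pytest, Ray, or BigDL.

```python
import re
from typing import Dict, List, Union

# Matches frame lines such as:
#   File "/path/to/ray/state.py", line 84 in _really_init_global_state
FRAME_RE = re.compile(r'^\s*File "(?P<path>[^"]+)", line (?P<line>\d+) in (?P<func>.+)$')


def current_thread_frames(dump: str) -> List[Dict[str, Union[str, int]]]:
    """Return the frames listed under 'Current thread', innermost first."""
    frames: List[Dict[str, Union[str, int]]] = []
    in_current = False
    for raw in dump.splitlines():
        if raw.startswith("Current thread "):
            in_current = True
            continue
        if in_current:
            match = FRAME_RE.match(raw)
            if match is None:
                # The first non-frame line ends the current thread's block.
                break
            frames.append({
                "path": match.group("path"),
                "line": int(match.group("line")),
                "func": match.group("func"),
            })
    return frames
```

Applied to the log above, the first returned frame would be the `_really_init_global_state` call in `ray/state.py`, reached from `ray.init` through `bigdl/orca/ray/raycontext.py`.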

@shanyu-sys
Contributor Author

Related PR #3741

@shanyu-sys
Contributor Author

It seems the workaround only reduces the frequency of this error.
The failure is intermittent; see http://10.112.231.51:18888/job/BigDL-PRVN-chronos-Python-Spark-2.4-py36-ray-part1/139/consoleFull

@shanyu-sys
Contributor Author

This is not a Chronos-only issue, so we will upgrade Ray directly to 1.9.0.

Closing the issue for now.
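
Until the upgrade lands everywhere, a version guard could keep the distributed tests from running on an older Ray. The sketch below is an assumption-laden illustration: the thread only reports the crash on 1.8.0 and plans a move to 1.9.0, so treating 1.9.0 as the first safe release is a guess, and `parse_release` and `needs_upgrade` are hypothetical helpers.

```python
from typing import Tuple

# Assumption: 1.9.0 is the first release expected to avoid the crash.
FIXED_IN = (1, 9, 0)


def parse_release(version: str) -> Tuple[int, ...]:
    """Turn '1.8.0' into (1, 8, 0), ignoring suffixes such as '.dev0' or 'rc1'."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            # A component with no leading digits (e.g. 'dev0') ends the release number.
            break
        parts.append(int(digits))
    return tuple(parts)


def needs_upgrade(installed: str, minimum: Tuple[int, ...] = FIXED_IN) -> bool:
    """Return True when the installed release is older than the minimum."""
    return parse_release(installed) < minimum
```

In a test module this could feed a skip condition, for example `pytest.mark.skipif(needs_upgrade(ray.__version__), reason="segfault seen with ray 1.8.0")`.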
