Upgrade ray to 1.9.0 #3709

Closed
8 tasks done
shanyu-sys opened this issue Dec 13, 2021 · 6 comments
@shanyu-sys
Contributor

shanyu-sys commented Dec 13, 2021

@shanyu-sys shanyu-sys self-assigned this Dec 13, 2021
@shanyu-sys
Contributor Author

shanyu-sys commented Dec 14, 2021

Install dependencies that need to be upgraded:

  • ray[default] == 1.9.0
  • pandas >= 1.0.5
  • aiohttp == 3.8.1, async-timeout >= 4
  • numpy == 1.19.5
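
The pins above can be checked against an environment before running the tests. The following is only a minimal sketch: the helper names (`parse`, `satisfies`, `check`, `installed`) are made up for illustration, the version parser is deliberately naive (numeric dotted parts only), and `importlib.metadata` assumes Python 3.8 or newer, so on the 3.6/3.7 environments used here the `importlib_metadata` backport would have to stand in.

```python
from importlib import metadata

# Pins copied from the list above, with the operators stated there.
REQUIRED = {
    "ray": ("==", "1.9.0"),
    "pandas": (">=", "1.0.5"),
    "aiohttp": ("==", "3.8.1"),
    "async-timeout": (">=", "4"),
    "numpy": ("==", "1.19.5"),
}


def parse(version):
    """Turn '1.19.5' into (1, 19, 5); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def satisfies(found, op, wanted):
    """Compare two versions after padding the shorter one with zeros."""
    a, b = parse(found), parse(wanted)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a == b if op == "==" else a >= b


def check(installed_versions):
    """Return a list of problems; an empty list means every pin is met."""
    problems = []
    for name, (op, wanted) in REQUIRED.items():
        found = installed_versions.get(name)
        if found is None:
            problems.append(f"{name} is not installed")
        elif not satisfies(found, op, wanted):
            problems.append(f"{name} {found} does not satisfy {op} {wanted}")
    return problems


def installed():
    """Collect the installed versions of the pinned packages, skipping missing ones."""
    versions = {}
    for name in REQUIRED:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass
    return versions
```

Calling `check(installed())` in the target environment returns the list of mismatches, if any.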

@shanyu-sys
Contributor Author

Note that Python 3.7.0 and 3.7.1 may cause a "Can't pickle function objects" error on ray.init (Ray issue).
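
A guard for the two patch releases named above could look like the following. This is a hedged sketch only: `AFFECTED` and `is_affected` are hypothetical names, and the set contains just the two versions mentioned in the note, not a verified list.

```python
import sys

# Patch releases named in the note above as potentially affected.
AFFECTED = {(3, 7, 0), (3, 7, 1)}


def is_affected(version_info=None):
    """Return True if the interpreter is one of the releases named above."""
    info = sys.version_info if version_info is None else version_info
    # Only major, minor, and micro matter for this check.
    return tuple(info[:3]) in AFFECTED


if is_affected():
    print("Python 3.7.0/3.7.1 detected; ray.init may fail with a pickling error.")
```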

@shanyu-sys
Contributor Author

shanyu-sys commented Dec 14, 2021

Jenkins pre-test for ray 1.8.0:

Orca Ray
http://10.112.231.51:18888/job/ZOO-PR-BigDL-Python-Spark-2.4-py36-ray/237/
http://10.112.231.51:18888/job/ZOO-PR-BigDL-Python-Spark-2.4-py37-ray/226/
http://10.112.231.51:18888/job/ZOO-PR-BigDL-Python-Spark-3.1-py36-ray/242/
http://10.112.231.51:18888/job/ZOO-PR-BigDL-Python-Spark-3.1-py37-ray/246/

http://10.112.231.51:18888/job/ZOO-PR-BigDL-Python-ExampleTests-py37-ray/271/
http://10.112.231.51:18888/job/ZOO-PR-BigDL-Python-ExampleTests-py37-ray-Spark3/286/

Orca Ray Horovod
http://10.112.231.51:18888/job/ZOO-PR-BigDL-Python-Spark-2.4-py37-Horovod/262/
http://10.112.231.51:18888/job/ZOO-PR-BigDL-Python-Spark-3.1-py37-Horovod/289/
http://10.112.231.51:18888/job/ZOO-PR-BigDL-Python-ExampleTests-py37-ray-horovod-Spark3/212/

Chronos Ray
http://10.112.231.51:18888/job/BigDL-PRVN-chronos-Python-Spark-2.4-py36-ray-part1/141/
http://10.112.231.51:18888/job/BigDL-PRVN-chronos-Python-Spark-2.4-py36-ray-part2/88/
http://10.112.231.51:18888/job/BigDL-PRVN-chronos-Python-Spark-2.4-py37-ray-part1/136/
http://10.112.231.51:18888/job/BigDL-PRVN-chronos-Python-Spark-2.4-py37-ray-part2/94/

http://10.112.231.51:18888/job/BigDL-PRVN-chronos-Python-Spark-3.1-py36-ray-part1/139/
http://10.112.231.51:18888/job/BigDL-PRVN-chronos-Python-Spark-3.1-py36-ray-part2/107/
http://10.112.231.51:18888/job/BigDL-PRVN-chronos-Python-Spark-3.1-py37-ray-part1/135/
http://10.112.231.51:18888/job/BigDL-PRVN-chronos-Python-Spark-3.1-py37-ray-part2/99/

Chronos Ray Horovod
http://10.112.231.51:18888/job/BigDL-PRNV-chronos-Python-Spark-2.4-py37-Horovod/136/
http://10.112.231.51:18888/job/BigDL-PRVN-chronos-Python-Spark-3.1-py37-Horovod/150/

@shanyu-sys
Contributor Author

With Ray 1.9.0, the unit tests pass, but example tests that use the Ray PyTorch Estimator hang (possibly deadlock) when num_workers != 0 in the PyTorch DataLoader.

https://github.com/intel-analytics/BigDL/blob/6adaef18ca21e14acc507822e06b0bc87dc8385c/python/orca/example/learn/pytorch/fashion_mnist/fashion_mnist.py#L50-L51

The detailed logs are below (I pressed Ctrl+C at the end). It seems Ray fails to return the result stats after the first training epoch.

(PytorchRayWorker pid=25307) INFO:bigdl.orca.learn.pytorch.torch_runner:Finished training epoch 1, stats on rank 0: {'epoch': 1, 'batch_count': 938, 'num_samples': 60000, 'train_loss': 1.8933767430305481, 'last_train_loss': 0.6069512963294983}
^C^CError in sys.excepthook:
Traceback (most recent call last):
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/ray/worker.py", line 1062, in custom_excepthook
    error_message = "".join(traceback.format_tb(tb))
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/traceback.py", line 57, in format_tb
    return extract_tb(tb, limit=limit).format()
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/traceback.py", line 72, in extract_tb
    return StackSummary.extract(walk_tb(tb), limit=limit)
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/traceback.py", line 357, in extract
    filename, lineno, name, lookup_line=False, locals=f_locals))
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/traceback.py", line 243, in __init__
    def __init__(self, filename, lineno, name, *, lookup_line=True,
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/pyspark/context.py", line 270, in signal_handler
    raise KeyboardInterrupt()
KeyboardInterrupt

Original exception was:
Traceback (most recent call last):
  File "/home/shan/sources/BigDL/python/orca/example/learn/pytorch/fashion_mnist/fashion_mnist_tmp.py", line 197, in <module>
    main()
  File "/home/shan/sources/BigDL/python/orca/example/learn/pytorch/fashion_mnist/fashion_mnist_tmp.py", line 181, in main
    stats = orca_estimator.fit(train_data_creator, epochs=epochs, batch_size=batch_size)
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/bigdl/orca/learn/pytorch/pytorch_ray_estimator.py", line 268, in fit
    info=info)
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/bigdl/orca/learn/pytorch/pytorch_ray_estimator.py", line 470, in _train_epochs
    success = check_for_failure(remote_worker_stats)
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/bigdl/orca/learn/pytorch/pytorch_ray_estimator.py", line 51, in check_for_failure
    finished, unfinished = ray.wait(unfinished)
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/ray/worker.py", line 1891, in wait
    fetch_local,
  File "python/ray/_raylet.pyx", line 1367, in ray._raylet.CoreWorker.wait
  File "python/ray/_raylet.pyx", line 154, in ray._raylet.check_status
KeyboardInterrupt
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/ray/node.py", line 987, in _kill_process_type
    wait=wait)
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/ray/node.py", line 1039, in _kill_process_impl
    process.wait(timeout_seconds)
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/subprocess.py", line 1019, in wait
    return self._wait(timeout=timeout)
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/subprocess.py", line 1647, in _wait
    time.sleep(delay)
  File "/home/shan/anaconda3/envs/ray-latest/lib/python3.7/site-packages/pyspark/context.py", line 270, in signal_handler
    raise KeyboardInterrupt()
KeyboardInterrupt
Stopping orca context

Ray 1.8.0 works fine, so we will change the upgrade target to Ray 1.8.0.
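
In the traceback, the driver is blocked in `ray.wait(unfinished)` inside `check_for_failure`, with no timeout. A bounded wait would at least turn the silent hang into a visible timeout while the root cause is investigated. The sketch below is an assumption-laden illustration, not the estimator's actual code: `wait_all` is a hypothetical helper, and the `wait` callable is injected so the logic can be exercised without a Ray cluster. It only assumes that `wait` behaves like `ray.wait`, that is, accepts a list of refs plus a `timeout` keyword and returns `(finished, unfinished)`.

```python
import time


def wait_all(refs, wait, timeout_s=60.0, poll_s=5.0, clock=time.monotonic):
    """Wait until every ref is finished or the overall deadline passes.

    `wait` is expected to behave like ray.wait: called as
    wait(refs, timeout=seconds), it returns (finished, unfinished).
    A non-empty second return value means the deadline was reached.
    """
    deadline = clock() + timeout_s
    finished, unfinished = [], list(refs)
    while unfinished:
        remaining = deadline - clock()
        if remaining <= 0:
            break
        # Poll in short slices so the deadline is checked regularly.
        done, unfinished = wait(unfinished, timeout=min(poll_s, remaining))
        finished.extend(done)
    return finished, unfinished


# Possible use with Ray (not executed here; the 600 s budget is arbitrary):
#   finished, stuck = wait_all(remote_worker_stats, ray.wait, timeout_s=600)
#   if stuck:
#       raise TimeoutError(f"{len(stuck)} worker(s) did not report stats")
```

This does not address why the workers stop reporting; it only keeps the driver from waiting indefinitely.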

@shanyu-sys shanyu-sys changed the title Upgrade ray to 1.9.0 Upgrade ray to 1.8.0 Dec 15, 2021
@shanyu-sys
Contributor Author

There are more critical issues in Ray 1.8.0:

  1. Segmentation fault caused by a memory leak. Related issue: "Chronos distributed forecaster gets core dump with ray 1.8.0" (#3742). The two parameter_server examples in ray_on_spark also fail every time.
  2. Ray processes fail to start up. Related issues: https://github.com/intel-analytics/arda-docker/issues/555 and "[Bug] Exception: The current node has not been updated within 30 seconds, this could happen because of some of the Ray processes failed to startup." (ray-project/ray#19834).

@shanyu-sys
Contributor Author

@jason-dai
Is it OK if we upgrade to Ray 1.9.0 and disallow num_workers > 0 in the PyTorch DataLoader for now?
I will try to reproduce the hang in Ray and report an issue there.
https://github.com/intel-analytics/BigDL/blob/6adaef18ca21e14acc507822e06b0bc87dc8385c/python/orca/example/learn/pytorch/fashion_mnist/fashion_mnist.py#L50-L51
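
One way the proposed restriction might be applied is to normalize the DataLoader keyword arguments before the loader is built. This is a sketch under assumptions: `single_process_loader_kwargs` is a hypothetical helper, not part of BigDL or PyTorch, and it only touches `num_workers`, so other worker-related options a caller passes are left as they are.

```python
import warnings


def single_process_loader_kwargs(loader_kwargs):
    """Return a copy of DataLoader keyword arguments with worker processes disabled."""
    kwargs = dict(loader_kwargs)  # leave the caller's dict untouched
    requested = kwargs.get("num_workers", 0)
    if requested != 0:
        warnings.warn(
            f"num_workers={requested} is not allowed for now; falling back to 0"
        )
    kwargs["num_workers"] = 0
    return kwargs


# Possible use in a data creator (not executed here):
#   from torch.utils.data import DataLoader
#   loader = DataLoader(dataset, **single_process_loader_kwargs(
#       {"batch_size": 32, "num_workers": 4}))
```

Warning instead of raising keeps existing examples running; raising an error would be the stricter alternative if silent fallback is not wanted.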

@shanyu-sys shanyu-sys changed the title Upgrade ray to 1.8.0 Upgrade ray to 1.9.0 Dec 28, 2021