runtime_env not using the right env? #17086

Closed
simon-mo opened this issue Jul 14, 2021 · 6 comments · Fixed by #17101
Labels
bug: Something that is supposed to be working; but isn't
triage: Needs triage (eg: priority, bug/not-bug, and owning component)

Comments

@simon-mo
Contributor

simon-mo commented Jul 14, 2021

➜  /tmp ipython
Python 3.7.7 (default, May  6 2020, 04:59:01)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.25.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import ray
   ...:
   ...: ray.init()
   ...:
   ...: @ray.remote
   ...: class TensorflowWorker:
   ...:     def __init__(self):
   ...:         import tensorflow as tf
   ...:         print("my tf version is", tf.__version__)
   ...:         self.version = tf.__version__
   ...:
   ...:     def get_version(self):
   ...:         return self.version
   ...:
   ...:     def call_other(self, other_worker):
   ...:         tensor = tf.ones([1])
   ...:         print(self.version, "made a tensor", tensor)
   ...:         computed = ray.get(other_worker.add_one.remote())
   ...:         print(self.version, "got result tensor", computed)
   ...:
   ...:     def add_one(self, tensor):
   ...:         print(self.version, "adding one to tensor", tensor)
   ...:         return tensor + 1
   ...:
   ...: tf1 = TensorflowWorker.options(runtime_env={"pip": ["tensorflow==1.15"]}).remote()
   ...: tf2 = TensorflowWorker.options(runtime_env={"pip": ["tensorflow==2.5"]}).remote()
   ...:
   ...: print(ray.get(tf1.get_version.remote()))
   ...: print(ray.get(tf2.get_version.remote()))
   ...:
/Users/simonmo/miniconda3/envs/anyscale/lib/python3.7/site-packages/ray/autoscaler/_private/cli_logger.py:61: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.
  "update your install command.", FutureWarning)
2021-07-14 13:51:05,382	INFO services.py:1274 -- View the Ray dashboard at http://127.0.0.1:8265
(raylet)
(raylet) InvalidArchiveError('Error with archive /Users/simonmo/miniconda3/pkgs/pip-21.1.3-py37hecd8cb5_0.conda.  You probably need to delete and re-download or re-create this file.  Message from libarchive was:\n\nFile is not a zip file',)
(raylet)
(raylet) /Users/simonmo/miniconda3/envs/anyscale/lib/python3.7/site-packages/ray/autoscaler/_private/cli_logger.py:61: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.
(raylet)   "update your install command.", FutureWarning)
(raylet) Traceback (most recent call last):
(raylet)   File "/Users/simonmo/miniconda3/envs/anyscale/lib/python3.7/site-packages/ray/workers/setup_worker.py", line 18, in <module>
(raylet)     setup(remaining_args)
(raylet)   File "/Users/simonmo/miniconda3/envs/anyscale/lib/python3.7/site-packages/ray/workers/setup_runtime_env.py", line 76, in setup
(raylet)     conda_yaml_path, conda_dir)
(raylet)   File "/Users/simonmo/miniconda3/envs/anyscale/lib/python3.7/site-packages/ray/_private/conda.py", line 109, in get_or_create_conda_env
(raylet)     stream_output=True)
(raylet)   File "/Users/simonmo/miniconda3/envs/anyscale/lib/python3.7/site-packages/ray/_private/conda.py", line 153, in exec_cmd
(raylet)     raise ShellCommandException("Non-zero exitcode: %s" % (exit_code))
(raylet) ray._private.conda.ShellCommandException: Non-zero exitcode: 1
2021-07-14 13:51:25,650	WARNING worker.py:1123 -- The actor or task with ID ffffffffffffffff168ba8598d223c2dd05569b701000000 cannot be scheduled right now. It requires {CPU: 1.000000} for placement, but this node only has remaining {16.000000/16.000000 CPU, 26.890025 GiB/26.890025 GiB memory, 13.445012 GiB/13.445012 GiB object_store_memory, 1.000000/1.000000 node:192.168.1.69}
. In total there are 0 pending tasks and 2 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this task or actor because it takes time to install.
(pid=92585) my tf version is 1.15.0
(pid=92602) my tf version is 1.15.0
1.15.0
1.15.0
simon-mo added the bug and triage labels Jul 14, 2021
@simon-mo
Contributor Author

simon-mo commented Jul 14, 2021

Script to reproduce:

import ray

ray.init()

@ray.remote
class TensorflowWorker:
    def __init__(self):
        import tensorflow as tf
        print("my tf version is", tf.__version__)
        self.version = tf.__version__

    def get_version(self):
        return self.version

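    def call_other(self, other_worker):
        # tf is only imported locally in __init__, so import it again here.
        import tensorflow as tf
        tensor = tf.ones([1])
        print(self.version, "made a tensor", tensor)
        # add_one takes the tensor as its argument.
        computed = ray.get(other_worker.add_one.remote(tensor))
        print(self.version, "got result tensor", computed)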

    def add_one(self, tensor):
        print(self.version, "adding one to tensor", tensor)
        return tensor + 1

tf1 = TensorflowWorker.options(runtime_env={"pip": ["tensorflow==1.15"]}).remote()
tf2 = TensorflowWorker.options(runtime_env={"pip": ["tensorflow==2.5"]}).remote()

print(ray.get(tf1.get_version.remote()))
print(ray.get(tf2.get_version.remote()))

@simon-mo
Contributor Author

This is using Ray 1.4.1; checking master next.

@architkulkarni
Contributor

architkulkarni commented Jul 14, 2021

It's probably easier to check on the latest nightly rather than on master, because master requires you to manually include the wheel link in the runtime env. I'll check the latest nightly.

I just ran it on 1.4.1 on macOS and got the following (after deleting unrelated messages):

2021-07-14 14:08:57,268 WARNING worker.py:1123 -- The actor or task with ID ffffffffffffffff07b7e87159cbfa2785c75cf301000000 cannot be scheduled right now. It requires {CPU: 1.000000} for placement, but this node only has remaining {16.000000/16.000000 CPU, 3.924236 GiB/3.924236 GiB memory, 1.962118 GiB/1.962118 GiB object_store_memory, 1.000000/1.000000 node:192.168.1.112}
. In total there are 0 pending tasks and 2 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this task or actor because it takes time to install.
(raylet) Pip subprocess error:
(raylet) Could not find platform independent libraries <prefix>
(raylet) Could not find platform dependent libraries <exec_prefix>
(raylet) Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>]
(raylet) Python path configuration:
(raylet)   PYTHONHOME = (not set)
(raylet)   PYTHONPATH = (not set)
(raylet)   program name = '/tmp/ray/session_2021-07-14_14-08-35_108230_90238/runtime_resources/conda/ray-e4579be93106f8e2b654da9e3abb9b0fd28b64a1/bin/python'
(raylet)   isolated = 0
(raylet)   environment = 1
(raylet)   user site = 1
(raylet)   import site = 1
(raylet)   sys._base_executable = '/tmp/ray/session_2021-07-14_14-08-35_108230_90238/runtime_resources/conda/ray-e4579be93106f8e2b654da9e3abb9b0fd28b64a1/bin/python'
(raylet)   sys.base_prefix = '/tmp/ray/session_2021-07-14_14-08-35_108230_90238/runtime_resources/conda/ray-e4579be93106f8e2b654da9e3abb9b0fd28b64a1'
(raylet)   sys.base_exec_prefix = '/tmp/ray/session_2021-07-14_14-08-35_108230_90238/runtime_resources/conda/ray-e4579be93106f8e2b654da9e3abb9b0fd28b64a1'
(raylet)   sys.executable = '/tmp/ray/session_2021-07-14_14-08-35_108230_90238/runtime_resources/conda/ray-e4579be93106f8e2b654da9e3abb9b0fd28b64a1/bin/python'
(raylet)   sys.prefix = '/tmp/ray/session_2021-07-14_14-08-35_108230_90238/runtime_resources/conda/ray-e4579be93106f8e2b654da9e3abb9b0fd28b64a1'
(raylet)   sys.exec_prefix = '/tmp/ray/session_2021-07-14_14-08-35_108230_90238/runtime_resources/conda/ray-e4579be93106f8e2b654da9e3abb9b0fd28b64a1'
(raylet)   sys.path = [
(raylet)     '/tmp/ray/session_2021-07-14_14-08-35_108230_90238/runtime_resources/conda/ray-e4579be93106f8e2b654da9e3abb9b0fd28b64a1/lib/python38.zip',
(raylet)     '/tmp/ray/session_2021-07-14_14-08-35_108230_90238/runtime_resources/conda/ray-e4579be93106f8e2b654da9e3abb9b0fd28b64a1/lib/python3.8',
(raylet)     '/tmp/ray/session_2021-07-14_14-08-35_108230_90238/runtime_resources/conda/ray-e4579be93106f8e2b654da9e3abb9b0fd28b64a1/lib/lib-dynload',
(raylet)   ]
(raylet) Fatal Python error: init_fs_encoding: failed to get the Python codec of the filesystem encoding
(raylet) Python runtime state: core initialized
(raylet) ModuleNotFoundError: No module named 'encodings'
(raylet) 
(raylet) Current thread 0x0000000111f00e00 (most recent call first):
(raylet) <no Python frame>
(raylet) 
(raylet) 
(raylet) CondaEnvException: Pip failed
(raylet) 
(raylet) Traceback (most recent call last):
(raylet)   File "/Users/archit/anaconda3/envs/ray141/lib/python3.8/site-packages/ray/workers/setup_worker.py", line 18, in <module>
(raylet)     setup(remaining_args)
(raylet)   File "/Users/archit/anaconda3/envs/ray141/lib/python3.8/site-packages/ray/workers/setup_runtime_env.py", line 75, in setup
(raylet)     conda_env_name = get_or_create_conda_env(
(raylet)   File "/Users/archit/anaconda3/envs/ray141/lib/python3.8/site-packages/ray/_private/conda.py", line 104, in get_or_create_conda_env
(raylet)     exec_cmd(
(raylet)   File "/Users/archit/anaconda3/envs/ray141/lib/python3.8/site-packages/ray/_private/conda.py", line 153, in exec_cmd
(raylet)     raise ShellCommandException("Non-zero exitcode: %s" % (exit_code))
(raylet) ray._private.conda.ShellCommandException: Non-zero exitcode: 1
(raylet) 
(raylet) AssertionError('Prefix record insertion error: a record with name ca-certificates already exists in the prefix. This is a bug in conda. Please report it at https://github.com/conda/conda/issues')
(raylet) ()
(raylet) 
(raylet) Traceback (most recent call last):
(raylet)   File "/Users/archit/anaconda3/envs/ray141/lib/python3.8/site-packages/ray/workers/setup_worker.py", line 18, in <module>
(raylet)     setup(remaining_args)
(raylet)   File "/Users/archit/anaconda3/envs/ray141/lib/python3.8/site-packages/ray/workers/setup_runtime_env.py", line 75, in setup
(raylet)     conda_env_name = get_or_create_conda_env(
(raylet)   File "/Users/archit/anaconda3/envs/ray141/lib/python3.8/site-packages/ray/_private/conda.py", line 104, in get_or_create_conda_env
(raylet)     exec_cmd(
(raylet)   File "/Users/archit/anaconda3/envs/ray141/lib/python3.8/site-packages/ray/_private/conda.py", line 153, in exec_cmd
(raylet)     raise ShellCommandException("Non-zero exitcode: %s" % (exit_code))
(raylet) ray._private.conda.ShellCommandException: Non-zero exitcode: 1
(raylet) 
(raylet) 
(raylet) Pip subprocess error:
(raylet) Traceback (most recent call last):
(raylet)   File "/tmp/ray/session_2021-07-14_14-08-35_108230_90238/runtime_resources/conda/ray-b9f4aeb5cc4425213e1b94e2cde12760fd04a379/lib/python3.8/runpy.py", line 194, in _run_module_as_main
(raylet)   File "/tmp/ray/session_2021-07-14_14-08-35_108230_90238/runtime_resources/conda/ray-b9f4aeb5cc4425213e1b94e2cde12760fd04a379/lib/python3.8/runpy.py", line 87, in _run_code
(raylet)   File "/tmp/ray/session_2021-07-14_14-08-35_108230_90238/runtime_resources/conda/ray-b9f4aeb5cc4425213e1b94e2cde12760fd04a379/lib/python3.8/site-packages/pip/__main__.py", line 29, in <module>
(raylet) ModuleNotFoundError: No module named 'pip._internal.cli'
(raylet) 
(raylet) 
(raylet) CondaEnvException: Pip failed
(raylet) 
(raylet) Traceback (most recent call last):
(raylet)   File "/Users/archit/anaconda3/envs/ray141/lib/python3.8/site-packages/ray/workers/setup_worker.py", line 18, in <module>
(raylet)     setup(remaining_args)
(raylet)   File "/Users/archit/anaconda3/envs/ray141/lib/python3.8/site-packages/ray/workers/setup_runtime_env.py", line 75, in setup
(raylet)     conda_env_name = get_or_create_conda_env(
(raylet)   File "/Users/archit/anaconda3/envs/ray141/lib/python3.8/site-packages/ray/_private/conda.py", line 104, in get_or_create_conda_env
(raylet)     exec_cmd(
(raylet)   File "/Users/archit/anaconda3/envs/ray141/lib/python3.8/site-packages/ray/_private/conda.py", line 153, in exec_cmd
(raylet)     raise ShellCommandException("Non-zero exitcode: %s" % (exit_code))
(raylet) ray._private.conda.ShellCommandException: Non-zero exitcode: 1
(raylet) 
(raylet) AssertionError('Prefix record insertion error: a record with name ca-certificates already exists in the prefix. This is a bug in conda. Please report it at https://github.com/conda/conda/issues')
(raylet) ()
(raylet) 
(raylet) Traceback (most recent call last):
(raylet)   File "/Users/archit/anaconda3/envs/ray141/lib/python3.8/site-packages/ray/workers/setup_worker.py", line 18, in <module>
(raylet)     setup(remaining_args)
(raylet)   File "/Users/archit/anaconda3/envs/ray141/lib/python3.8/site-packages/ray/workers/setup_runtime_env.py", line 75, in setup
(raylet)     conda_env_name = get_or_create_conda_env(
(raylet)   File "/Users/archit/anaconda3/envs/ray141/lib/python3.8/site-packages/ray/_private/conda.py", line 104, in get_or_create_conda_env
(raylet)     exec_cmd(
(raylet)   File "/Users/archit/anaconda3/envs/ray141/lib/python3.8/site-packages/ray/_private/conda.py", line 153, in exec_cmd
(raylet)     raise ShellCommandException("Non-zero exitcode: %s" % (exit_code))
(raylet) ray._private.conda.ShellCommandException: Non-zero exitcode: 1
(raylet)   File "/Users/archit/anaconda3/envs/ray141/lib/python3.8/site-packages/ray/workers/default_worker.py", line 107
(raylet)     f"{ray_constants.LOGGING_ROTATE_BYTES} bytes.")
(raylet)     ^
(raylet) SyntaxError: invalid syntax
(raylet)   File "/Users/archit/anaconda3/envs/ray141/lib/python3.8/site-packages/ray/workers/default_worker.py", line 107
(raylet)     f"{ray_constants.LOGGING_ROTATE_BYTES} bytes.")
(raylet)     ^
(raylet) SyntaxError: invalid syntax

The "invalid syntax" error repeats indefinitely. The failing line is an f-string, which needs Python 3.6 or newer, so it seems like an older interpreter (Python 2?) is being used somehow.
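
One way to check that hypothesis would be to have the failing process report which interpreter it is actually running. The following is only a sketch: `describe_interpreter` is a hypothetical helper, not part of Ray, and it deliberately avoids f-strings so that it would still run on an interpreter older than Python 3.6.

```python
import sys


def describe_interpreter():
    """Return basic facts about the interpreter executing this code."""
    version = tuple(sys.version_info[:3])
    return {
        "executable": sys.executable,
        "version": version,
        # f-strings were added in Python 3.6; older interpreters raise
        # SyntaxError when they compile a file that contains one.
        "supports_fstrings": version >= (3, 6),
    }


if __name__ == "__main__":
    info = describe_interpreter()
    print("executable: %s" % info["executable"])
    print("version: %s" % ".".join(str(part) for part in info["version"]))
    print("supports f-strings: %s" % info["supports_fstrings"])
```

If the reported executable is not the one inside the conda environment under `runtime_resources/conda/`, that would suggest the worker fell back to a different Python after the environment setup failed.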

@simon-mo
Contributor Author

Error on master:

2021-07-14 14:21:13,020	INFO services.py:1247 -- View the Ray dashboard at http://127.0.0.1:8265
(raylet)
(raylet) [Errno 2] No such file or directory: '/Users/simonmo/miniconda3/pkgs/libffi-3.2.1-h0a44026_1007.conda'
(raylet)
(raylet)
(raylet) CondaValueError: prefix already exists: /tmp/ray/session_2021-07-14_14-21-10_483685_21471/runtime_resources/conda/ray-0242ea1e6d00588fc78266212fa116656d02b625
(raylet)
(raylet)
(raylet) CondaValueError: prefix already exists: /tmp/ray/session_2021-07-14_14-21-10_483685_21471/runtime_resources/conda/ray-0242ea1e6d00588fc78266212fa116656d02b625
(raylet)
(raylet) Traceback (most recent call last):
(raylet)   File "/Users/simonmo/Desktop/ray/ray/python/ray/workers/default_worker.py", line 8, in <module>
(raylet)     import ray
(raylet) ModuleNotFoundError: No module named 'ray'
(raylet) [2021-07-14 14:21:28,621 E 21484 7063843] agent_manager.cc:122: Failed to create runtime env: {"_ray_commit": "{{RAY_COMMIT_SHA}}", "conda": null, "env_vars": null, "pip": "tensorflow==1.15\n", "working_dir": null}, error message: Non-zero exitcode: 1
(raylet) [2021-07-14 14:21:29,761 E 21484 7063843] agent_manager.cc:122: Failed to create runtime env: {"_ray_commit": "{{RAY_COMMIT_SHA}}", "conda": null, "env_vars": null, "pip": "tensorflow==1.15\n", "working_dir": null}, error message: Non-zero exitcode: 1
2021-07-14 14:21:33,290	WARNING worker.py:1127 -- The actor or task with ID ffffffffffffffff25cc7b8bef95b549f2c5cdb601000000 cannot be scheduled right now. It requires {CPU: 1.000000} for placement, but this node only has remaining {16.000000/16.000000 CPU, 25.913592 GiB/25.913592 GiB memory, 12.956796 GiB/12.956796 GiB object_store_memory, 1.000000/1.000000 node:192.168.1.69}
. In total there are 0 pending tasks and 2 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this task or actor because it takes time to install.
(raylet) Traceback (most recent call last):
(raylet)   File "/Users/simonmo/Desktop/ray/ray/python/ray/workers/default_worker.py", line 8, in <module>
(raylet)     import ray
(raylet) ModuleNotFoundError: No module named 'ray'
(raylet) Traceback (most recent call last):
(raylet)   File "/Users/simonmo/Desktop/ray/ray/python/ray/workers/default_worker.py", line 8, in <module>
(raylet)     import ray
(raylet) ModuleNotFoundError: No module named 'ray'

@architkulkarni
Contributor

Just tested on the nightly. If you make the installs non-concurrent like this:

tf1 = TensorflowWorker.options(runtime_env={"pip": ["tensorflow==1.15"]}).remote()
print(ray.get(tf1.get_version.remote()))
tf2 = TensorflowWorker.options(runtime_env={"pip": ["tensorflow==2.5"]}).remote()
print(ray.get(tf2.get_version.remote()))

then it succeeds. However, if you use the original script, the concurrent installs cause an issue:

(raylet) 
(raylet) 
(raylet) ERROR conda.core.link:_execute(700): An error occurred while installing package 'defaults::pip-21.1.3-py36hecd8cb5_0'.
(raylet) Pip subprocess error:
(raylet) Traceback (most recent call last):
(raylet)   File "/tmp/ray/session_2021-07-14_15-19-58_268634_3871/runtime_resources/conda/ray-28b9a5cc80f9bf272b4b1427b712ebdc49d213b1/lib/python3.6/runpy.py", line 193, in _run_module_as_main
(raylet)   File "/tmp/ray/session_2021-07-14_15-19-58_268634_3871/runtime_resources/conda/ray-28b9a5cc80f9bf272b4b1427b712ebdc49d213b1/lib/python3.6/runpy.py", line 85, in _run_code
(raylet)   File "/tmp/ray/session_2021-07-14_15-19-58_268634_3871/runtime_resources/conda/ray-28b9a5cc80f9bf272b4b1427b712ebdc49d213b1/lib/python3.6/site-packages/pip/__main__.py", line 29, in <module>
(raylet)     from pip._internal.cli.main import main as _main
(raylet)   File "/tmp/ray/session_2021-07-14_15-19-58_268634_3871/runtime_resources/conda/ray-28b9a5cc80f9bf272b4b1427b712ebdc49d213b1/lib/python3.6/site-packages/pip/_internal/cli/main.py", line 3, in <module>
(raylet)     import locale
(raylet) ModuleNotFoundError: No module named 'locale'
(raylet) 
(raylet) 
(raylet) CondaEnvException: Pip failed
(raylet) 
2021-07-14 15:20:19,472 WARNING worker.py:1127 -- The actor or task with ID ffffffffffffffff3f2291c41bdc24333da34b4901000000 cannot be scheduled right now. It requires {CPU: 1.000000} for placement, but this node only has remaining {16.000000/16.000000 CPU, 4.571990 GiB/4.571990 GiB memory, 2.285995 GiB/2.285995 GiB object_store_memory, 1.000000/1.000000 node:192.168.1.112}
. In total there are 0 pending tasks and 2 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this task or actor because it takes time to install.
(raylet) 
(raylet) CondaError: Cannot link a source that does not exist. /tmp/ray/session_2021-07-14_15-19-58_268634_3871/runtime_resources/conda/ray-28b9a5cc80f9bf272b4b1427b712ebdc49d213b1/.condatmp/6857f80b-0e48-4b62-a198-c9db0f252b56
(raylet) Running `conda clean --packages` may resolve your problem.
(raylet) ()
(raylet) 
(raylet)   File "/Users/archit/anaconda3/envs/nightly36/lib/python3.6/site-packages/ray/workers/default_worker.py", line 107
(raylet)     f"{ray_constants.LOGGING_ROTATE_BYTES} bytes.")
(raylet)     ^
(raylet) SyntaxError: invalid syntax
(raylet) [2021-07-14 15:20:23,816 E 3914 29991] agent_manager.cc:122: Failed to create runtime env: {"_ray_commit": "0f79ebbd750882840113a243035066d4f8f7bea1", "conda": null, "env_vars": null, "pip": "tensorflow==2.5\n", "working_dir": null}, error message: Non-zero exitcode: 1
(raylet) 
(raylet)   File "/Users/archit/anaconda3/envs/nightly36/lib/python3.6/site-packages/ray/workers/default_worker.py", line 107
(raylet)     f"{ray_constants.LOGGING_ROTATE_BYTES} bytes.")
(raylet)     ^
(raylet) SyntaxError: invalid syntax

There's probably an issue with how we're using FileLock to prevent concurrent installs.

architkulkarni self-assigned this Jul 14, 2021
@architkulkarni
Contributor

According to https://stackoverflow.com/questions/58210335/is-conda-install-a-thread-safe-operation, conda installs should not be run concurrently, even when they target different conda environments. Currently the lock we use is per-env; I'll change it to a global lock for all conda installs and see if that fixes the problem.
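
To illustrate the idea of a single global lock, here is a minimal sketch. It is not the change that was made in Ray: it uses `fcntl.flock` from the standard library (Unix only) as a stand-in for the FileLock mentioned above, and `GLOBAL_LOCK_PATH`, `global_install_lock` and `install_env` are hypothetical names chosen for the example.

```python
import contextlib
import fcntl
import os
import tempfile

# Hypothetical location: one lock file shared by every conda install on the node.
GLOBAL_LOCK_PATH = os.path.join(tempfile.gettempdir(), "conda-install-global.lock")


@contextlib.contextmanager
def global_install_lock(lock_path=GLOBAL_LOCK_PATH):
    """Hold an exclusive lock on lock_path for the duration of the block."""
    # Each call opens its own file description, so the lock excludes other
    # processes and also other threads in the same process.
    with open(lock_path, "a") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def install_env(env_name, create_env):
    """Run create_env(env_name) while holding the global lock.

    create_env is whatever callable actually builds the environment; it is
    passed in so the sketch does not depend on conda being installed.
    """
    with global_install_lock():
        return create_env(env_name)
```

A per-env variant would derive the lock path from `env_name`, so installs of two different environments would not exclude each other, which is the situation described above. With one shared path, every install waits for the previous one to finish regardless of which environment it targets.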
