Description
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu
- Ray installed from (source or binary): source
- Ray version: 0.6.1
- Python version: 3.6.3
Starting the object store with 50GB on an m5.4xlarge instance
In [1]: import ray
In [2]: ray.init(object_store_memory_mb=50*1000)
WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
Process STDOUT and STDERR is being redirected to /tmp/ray/session_2018-12-31_23-58-25_28453/logs.
Waiting for redis server at 127.0.0.1:14322 to respond...
Waiting for redis server at 127.0.0.1:24609 to respond...
WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 33008.238592MB available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
Starting the Plasma object store with 50.0GB memory using /tmp.
E1231 23:58:25.641736 28607 io.cc:167] Connection to IPC socket failed for pathname /tmp/ray/session_2018-12-31_23-58-25_28453/sockets/plasma_store, retrying 50 more times
This doesn't immediately fail, because we think that the machine has 66GB of memory:
In [3]: ray.utils.get_system_memory_bytes() // 10**9
Out[3]: 66
However, the plasma store fails to start because the memory check in Arrow appears to think we only have 42GB available. Note that I'm passing in -d /tmp.
~$ ray/build/external/arrow-install/bin/plasma_store_server -s /tmp/store -m 50000000000 -d /tmp
I0101 00:02:24.788489 28722 store.cc:994] Allowing the Plasma store to use up to 50GB of memory.
I0101 00:02:24.788723 28722 store.cc:1024] Starting object store with directory /tmp and huge page support disabled
F0101 00:02:24.788743 28722 store.cc:1039] System memory request exceeds memory available in /tmp. The request is for 50000000000 bytes, and the amount available is 42490683392 bytes. You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
*** Check failure stack trace: ***
@ 0x44212c google::LogMessage::Fail()
@ 0x442070 google::LogMessage::SendToLog()
@ 0x4419b2 google::LogMessage::Flush()
@ 0x4417ad google::LogMessage::~LogMessage()
@ 0x43e5e0 arrow::util::ArrowLog::~ArrowLog()
@ 0x415b04 main
@ 0x7f5039fa4830 __libc_start_main
@ 0x415f09 _start
@ (nil) (unknown)
Aborted (core dumped)
The relevant code for checking memory in the plasma store is https://github.com/apache/arrow/blob/71ccba9b217a7af922d8a69be21ed4db205af741/cpp/src/plasma/store.cc#L1028-L1037. The issue may be that we're checking shared memory size instead of regular memory.
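To illustrate the suspected mismatch, here is a minimal Python sketch; it is not the actual Arrow or Ray code. It assumes the store's check measures free space in the filesystem backing the plasma directory (as the "memory available in /tmp" message suggests), while the Python side looks at total physical memory. All function names below are hypothetical.

```python
import os


def available_bytes_in_directory(path):
    """Free space (in bytes) available to unprivileged users on the
    filesystem that backs `path`. Assumed to approximate the store's check."""
    stats = os.statvfs(path)
    return stats.f_bsize * stats.f_bavail


def physical_memory_bytes():
    """Total physical RAM in bytes (Linux), a rough stand-in for what the
    Python side might report as system memory."""
    return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")


def request_fits(requested_bytes, directory):
    """Hypothetical pre-flight check: would the request pass a
    free-space-based check for `directory`?"""
    return requested_bytes <= available_bytes_in_directory(directory)


if __name__ == "__main__":
    requested = 50 * 1000 * 10**6  # 50*1000 MB, as in the ray.init call above
    for directory in ("/dev/shm", "/tmp"):
        if os.path.isdir(directory):
            print(directory,
                  available_bytes_in_directory(directory) // 10**9, "GB free,",
                  "fits" if request_fits(requested, directory) else "too small")
    print("physical memory:", physical_memory_bytes() // 10**9, "GB")
```

If this assumption holds, a request can be smaller than physical memory and still be rejected, because the two numbers measure different things. On the machine in this report they would presumably correspond to the 42GB and 66GB figures above.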
Note that the actual failure raised by ray.init is:
E1231 23:58:30.547828 28607 io.cc:167] Connection to IPC socket failed for pathname /tmp/ray/session_2018-12-31_23-58-25_28453/sockets/plasma_store, retrying 1 more times
E1231 23:58:30.549942 28453 io.cc:167] Connection to IPC socket failed for pathname /tmp/ray/session_2018-12-31_23-58-25_28453/sockets/plasma_store, retrying 1 more times
F1231 23:58:30.647966 28607 object_store_notification_manager.cc:22] Check failed: _s.ok() Bad status: IOError: Could not connect to socket /tmp/ray/session_2018-12-31_23-58-25_28453/sockets/plasma_store
*** Check failure stack trace: ***
@ 0x5d7cd0 google::LogMessage::Fail()
@ 0x5d7c14 google::LogMessage::SendToLog()
@ 0x5d7556 google::LogMessage::Flush()
@ 0x5d7351 google::LogMessage::~LogMessage()
@ 0x5c5c50 arrow::util::ArrowLog::~ArrowLog()
@ 0x5770e7 ray::ObjectStoreNotificationManager::ObjectStoreNotificationManager()
@ 0x526cfa ray::ObjectManager::ObjectManager()
@ 0x4c0e67 ray::raylet::Raylet::Raylet()
@ 0x4ae3c9 main
@ 0x7fc2edc6d830 __libc_start_main
@ 0x4b3d19 _start
@ (nil) (unknown)
---------------------------------------------------------------------------
ArrowIOError Traceback (most recent call last)
<ipython-input-2-f09714301df8> in <module>()
----> 1 ray.init(object_store_memory_mb=50*1000)
~/ray/python/ray/worker.py in init(redis_address, num_cpus, num_gpus, resources, object_store_memory, object_store_memory_mb, redis_max_memory, redis_max_memory_mb, collect_profiling_data, node_ip_address, object_id_seed, num_workers, local_mode, driver_mode, redirect_worker_output, redirect_output, ignore_reinit_error, num_redis_shards, redis_max_clients, redis_password, plasma_directory, huge_pages, include_webui, driver_id, configure_logging, logging_level, logging_format, plasma_store_socket_name, raylet_socket_name, temp_dir, _internal_config, use_raylet)
1619 _internal_config=_internal_config,
1620 )
-> 1621 ret = _init(ray_params, driver_id=driver_id)
1622 for hook in _post_init_hooks:
1623 hook()
~/ray/python/ray/worker.py in _init(ray_params, driver_id)
1428 mode=ray_params.driver_mode,
1429 worker=global_worker,
-> 1430 driver_id=driver_id)
1431 return ray_params.address_info
1432
~/ray/python/ray/worker.py in connect(ray_params, info, mode, worker, driver_id)
1975 # Create an object store client.
1976 worker.plasma_client = thread_safe_client(
-> 1977 plasma.connect(info["store_socket_name"]))
1978
1979 raylet_socket = info["raylet_socket_name"]
~/ray/python/ray/pyarrow_files/pyarrow/_plasma.pyx in pyarrow._plasma.connect()
~/ray/python/ray/pyarrow_files/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowIOError: Could not connect to socket /tmp/ray/session_2018-12-31_23-58-25_28453/sockets/plasma_store