Closed
Description
System information
- OS: Ubuntu 18.04 LTS
- Ray installed from: pip
- Ray version: 0.6.4
- Python version: 3.6
- Algorithm: APEX DDPG
Describe the problem
Issue with the APEX DDPG default config:
from ray.rllib.agents import ddpg
config = ddpg.apex.APEX_DDPG_DEFAULT_CONFIG.copy()
agent = ddpg.apex.DDPGAgent(config=config, env="my_env")
Error:
Exception: Unknown config parameter `max_weight_sync_delay`
Source code / logs
There were also some errors that I really had no idea where they came from:
(pid=4879) Fatal Python error: Segmentation fault
(pid=4879)
(pid=4879) Stack (most recent call first):
(pid=4879) File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/optimizers/segment_tree.py", line 92 in __setitem__
(pid=4879) File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/optimizers/replay_buffer.py", line 243 in update_priorities
(pid=4879) File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/optimizers/async_replay_optimizer.py", line 298 in update_priorities
(pid=4879) File "/home/llu/.local/lib/python3.6/site-packages/ray/function_manager.py", line 783 in actor_method_executor
(pid=4879) File "/home/llu/.local/lib/python3.6/site-packages/ray/worker.py", line 860 in _process_task
(pid=4879) File "/home/llu/.local/lib/python3.6/site-packages/ray/worker.py", line 961 in _wait_for_and_process_task
(pid=4879) File "/home/llu/.local/lib/python3.6/site-packages/ray/worker.py", line 1010 in main_loop
(pid=4879) File "/home/llu/.local/lib/python3.6/site-packages/ray/workers/default_worker.py", line 111 in <module>
2019-03-19 12:41:45,240 ERROR worker.py:1752 -- A worker died or was killed while executing task 00000000d8c7c804c2804e2918da7442a7995586.
Traceback (most recent call last):
File "/home/llu/c7_triangle/train_apex_DDPG.py", line 88, in <module>
result = agent.train()
File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/agents/agent.py", line 293, in train
result = Trainable.train(self)
File "/home/llu/.local/lib/python3.6/site-packages/ray/tune/trainable.py", line 150, in train
result = self._train()
File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/agents/dqn/dqn.py", line 258, in _train
self.optimizer.step()
File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/optimizers/async_replay_optimizer.py", line 118, in step
sample_timesteps, train_timesteps = self._step()
File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/optimizers/async_replay_optimizer.py", line 215, in _step
samples = ray.get(replay)
File "/home/llu/.local/lib/python3.6/site-packages/ray/worker.py", line 2288, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
Activity
ericl commented on Mar 19, 2019
Hm if you remove this line, does it fix it?
ray/python/ray/rllib/agents/ddpg/apex.py, line 26 (commit e78562b)
For the segfault, perhaps try upgrading TensorFlow? (What version are you using?)
aGiant commented on Mar 20, 2019
I removed "max_weight_sync_delay" and it worked. TensorFlow 1.13.1 was used.
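For reference, here is a minimal sketch of doing that removal in user code instead of editing RLlib's apex.py; whether the key sits at the top level or under the nested "optimizer" dict varies by version, so it is popped from both places, and the env name "my_env" is just a placeholder:

import copy
from ray.rllib.agents import ddpg

# Deep copy so that popping nested keys does not mutate the shared default dict.
config = copy.deepcopy(ddpg.apex.APEX_DDPG_DEFAULT_CONFIG)

# Drop the Ape-X-only key from both plausible locations before the agent validates the config.
config.pop("max_weight_sync_delay", None)
config.get("optimizer", {}).pop("max_weight_sync_delay", None)

agent = ddpg.apex.DDPGAgent(config=config, env="my_env")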
snownus commented on Sep 10, 2019
@ericl, I recently came across a similar issue when calling the Ray model frequently, but even after removing "max_weight_sync_delay", the problem still exists. Do you have any other suggestions?
The error is as below:
2019-09-10 20:54:17,363 ERROR worker.py:1606 -- Possible unhandled error from worker:
ray_RayModel:restore_from_object() (pid=70761, host=i2r)
RayActorError: The actor died unexpectedly before finishing this task.
Apsylem commented on May 17, 2020
ericl commented on May 17, 2020
What is the error that caused the actor to die? It's probably somewhere above in the logs, or in /tmp/ray logs.
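One rough way to surface such worker errors programmatically is sketched below; the /tmp/ray/session_latest/logs layout and the *.err file naming are assumptions that differ across Ray versions, so adjust the pattern to whatever your session directory actually contains:

import glob
import pathlib

# Print every non-empty worker error log from the current Ray session.
for path in sorted(glob.glob("/tmp/ray/session_latest/logs/*.err")):
    text = pathlib.Path(path).read_text(errors="replace")
    if text.strip():
        print("====", path, "====")
        print(text)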
Apsylem commented on May 18, 2020
Inside tmp/ray/../logs:
Some issue with Redis seems to be the cause.
The individual workers die with something like this:
.... W0517 15:52:41.713259 386 reference_count.cc:196] Tried to decrease ref count for nonexistent object ID: d32c701b16435f740a379fa79e7f000800000000
I0517 15:57:16.775820 488 core_worker.cc:567] Node failure 988906d743e7f26672caf90b3b45b3253dea3d35
In redis.out:
282:M 17 May 2020 15:52:40.479 # Server initialized
282:M 17 May 2020 15:52:40.479 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
282:M 17 May 2020 15:52:40.480 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
282:signal-handler (1589731036) Received SIGTERM scheduling shutdown...
282:M 17 May 2020 15:57:16.946 # User requested shutdown...
282:M 17 May 2020 15:57:16.946 # Redis is now ready to exit, bye bye...
In case I got the wrong logs, here are all of the logs:
logs.zip