
APEX DDPG DEFAULT CONFIG issue #4410

Closed
@aGiant

Description

System information

  • OS: Ubuntu 18.04 LTS
  • Ray installed from: pip
  • Ray version: 0.6.4
  • Python version: 3.6
  • Algorithm: APEX DDPG

Describe the problem

APEX DEFAULT CONFIG issue:

from ray.rllib.agents import ddpg

config = ddpg.apex.APEX_DDPG_DEFAULT_CONFIG.copy()
agent = ddpg.apex.DDPGAgent(config=config, env="my_env")

Error:

Exception: Unknown config parameter `max_weight_sync_delay` 
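
For reference, one possible mismatch (an assumption, not confirmed in this thread) is that the APEX-specific defaults are being passed to the plain `DDPGAgent`. A minimal sketch of the APEX pairing, assuming this Ray release also exposes an `ApexDDPGAgent` class in the same module:

from ray.rllib.agents import ddpg

# Sketch only: pair the APEX-DDPG defaults with the APEX agent class rather
# than the plain DDPGAgent. ApexDDPGAgent is assumed (not confirmed here) to
# be exposed by ray.rllib.agents.ddpg in this release; if it is not, see the
# workaround discussed below.
config = ddpg.apex.APEX_DDPG_DEFAULT_CONFIG.copy()
agent = ddpg.apex.ApexDDPGAgent(config=config, env="my_env")
result = agent.train()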

Source code / logs

Some errors that I really had no idea where they came from:

(pid=4879) Fatal Python error: Segmentation fault
(pid=4879) 
(pid=4879) Stack (most recent call first):
(pid=4879)   File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/optimizers/segment_tree.py", line 92 in __setitem__
(pid=4879)   File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/optimizers/replay_buffer.py", line 243 in update_priorities
(pid=4879)   File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/optimizers/async_replay_optimizer.py", line 298 in update_priorities
(pid=4879)   File "/home/llu/.local/lib/python3.6/site-packages/ray/function_manager.py", line 783 in actor_method_executor
(pid=4879)   File "/home/llu/.local/lib/python3.6/site-packages/ray/worker.py", line 860 in _process_task
(pid=4879)   File "/home/llu/.local/lib/python3.6/site-packages/ray/worker.py", line 961 in _wait_for_and_process_task
(pid=4879)   File "/home/llu/.local/lib/python3.6/site-packages/ray/worker.py", line 1010 in main_loop
(pid=4879)   File "/home/llu/.local/lib/python3.6/site-packages/ray/workers/default_worker.py", line 111 in <module>
2019-03-19 12:41:45,240	ERROR worker.py:1752 -- A worker died or was killed while executing task 00000000d8c7c804c2804e2918da7442a7995586.
Traceback (most recent call last):
  File "/home/llu/c7_triangle/train_apex_DDPG.py", line 88, in <module>
    result = agent.train()
  File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/agents/agent.py", line 293, in train
    result = Trainable.train(self)
  File "/home/llu/.local/lib/python3.6/site-packages/ray/tune/trainable.py", line 150, in train
    result = self._train()
  File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/agents/dqn/dqn.py", line 258, in _train
    self.optimizer.step()
  File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/optimizers/async_replay_optimizer.py", line 118, in step
    sample_timesteps, train_timesteps = self._step()
  File "/home/llu/.local/lib/python3.6/site-packages/ray/rllib/optimizers/async_replay_optimizer.py", line 215, in _step
    samples = ray.get(replay)
  File "/home/llu/.local/lib/python3.6/site-packages/ray/worker.py", line 2288, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

Activity

ericl (Contributor) commented on Mar 19, 2019

Hm if you remove this line, does it fix it?

"max_weight_sync_delay": 400,

For the segfault, perhaps try upgrading TensorFlow? (What version are you using?)
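
As a sketch of the first suggestion (dropping the key before constructing the agent), assuming the defaults are deep-copied so the shared dict is not mutated, and allowing for the key sitting either at the top level or under config["optimizer"] depending on the Ray version:

import copy

from ray.rllib.agents import ddpg

# Remove the APEX-only key wherever it happens to live; both pops are safe
# no-ops if the key is absent. deepcopy avoids mutating the shared
# APEX_DDPG_DEFAULT_CONFIG dict in place.
config = copy.deepcopy(ddpg.apex.APEX_DDPG_DEFAULT_CONFIG)
config.pop("max_weight_sync_delay", None)
config.get("optimizer", {}).pop("max_weight_sync_delay", None)
agent = ddpg.apex.DDPGAgent(config=config, env="my_env")
result = agent.train()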

aGiant (Author) commented on Mar 20, 2019

I removed "max_weight_sync_delay" and it worked. TensorFlow 1.13.1 was used.

snownus commented on Sep 10, 2019

@ericl, recently I came across a similar issue when calling the Ray model frequently. Even after removing "max_weight_sync_delay", the problem still exists. Do you have any other suggestions?

The error is as below:
2019-09-10 20:54:17,363 ERROR worker.py:1606 -- Possible unhandled error from worker: ray_RayModel:restore_from_object() (pid=70761, host=i2r) RayActorError: The actor died unexpectedly before finishing this task.

Apsylem commented on May 17, 2020

I have the same issue here. Some models just die without any informative error.

> @ericl, recently I came across a similar issue when calling the Ray model frequently. Even after removing "max_weight_sync_delay", the problem still exists. Do you have any other suggestions?
>
> The error is as below:
> 2019-09-10 20:54:17,363 ERROR worker.py:1606 -- Possible unhandled error from worker: ray_RayModel:restore_from_object() (pid=70761, host=i2r) RayActorError: The actor died unexpectedly before finishing this task.

ericl (Contributor) commented on May 17, 2020

What is the error that caused the actor to die? It's probably somewhere above in the logs, or in /tmp/ray logs.
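
For example, a small helper script (not part of Ray; the session log directory layout is an assumption based on the paths quoted in this thread) can print the tail of the most recently modified files under /tmp/ray so the error that killed the actor can be located:

import glob
import os

# Print the last lines of the most recently modified Ray log files.
# The /tmp/ray/session_*/logs layout is assumed; adjust the glob if it differs.
log_files = glob.glob("/tmp/ray/session_*/logs/*")
for path in sorted(log_files, key=os.path.getmtime, reverse=True)[:10]:
    print("=====", path, "=====")
    try:
        with open(path, errors="replace") as f:
            print("".join(f.readlines()[-20:]))
    except OSError as exc:
        print("could not read:", exc)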

Apsylem commented on May 18, 2020

Inside tmp/ray/../logs:

Some issue with Redis seems to be the cause.

The individual workers die with something like this:

.... W0517 15:52:41.713259 386 reference_count.cc:196] Tried to decrease ref count for nonexistent object ID: d32c701b16435f740a379fa79e7f000800000000
I0517 15:57:16.775820 488 core_worker.cc:567] Node failure 988906d743e7f26672caf90b3b45b3253dea3d35

In redis.out:

282:M 17 May 2020 15:52:40.479 # Server initialized
282:M 17 May 2020 15:52:40.479 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
282:M 17 May 2020 15:52:40.480 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
282:signal-handler (1589731036) Received SIGTERM scheduling shutdown...
282:M 17 May 2020 15:57:16.946 # User requested shutdown...
282:M 17 May 2020 15:57:16.946 # Redis is now ready to exit, bye bye...

In case I got the wrong logs, here are all the logs:
logs.zip
