Actor reconstruction failure on evolution strategies example. #865

@robertnishihara

Description

On 21 machines (m4.16xlarge), I ran

python ray/python/ray/rllib/train.py --alg=EvolutionStrategies --env=Humanoid-v1 --redis-address=<head-node-ip>:6379 --config='{"num_workers": 640, "episodes_per_batch": 10000, "timesteps_per_batch": 100000}'

I let it run for about 12 minutes, and then I ssh'ed to a non-head node and killed the processes with ray stop.

A couple of things happened. Looking at the output from the monitor process, the monitor correctly noted the death of the plasma manager by printing

WARNING:root:Removed b'plasma_manager', client ID e314a7d511b89152e4411a7c48cb11643f3135ea

Then after a while (maybe 5 minutes?), it printed

WARNING:root:Marked 53444 objects as lost.

So iterating over the object table and cleaning it up took a long time. Then it noted the death of the local scheduler, marked some tasks as lost, and recreated the lost actors on other local schedulers.
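As a rough illustration of why this step scales with the table size, here is a minimal sketch (hypothetical names, not Ray's actual code) of removing a dead manager from every object's location set, one entry at a time:

```python
# Hypothetical sketch: cleaning up an object table after a plasma manager
# dies. If the monitor visits every object ID individually rather than
# batching, the cleanup time grows linearly with the number of objects,
# which is consistent with the multi-minute delay observed above.

def mark_objects_lost(object_table, dead_manager_id):
    """Remove a dead manager from each object's location set.

    Returns the number of objects left with no surviving copy.
    """
    num_lost = 0
    for object_id, locations in object_table.items():
        if dead_manager_id in locations:
            locations.discard(dead_manager_id)
            if not locations:  # no replica survives anywhere
                num_lost += 1
    return num_lost

table = {
    "obj1": {"mgr_a", "mgr_b"},  # replicated elsewhere: survives
    "obj2": {"mgr_a"},           # only copy was on the dead manager: lost
}
print(mark_objects_lost(table, "mgr_a"))  # prints 1
```

With tens of thousands of entries (53444 here), per-object round trips to the GCS/Redis would add up quickly, which may explain the delay.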

WARNING:root:Removed b'local_scheduler', client ID f9791e072d22aa4cec81e06afb8f4978dc313e2f
WARNING:root:Marked 17816 tasks as lost.
INFO:root:Actor 95930445632215278aeffe0cd5ffeaef530db273 for driver 9e871150f8878ec1a0676b5205e61444a04a904a was on dead local scheduler f9791e072d22aa4cec81e06afb8f4978dc313e2f. It is being recreated on local scheduler 13aea1e14868be20b802cf4cc8161c4bf536b8ca
INFO:root:Actor 06ddf27dcf0465a99ae4d08f55d3fb31e7c513d1 for driver 9e871150f8878ec1a0676b5205e61444a04a904a was on dead local scheduler f9791e072d22aa4cec81e06afb8f4978dc313e2f. It is being recreated on local scheduler d70d2c7aaf0333534bdee5c001b56190d319f005
...

After all of that, the actor recreation did not succeed. I saw the following error (printed in the background on the driver).

Traceback (most recent call last):
  File "/home/ubuntu/ray/python/ray/workers/default_worker.py", line 87, in <module>
    ray.worker.global_worker)
  File "/home/ubuntu/ray/python/ray/actor.py", line 271, in reconstruct_actor_state
    assert task_spec_info["ReturnObjectIDs"] == task_spec.returns()
AssertionError

The traceback was printed many times, so the failure may have occurred for all of the newly created actors.
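To make the failing check concrete, here is a minimal sketch (illustrative names only, not Ray's actual API) of the kind of equality the assertion enforces: the return object IDs recorded in the task table must match the IDs recomputed from the task spec during reconstruction. If the rebuilt spec derives its IDs from different inputs (e.g. a different task ID), the comparison fails:

```python
# Hypothetical sketch of the assertion in reconstruct_actor_state.
# Return object IDs are assumed to be derived deterministically from the
# task ID; the assert compares the stored IDs against recomputed ones.
import hashlib

def return_object_ids(task_id, num_returns):
    """Derive return object IDs deterministically from a task ID."""
    return [hashlib.sha1(f"{task_id}:{i}".encode()).hexdigest()
            for i in range(num_returns)]

recorded = return_object_ids("task-abc", 2)    # what the task table stored
recomputed = return_object_ids("task-abc", 2)  # same inputs -> same IDs
assert recorded == recomputed                  # this is the expected case

# If reconstruction rebuilds the spec under a different task ID, the
# equality check fails, producing an AssertionError like the one above.
diverged = return_object_ids("task-xyz", 2)
print(recorded == diverged)  # prints False
```

This is only one plausible way the IDs could diverge; the actual cause in Ray's reconstruction path may differ.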
