Different approach to removing RayGetError #3471

Merged: 9 commits, Dec 13, 2018


Contributor

@ericl ericl commented Dec 5, 2018

What do these changes do?

Revise #3224

@ericl
Contributor Author

ericl commented Dec 5, 2018

Example output:

Traceback (most recent call last):
  File "/home/eric/Desktop/ray-private/python/ray/tune/trial_runner.py", line 261, in _process_events
    result = self.trial_executor.fetch_result(trial)
  File "/home/eric/Desktop/ray-private/python/ray/tune/ray_trial_executor.py", line 211, in fetch_result
    result = ray.get(trial_future[0])
  File "/home/eric/Desktop/ray-private/python/ray/worker.py", line 2293, in get
    raise value
ray.worker.RayTaskError: ray_PPOAgent:train() (pid=31593, host=eric-ThinkPad)
  File "/home/eric/Desktop/ray-private/python/ray/rllib/agents/agent.py", line 324, in train
    result = Trainable.train(self)
  File "/home/eric/Desktop/ray-private/python/ray/tune/trainable.py", line 146, in train
    result = self._train()
  File "/home/eric/Desktop/ray-private/python/ray/rllib/agents/ppo/ppo.py", line 122, in _train
    fetches = self.optimizer.step()
  File "/home/eric/Desktop/ray-private/python/ray/rllib/optimizers/multi_gpu_optimizer.py", line 106, in step
    self.train_batch_size)
  File "/home/eric/Desktop/ray-private/python/ray/rllib/agents/ppo/rollout.py", line 31, in collect_samples
    next_sample = ray.get(fut_sample)
ray.worker.RayTaskError: ray_PolicyEvaluator:sample() (pid=31594, host=eric-ThinkPad)
  File "/home/eric/Desktop/ray-private/python/ray/rllib/evaluation/policy_evaluator.py", line 343, in sample
    batches = [self.sampler.get_data()]
  File "/home/eric/Desktop/ray-private/python/ray/rllib/evaluation/sampler.py", line 69, in get_data
    item = next(self.rollout_provider)
  File "/home/eric/Desktop/ray-private/python/ray/rllib/evaluation/sampler.py", line 282, in _env_runner
    active_episodes, clip_actions)
  File "/home/eric/Desktop/ray-private/python/ray/rllib/evaluation/sampler.py", line 463, in _do_policy_eval
    actions, policy.action_space), rnn_out_cols, pi_info_cols)
  File "/home/eric/Desktop/ray-private/python/ray/rllib/evaluation/sampler.py", line 544, in _clip_actions
    raise ValueError("oops")
ValueError: oops
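
The nested layout in the output above (each failing task adds a header with its name, pid, and host, followed by its own frames) can be imitated in plain Python. The following is a minimal sketch, not the code from this PR: `RemoteTaskError`, `run_task`, and `get` are hypothetical stand-ins for `RayTaskError`, a worker executing a task, and `ray.get`, and the pid values are made up.

```python
import traceback


class RemoteTaskError(Exception):
    """Hypothetical stand-in for a task error that carries a remote traceback."""

    def __init__(self, function_name, traceback_str, pid=0, host="localhost"):
        super().__init__(function_name, traceback_str, pid, host)
        self.function_name = function_name
        self.traceback_str = traceback_str
        self.pid = pid
        self.host = host

    def __str__(self):
        lines = self.traceback_str.strip().split("\n")
        # Drop the "Traceback (most recent call last):" header so that
        # nested errors read as one continuous trace.
        if lines and lines[0].startswith("Traceback"):
            lines = lines[1:]
        header = "{} (pid={}, host={})".format(
            self.function_name, self.pid, self.host)
        return "\n".join([header] + lines)


def run_task(name, fn, pid):
    """Run fn like a worker would: capture a failure instead of raising it."""
    try:
        return fn()
    except Exception:
        return RemoteTaskError(name, traceback.format_exc(), pid=pid)


def get(result):
    """Return a task result, or re-raise the error stored in its place."""
    if isinstance(result, RemoteTaskError):
        raise result
    return result


def inner():
    raise ValueError("oops")


def outer():
    # A nested get: the inner failure is re-raised inside the outer task.
    return get(run_task("inner", inner, pid=2))


try:
    get(run_task("outer", outer, pid=1))
except RemoteTaskError as err:
    message = str(err)
    print(message)
```

Because the inner error's string form already contains its own frames, the outer traceback captures them when it is formatted, which is what produces the chained appearance.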

@pcmoritz
Contributor

pcmoritz commented Dec 5, 2018

This is really cool!

Here is a self-contained example:

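import ray
ray.init()

@ray.remote
def f():
    # Importing a nonexistent module makes this task fail on its worker.
    import blubb
    print('hello')

@ray.remote
def g():
    # The nested ray.get re-raises f's error inside g's worker.
    return ray.get(f.remote())

# The driver's ray.get raises the error propagated through g.
ray.get(g.remote())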

@robertnishihara
Collaborator

The (pid=31593, host=eric-ThinkPad) part is really nice.

@ericl
Contributor Author

ericl commented Dec 5, 2018

@robertnishihara To address the concern about suppressing worker exceptions, I added a short delay before printing them. If the driver does not raise a task error before the delay expires, we go ahead and print the worker errors. Otherwise, they are suppressed. In the common case this should do the right thing:

  • If there is a false negative (an error that should have been suppressed is printed anyway), the worst that can happen is that the user sees some extra error messages.
  • If there is a false positive (an error is suppressed when it should not have been), the user misses some unhandled worker errors (but they will still see some error, since the driver raised one).
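
The delay-then-print behaviour described in this comment could look roughly like the sketch below. It is an illustration under stated assumptions, not the code merged in this PR: the class `DelayedWorkerErrorPrinter`, its method names, and the 5 second default are all hypothetical, and the clock is injectable only so the logic can be exercised without sleeping.

```python
import time


class DelayedWorkerErrorPrinter:
    """Hypothetical helper: hold worker errors for a grace period."""

    def __init__(self, delay_s=5.0, clock=time.monotonic, emit=print):
        self.delay_s = delay_s  # assumed grace period, not Ray's value
        self.clock = clock
        self.emit = emit
        self.pending = []  # list of (arrival_time, message)

    def on_worker_error(self, message):
        # Queue the error instead of printing it right away.
        self.pending.append((self.clock(), message))

    def on_driver_task_error(self):
        # The driver already raised a task error, so the queued worker
        # errors would be redundant: drop them.
        self.pending.clear()

    def flush(self):
        # Print every queued error whose grace period has expired.
        now = self.clock()
        still_waiting = []
        for arrived, message in self.pending:
            if now - arrived >= self.delay_s:
                self.emit(message)
            else:
                still_waiting.append((arrived, message))
        self.pending = still_waiting


# Usage with a fake clock so no real waiting is needed.
now = [0.0]
printed = []
printer = DelayedWorkerErrorPrinter(
    delay_s=5.0, clock=lambda: now[0], emit=printed.append)

printer.on_worker_error("worker error A")
printer.on_driver_task_error()  # driver raised in time: A is suppressed
now[0] = 10.0
printer.flush()

printer.on_worker_error("worker error B")
now[0] = 16.0  # no driver error within the delay: B is printed
printer.flush()
```

In this sketch a driver error clears everything queued so far, which errs on the side of suppression; a stricter variant could match errors to tasks before dropping them.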

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9767/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9758/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9768/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9866/
Test FAILed.

@pcmoritz
Contributor

pcmoritz commented Dec 9, 2018

@ericl: Can you update test_actor_creation_node_failure, see https://travis-ci.com/ray-project/ray/jobs/163807495

=================================== FAILURES ===================================
_______________________ test_actor_creation_node_failure _______________________
ray_start_cluster = <ray.test.cluster_utils.Cluster object at 0x11ad184d0>
    def test_actor_creation_node_failure(ray_start_cluster):
        # TODO(swang): Refactor test_raylet_failed, etc to reuse the below code.
        cluster = ray_start_cluster
    
        @ray.remote
        class Child(object):
            def __init__(self, death_probability):
                self.death_probability = death_probability
    
            def ping(self):
                # Exit process with some probability.
                exit_chance = np.random.rand()
                if exit_chance < self.death_probability:
                    sys.exit(-1)
    
        num_children = 100
        # Children actors will die about half the time.
        death_probability = 0.5
    
        children = [Child.remote(death_probability) for _ in range(num_children)]
        while len(cluster.list_all_nodes()) > 1:
            for j in range(3):
                # Submit some tasks on the actors. About half of the actors will
                # fail.
                children_out = [child.ping.remote() for child in children]
                # Wait a while for all the tasks to complete. This should trigger
                # reconstruction for any actor creation tasks that were forwarded
                # to nodes that then failed.
                ready, _ = ray.wait(
                    children_out,
                    num_returns=len(children_out),
                    timeout=5 * 60 * 1000)
                assert len(ready) == len(children_out)
    
                # Replace any actors that died.
                for i, out in enumerate(children_out):
                    try:
                        ray.get(out)
>                   except ray.worker.RayGetError:
E                   AttributeError: 'module' object has no attribute 'RayGetError'
test/component_failures_test.py:411: AttributeError
---------------------------- Captured stderr setup -----------------------------
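
The failure above comes from the test still naming `ray.worker.RayGetError`, which this PR removes. The actual fix is not shown in this thread. One way a test could tolerate both the old and new exception names is sketched below; `task_error_types` is a hypothetical helper, and a `types.SimpleNamespace` stands in for the `ray.worker` module so the sketch runs without Ray installed.

```python
import types


def task_error_types(worker_module):
    """Return the task error classes that exist on the given module.

    Hypothetical helper: looks up both the newer and the older exception
    name and returns whichever ones are defined, as a tuple usable in an
    `except` clause.
    """
    names = ("RayTaskError", "RayGetError")
    found = tuple(
        getattr(worker_module, name)
        for name in names
        if hasattr(worker_module, name)
    )
    if not found:
        raise AttributeError("no known task error class on module")
    return found


# Stand-in for ray.worker after this PR: only RayTaskError exists.
class RayTaskError(Exception):
    pass


fake_worker = types.SimpleNamespace(RayTaskError=RayTaskError)
errors = task_error_types(fake_worker)

try:
    raise RayTaskError("actor died")
except errors as err:
    caught = type(err).__name__
```

If only one version of the library needs to be supported, catching `ray.worker.RayTaskError` directly (the class shown in the example output earlier in this thread) is simpler.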

@ericl
Contributor Author

ericl commented Dec 9, 2018

Done

Contributor

@pcmoritz pcmoritz left a comment

This seems to be working, can be merged when the test is fixed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9893/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9899/
Test FAILed.

@ericl
Contributor Author

ericl commented Dec 9, 2018

Somehow this is causing the Raylet to abort, but only in Python 3 builds.

python/ray/tune/test/trial_runner_test.py::TrialRunnerTest::testFailureRecoveryMaxFailures /Users/travis/.travis/job_stages: line 104: 8753 Abort trap: 6 python -m pytest -v python/ray/tune/test/trial_runner_test.py

The command "python -m pytest -v python/ray/tune/test/trial_runner_test.py" exited with 134.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9909/
Test FAILed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9998/
Test PASSed.

@ericl ericl merged commit 0e00533 into ray-project:master Dec 13, 2018
@robertnishihara
Collaborator

Possibly fixes #1885.

@robertnishihara robertnishihara deleted the teststack branch December 13, 2018 07:39