Different approach to removing RayGetError #3471

ericl · 2018-12-05T01:20:08Z

What do these changes do?

ericl · 2018-12-05T01:21:04Z

Example output:

Traceback (most recent call last):
  File "/home/eric/Desktop/ray-private/python/ray/tune/trial_runner.py", line 261, in _process_events
    result = self.trial_executor.fetch_result(trial)
  File "/home/eric/Desktop/ray-private/python/ray/tune/ray_trial_executor.py", line 211, in fetch_result
    result = ray.get(trial_future[0])
  File "/home/eric/Desktop/ray-private/python/ray/worker.py", line 2293, in get
    raise value
ray.worker.RayTaskError: ray_PPOAgent:train() (pid=31593, host=eric-ThinkPad)
  File "/home/eric/Desktop/ray-private/python/ray/rllib/agents/agent.py", line 324, in train
    result = Trainable.train(self)
  File "/home/eric/Desktop/ray-private/python/ray/tune/trainable.py", line 146, in train
    result = self._train()
  File "/home/eric/Desktop/ray-private/python/ray/rllib/agents/ppo/ppo.py", line 122, in _train
    fetches = self.optimizer.step()
  File "/home/eric/Desktop/ray-private/python/ray/rllib/optimizers/multi_gpu_optimizer.py", line 106, in step
    self.train_batch_size)
  File "/home/eric/Desktop/ray-private/python/ray/rllib/agents/ppo/rollout.py", line 31, in collect_samples
    next_sample = ray.get(fut_sample)
ray.worker.RayTaskError: ray_PolicyEvaluator:sample() (pid=31594, host=eric-ThinkPad)
  File "/home/eric/Desktop/ray-private/python/ray/rllib/evaluation/policy_evaluator.py", line 343, in sample
    batches = [self.sampler.get_data()]
  File "/home/eric/Desktop/ray-private/python/ray/rllib/evaluation/sampler.py", line 69, in get_data
    item = next(self.rollout_provider)
  File "/home/eric/Desktop/ray-private/python/ray/rllib/evaluation/sampler.py", line 282, in _env_runner
    active_episodes, clip_actions)
  File "/home/eric/Desktop/ray-private/python/ray/rllib/evaluation/sampler.py", line 463, in _do_policy_eval
    actions, policy.action_space), rnn_out_cols, pi_info_cols)
  File "/home/eric/Desktop/ray-private/python/ray/rllib/evaluation/sampler.py", line 544, in _clip_actions
    raise ValueError("oops")
ValueError: oops
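
The nested layout in the output above, where each failed task contributes a header with its function name, pid, and host followed by its own traceback, can be imitated without Ray. The following is only a rough sketch: `TaskError` and `run_task` are hypothetical names invented for illustration, and this is not how `ray.worker.RayTaskError` is actually implemented.

```python
import traceback


class TaskError(Exception):
    """Hypothetical stand-in for an error raised on behalf of a failed task."""

    def __init__(self, function_name, traceback_str, pid, host):
        Exception.__init__(self, function_name)
        self.function_name = function_name
        self.traceback_str = traceback_str
        self.pid = pid
        self.host = host

    def __str__(self):
        # Header line in the style "name() (pid=..., host=...)", then the
        # traceback text captured where the task failed.
        header = "{}() (pid={}, host={})".format(
            self.function_name, self.pid, self.host)
        return header + "\n" + self.traceback_str


def run_task(function_name, fn, pid, host):
    """Simulate a worker: run fn and convert any failure into a TaskError."""
    try:
        return fn()
    except Exception:
        # format_exc() includes the message of an inner TaskError, so errors
        # from nested tasks end up embedded in the outer error's text.
        raise TaskError(
            function_name, traceback.format_exc().strip(), pid, host) from None
```

Because the error is an ordinary exception whose message already carries the remote traceback, the caller needs to catch only one exception type.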

pcmoritz · 2018-12-05T01:36:11Z

This is really cool!

Here is a self-contained example:

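import ray
ray.init()

@ray.remote
def f():
    # `blubb` is presumably a nonexistent module, so this import should
    # fail inside the task before anything is printed.
    import blubb
    print('hello')

@ray.remote
def g():
    # The failure in f() should propagate through this nested ray.get().
    return ray.get(f.remote())

# Should raise on the driver, with the tracebacks from g() and f() chained.
ray.get(g.remote())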

robertnishihara · 2018-12-05T02:54:01Z

The (pid=31593, host=eric-ThinkPad) part is really nice.

ericl · 2018-12-05T03:37:09Z

@robertnishihara To address the concern about suppressing worker exceptions, I added a short delay before printing them. If the driver does not raise a task error before the delay expires, we go ahead and print the worker errors; otherwise, they are suppressed. This should make the right thing happen in the common case:

If there is a false negative (an error that should have been suppressed is printed anyway), the worst that can happen is that the user gets some extra error messages.
If there is a false positive (an error is suppressed that should not have been), the user does not see some unhandled worker errors, but they will always see some error.
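
A minimal sketch of the delay-then-print idea described above, not the code in this PR. `DelayedErrorReporter`, its method names, and the 5-second default are assumptions made up for illustration; the clock is injectable so the behavior can be checked without sleeping.

```python
import time


class DelayedErrorReporter:
    """Hypothetical helper that holds worker errors briefly before printing."""

    def __init__(self, delay_s=5.0, clock=time.monotonic):
        self.delay_s = delay_s  # assumed grace period, not Ray's actual value
        self.clock = clock
        self.pending = []       # (deadline, message) pairs awaiting a decision

    def on_worker_error(self, message):
        # Queue the error instead of printing it right away.
        self.pending.append((self.clock() + self.delay_s, message))

    def on_driver_task_error(self):
        # The driver raised a task error itself, so the queued worker errors
        # would be redundant: suppress them.
        self.pending = []

    def flush(self):
        # Return (and forget) the errors whose delay expired without the
        # driver raising; the caller would print these.
        now = self.clock()
        expired = [msg for deadline, msg in self.pending if deadline <= now]
        self.pending = [(d, m) for d, m in self.pending if d > now]
        return expired
```

In this sketch a driver-side error clears everything queued so far, which is the coarse behavior that makes the false-positive case above possible.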

AmplabJenkins · 2018-12-05T04:09:41Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9767/
Test FAILed.

AmplabJenkins · 2018-12-05T04:24:05Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9758/
Test FAILed.

AmplabJenkins · 2018-12-05T06:34:29Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9768/
Test FAILed.

AmplabJenkins · 2018-12-08T08:53:54Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9866/
Test FAILed.

pcmoritz · 2018-12-09T02:15:33Z

@ericl: Can you update test_actor_creation_node_failure? See https://travis-ci.com/ray-project/ray/jobs/163807495

=================================== FAILURES ===================================
_______________________ test_actor_creation_node_failure _______________________
ray_start_cluster = <ray.test.cluster_utils.Cluster object at 0x11ad184d0>
    def test_actor_creation_node_failure(ray_start_cluster):
        # TODO(swang): Refactor test_raylet_failed, etc to reuse the below code.
        cluster = ray_start_cluster
    
        @ray.remote
        class Child(object):
            def __init__(self, death_probability):
                self.death_probability = death_probability
    
            def ping(self):
                # Exit process with some probability.
                exit_chance = np.random.rand()
                if exit_chance < self.death_probability:
                    sys.exit(-1)
    
        num_children = 100
        # Children actors will die about half the time.
        death_probability = 0.5
    
        children = [Child.remote(death_probability) for _ in range(num_children)]
        while len(cluster.list_all_nodes()) > 1:
            for j in range(3):
                # Submit some tasks on the actors. About half of the actors will
                # fail.
                children_out = [child.ping.remote() for child in children]
                # Wait a while for all the tasks to complete. This should trigger
                # reconstruction for any actor creation tasks that were forwarded
                # to nodes that then failed.
                ready, _ = ray.wait(
                    children_out,
                    num_returns=len(children_out),
                    timeout=5 * 60 * 1000)
                assert len(ready) == len(children_out)
    
                # Replace any actors that died.
                for i, out in enumerate(children_out):
                    try:
                        ray.get(out)
>                   except ray.worker.RayGetError:
E                   AttributeError: 'module' object has no attribute 'RayGetError'
test/component_failures_test.py:411: AttributeError
---------------------------- Captured stderr setup -----------------------------

ericl · 2018-12-09T03:10:16Z

Done

pcmoritz

This seems to be working; it can be merged once the test is fixed.

AmplabJenkins · 2018-12-09T04:35:28Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9893/
Test FAILed.

AmplabJenkins · 2018-12-09T10:44:53Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9899/
Test FAILed.

ericl · 2018-12-09T22:30:31Z

Somehow this is causing the Raylet to abort, but only in Python 3 builds.

python/ray/tune/test/trial_runner_test.py::TrialRunnerTest::testFailureRecoveryMaxFailures /Users/travis/.travis/job_stages: line 104: 8753 Abort trap: 6 python -m pytest -v python/ray/tune/test/trial_runner_test.py

travis_time:end:0d09cbc8:start=1544353726673758000,finish=1544353983522119000,duration=256848361000
The command "python -m pytest -v python/ray/tune/test/trial_runner_test.py" exited with 134.

AmplabJenkins · 2018-12-10T01:16:21Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9909/
Test FAILed.

AmplabJenkins · 2018-12-13T02:06:53Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9998/
Test PASSed.

robertnishihara · 2018-12-13T07:39:02Z

Possibly fixes #1885.

try 2

fffe562

ericl mentioned this pull request Dec 5, 2018

[WIP] Remove RayGetError and stop throwing from tasks #3224

Closed

fix

1e4496b

auto suppress caught worker errors

5d6ed1b

dot fix

ericl force-pushed the teststack branch from ae446bb to 5d6ed1b Compare December 5, 2018 03:34

Merge remote-tracking branch 'upstream/master' into teststack

01f4096

task error in test

8981bd1

pcmoritz approved these changes Dec 9, 2018

Merge remote-tracking branch 'upstream/master' into teststack

91e5545

ignore those python warnings

71ab885

ericl added 2 commits December 12, 2018 14:50

Merge remote-tracking branch 'upstream/master' into teststack

737e262

merge lazy actor del

a67d176

ericl merged commit 0e00533 into ray-project:master Dec 13, 2018

robertnishihara deleted the teststack branch December 13, 2018 07:39
