Implement exponential backoff retry mechanism for transport tasks #1837

Merged
Implement exponential backoff retry mechanism for transport tasks
JobProcesses have various tasks they need to execute that require
a transport, and these can fail for various reasons when the
command executed over the transport raises an exception. Examples
are the submission of a job calculation as well as updating its
scheduler state. These tasks may fail for reasons that do not
necessarily mean that the job is unrecoverably lost, such as the
internet connection being temporarily unavailable or the scheduler
simply not responding. Instead of putting the process in an
excepted state, the engine should automatically retry at a later
stage.

Here we implement the exponential_backoff_retry utility: a
coroutine that wraps another function or coroutine, runs it, and
reruns it whenever an exception is caught, waiting an interval
that grows between attempts. Once an exception has been caught as
many times as the maximum number of allowed attempts, it is
reraised.
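
For reference, a minimal sketch of the idea, assuming the interval
doubles after every failed attempt; the actual implementation in
aiida.work.utils may differ in details such as logging and default
parameter values:

    from tornado import gen

    @gen.coroutine
    def exponential_backoff_retry(fct, initial_interval=10.0, max_attempts=5):
        """Run fct until it finishes without raising, reraising its
        exception once max_attempts calls have failed."""
        interval = initial_interval
        for attempt in range(max_attempts):
            try:
                # maybe_future lets fct be a plain function or a coroutine
                result = yield gen.maybe_future(fct())
            except Exception:
                if attempt == max_attempts - 1:
                    raise  # allowed attempts exhausted: reraise
                yield gen.sleep(interval)
                interval *= 2  # double the waiting interval each time
            else:
                raise gen.Return(result)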

This is implemented in the various transport tasks that are called
by the Waiting state of the JobProcess class:

 * task_submit_job: submit the calculation
 * task_update_job: update the scheduler state
 * task_retrieve_job: retrieve the files of the completed calculation
 * task_kill_job: kill the job through the scheduler

Each of these is now wrapped in the exponential_backoff_retry
coroutine, which gives the process some leeway when a task fails
for reasons that often resolve themselves, given time.
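
Schematically, a wrapped task looks something like the sketch
below; the task body, names and parameters are illustrative and
not the actual code in aiida.work:

    from tornado import gen

    @gen.coroutine
    def task_update_job(job, initial_interval=10.0, max_attempts=5):
        """Update the scheduler state of job, retrying transient failures."""

        @gen.coroutine
        def do_update():
            # hypothetical body: open a transport and query the scheduler
            raise gen.Return('RUNNING')  # placeholder result

        result = yield exponential_backoff_retry(
            do_update, initial_interval=initial_interval, max_attempts=max_attempts)
        raise gen.Return(result)
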
sphuber committed Aug 2, 2018
commit 3bd18b4f9161a39c70c60e2c83dcdc4896b93a28
2 changes: 1 addition & 1 deletion aiida/backends/tests/__init__.py
@@ -93,7 +93,7 @@
     'work.run': ['aiida.backends.tests.work.run'],
     'work.runners': ['aiida.backends.tests.work.test_runners'],
     'work.test_transport': ['aiida.backends.tests.work.test_transport'],
-    'work.utils': ['aiida.backends.tests.work.utils'],
+    'work.utils': ['aiida.backends.tests.work.test_utils'],
     'work.work_chain': ['aiida.backends.tests.work.work_chain'],
     'work.workfunctions': ['aiida.backends.tests.work.test_workfunctions'],
     'work.job_processes': ['aiida.backends.tests.work.job_processes'],
55 changes: 55 additions & 0 deletions aiida/backends/tests/work/test_utils.py
@@ -0,0 +1,55 @@
# -*- coding: utf-8 -*-
from tornado.ioloop import IOLoop
from tornado.gen import coroutine

from aiida.backends.testbase import AiidaTestCase
from aiida.work.utils import exponential_backoff_retry

ITERATION = 0
MAX_ITERATIONS = 3


class TestExponentialBackoffRetry(AiidaTestCase):
    """Tests for the exponential backoff retry coroutine."""

    @classmethod
    def setUpClass(cls, *args, **kwargs):
        """Set up a simple authinfo for later use."""
        super(TestExponentialBackoffRetry, cls).setUpClass(*args, **kwargs)
        cls.authinfo = cls.backend.authinfos.create(
            computer=cls.computer,
            user=cls.backend.users.get_automatic_user())
        cls.authinfo.store()

    def test_exponential_backoff_success(self):
        """Test that exponential backoff will successfully catch exceptions as long as max_attempts is not exceeded."""
        global ITERATION  # reset the module-level counter, rather than binding a local
        ITERATION = 0
        loop = IOLoop()

        @coroutine
        def coro():
            """A coroutine that raises RuntimeError as long as ITERATION is smaller than MAX_ITERATIONS."""
            global ITERATION
            ITERATION += 1
            if ITERATION < MAX_ITERATIONS:
                raise RuntimeError

        max_attempts = MAX_ITERATIONS + 1
        loop.run_sync(lambda: exponential_backoff_retry(coro, initial_interval=0.1, max_attempts=max_attempts))

    def test_exponential_backoff_max_attempts_exceeded(self):
        """Test that exponential backoff will finally raise if max_attempts is exceeded."""
        global ITERATION  # reset the module-level counter, rather than binding a local
        ITERATION = 0
        loop = IOLoop()

        @coroutine
        def coro():
            """A coroutine that raises RuntimeError as long as ITERATION is smaller than MAX_ITERATIONS."""
            global ITERATION
            ITERATION += 1
            if ITERATION < MAX_ITERATIONS:
                raise RuntimeError

        max_attempts = MAX_ITERATIONS - 1
        with self.assertRaises(RuntimeError):
            loop.run_sync(lambda: exponential_backoff_retry(coro, initial_interval=0.1, max_attempts=max_attempts))