Retry Logic Overview (WIP)

Jim Borden edited this page Sep 13, 2016 · 5 revisions

This document is the authoritative reference for retrying requests over a network, which applies in several places during the replication process. It covers what should happen after both transient and permanent errors. A transient error is one that is expected to clear within a relatively short period of time (such as a connection timeout or an HTTP 503). A permanent error is the opposite (such as an HTTP 401 or 404) and is not likely to recover without intervention. This document does not cover other replication logic such as "going offline."


The replication retry flow is as follows:

  1. Replication attempts to start
  2. Replication attempts to continue
  3. A connection error occurs
    • 3a The connection error indicates lack of connectivity, go to 6
    • 3b The connection error is transient, go to 4
    • 3c The connection error is permanent, stop the replication
  4. Retry according to the applied retry strategy (not customizable on all platforms)
    • 4a The retry strategy fails, go to 5
    • 4b The retry strategy succeeds, go to 2
  5. At this point the request in question has failed to send and/or get a response
    • 5a The replication is continuous. Switch to idle, set last error, enter long delay (~60 sec) and go to 1
    • 5b The replication is non-continuous. Set last error, give up and stop the replication
  6. The endpoint is not reachable
    • 6a The device has no network connection. Switch to offline, set last error, and wait for network connection change.
    • 6b The device has a network connection. Switch to offline, set last error, enter long delay (~60 sec) and go to 1
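The branching in steps 3 through 6 can be sketched as a single decision function. This is only an illustrative sketch in Python, not the library's implementation: the names `ErrorKind`, `Action`, and `next_action`, and the `can_retry`, `continuous`, and `has_network` parameters, are hypothetical and chosen to mirror the step numbers above.

```python
from enum import Enum, auto


class ErrorKind(Enum):
    CONNECTIVITY = auto()  # step 3a
    TRANSIENT = auto()     # step 3b
    PERMANENT = auto()     # step 3c


class Action(Enum):
    RETRY = auto()                 # step 4: hand the request to the retry strategy
    IDLE_THEN_RESTART = auto()     # step 5a: idle, long delay, go to 1
    STOP = auto()                  # steps 3c and 5b: stop the replication
    WAIT_FOR_NETWORK = auto()      # step 6a: offline until the network changes
    OFFLINE_THEN_RESTART = auto()  # step 6b: offline, long delay, go to 1


def next_action(kind, *, can_retry, continuous, has_network):
    """Return the next step of the flow for an error of the given kind."""
    if kind is ErrorKind.CONNECTIVITY:
        # Step 6: the endpoint is not reachable.
        if has_network:
            return Action.OFFLINE_THEN_RESTART
        return Action.WAIT_FOR_NETWORK
    if kind is ErrorKind.PERMANENT:
        return Action.STOP
    # Transient error: retry while the strategy allows it (step 4).
    if can_retry:
        return Action.RETRY
    # Step 5: the retry strategy has failed.
    return Action.IDLE_THEN_RESTART if continuous else Action.STOP
```

In every case except `RETRY`, the flow also sets the last error before acting; that bookkeeping is left out of the sketch.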


The following scenarios illustrate the flow:

Start a non-continuous replication
The initial connection reports a 401 (Unauthorized)
Stop the replication; the callback is notified of the error and of the stopped status (two notifications)

Start a non-continuous replication
Halfway through, a 503 error is encountered (Service Unavailable)
The error is transient, so retry
The retry succeeds, and the replication continues

Start a non-continuous replication
Halfway through, a connection timeout occurs
The error is transient, so retry
The retry fails, so the replication stops

Start a continuous replication
A 404 error is encountered on the endpoint
The error is permanent, so stop the replication
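The scenarios above can be replayed against a small retry loop. The sketch below is written in Python purely for illustration and makes several assumptions: `TransientError`, `PermanentError`, `run_with_retry`, and `scripted` are hypothetical names, and the fixed `max_attempts` limit stands in for whatever retry strategy the platform actually applies. It models only the case where an exhausted retry stops the replication (the non-continuous case).

```python
class TransientError(Exception):
    """An error expected to pass shortly, such as a timeout or a 503."""


class PermanentError(Exception):
    """An error that needs intervention, such as a 401 or a 404."""


def run_with_retry(request, max_attempts=3):
    """Call request(), retrying transient failures up to max_attempts tries.

    Returns ("ok", result) on success or ("stopped", last_error) on failure.
    """
    last_error = None
    for _ in range(max_attempts):
        try:
            return ("ok", request())
        except PermanentError as error:
            # Permanent errors are never retried.
            return ("stopped", error)
        except TransientError as error:
            # Remember the error and try again if attempts remain.
            last_error = error
    return ("stopped", last_error)


def scripted(outcomes):
    """Build a fake request that yields the given outcomes one per call."""
    remaining = iter(outcomes)

    def request():
        outcome = next(remaining)
        if isinstance(outcome, Exception):
            raise outcome
        return outcome

    return request


# A 503 followed by a success: the retry succeeds.
print(run_with_retry(scripted([TransientError("503"), "done"])))
# A 401 on the first attempt: stop without retrying.
print(run_with_retry(scripted([PermanentError("401"), "done"]))[0])
```

A real strategy would normally wait between attempts; the delay is omitted here so the sketch stays focused on the transient versus permanent split.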

Pseudocode Algorithm

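void ErrorEncountered(Exception e)
{
    if(IsTransient(e)) {
        // 3b
        if(_strategy.CanRetry) {
            // 4: retry according to the retry strategy
        } else {
            // 4a: the retry strategy failed
            HandleErrorEndgame(e);
        }
    } else {
        // 3a or 3c
        HandleErrorEndgame(e);
    }
}

void HandleErrorEndgame(Exception e)
{
    if(IsContinuous && IsTransient(e)) {
        // 5a
    } else if(IsOfflineError(e)) {
        // 3a -> 6b
    } else {
        // 3c
    }
}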

Error Judgement

Determining whether an error is connectivity related, permanent, or transient is a substantial task. This section collects the rules used so far (using .NET for reference).


  • IOException, TimeoutException, and TaskCanceledException (thrown by the library when an async HTTP request times out) are all considered transient and are not analyzed further.
  • For SocketException, the socket error code is analyzed:
    • AccessDenied = 10013 = Permanent
    • AddressAlreadyInUse = 10048 = Permanent
    • AddressFamilyNotSupported = 10047 = Permanent
    • AddressNotAvailable = 10049 = Permanent
    • AlreadyInProgress = 10037 = Transient
    • ConnectionAborted = 10053 = Transient
    • ConnectionRefused = 10061 = Connectivity
    • ConnectionReset = 10054 = Transient
    • DestinationAddressRequired = 10039 = Permanent
    • Disconnecting = 10101 = Permanent
    • Fault = 10014 = Permanent
    • HostDown = 10064 = Connectivity
    • HostNotFound = 11001 = Permanent
    • HostUnreachable = 10065 = Permanent
    • InProgress = 10036 = Transient
    • Interrupted = 10004 = Transient
    • InvalidArgument = 10022 = Permanent
    • IOPending = 997 = Transient
    • IsConnected = 10056 = Transient
    • MessageSize = 10040 = Permanent
    • NetworkDown = 10050 = Connectivity
    • NetworkReset = 10052 = Transient
    • NetworkUnreachable = 10051 = Permanent
    • NoBufferSpaceAvailable = 10055 = Permanent
    • NoData = 11004 = Permanent
    • NoRecovery = 11003 = Permanent
    • NotConnected = 10057 = Connectivity
    • NotInitialized = 10093 = Permanent
    • NotSocket = 10038 = Permanent
    • OperationAborted = 995 = Transient
    • OperationNotSupported = 10045 = Permanent
    • ProcessLimit = 10067 = Transient
    • ProtocolFamilyNotSupported = 10046 = Permanent
    • ProtocolNotSupported = 10043 = Permanent
    • ProtocolOption = 10042 = Permanent
    • ProtocolType = 10041 = Permanent
    • Shutdown = 10058 = Transient
    • SocketError = -1 = Permanent
    • SocketNotSupported = 10044 = Permanent
    • SystemNotReady = 10091 = Transient
    • TimedOut = 10060 = Transient
    • TooManyOpenSockets = 10024 = Transient
    • TryAgain = 11002 = Transient
    • TypeNotFound = 10109 = Permanent
    • VersionNotSupported = 10092 = Permanent
    • WouldBlock = 10035 = Transient
  • For WebException, the type of failure is analyzed first:
    • ConnectFailure, Timeout, ConnectionClosed, and RequestCanceled are transient
    • Others are considered permanent unless they have an HTTP status code
      • Transient errors are HTTP 408, 500, 502, 503, 504
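The rules above reduce to a few set lookups. The following Python sketch is illustrative only: the function names are hypothetical, the failure type is passed as a plain string rather than a .NET enum value, and treating unlisted socket codes and unlisted HTTP statuses as permanent is an assumption rather than something the rules state.

```python
from typing import Optional

# Socket error codes marked Transient in the list above.
TRANSIENT_SOCKET_CODES = frozenset({
    995, 997, 10004, 10024, 10035, 10036, 10037, 10052,
    10053, 10054, 10056, 10058, 10060, 10067, 10091, 11002,
})

# Socket error codes marked Connectivity in the list above.
CONNECTIVITY_SOCKET_CODES = frozenset({10050, 10057, 10061, 10064})

# WebException failure types that are transient regardless of HTTP status.
TRANSIENT_WEB_FAILURES = frozenset({
    "ConnectFailure", "Timeout", "ConnectionClosed", "RequestCanceled",
})

# HTTP status codes that are transient.
TRANSIENT_HTTP_STATUSES = frozenset({408, 500, 502, 503, 504})


def classify_socket_error(code: int) -> str:
    """Classify a socket error code as transient, connectivity, or permanent."""
    if code in TRANSIENT_SOCKET_CODES:
        return "transient"
    if code in CONNECTIVITY_SOCKET_CODES:
        return "connectivity"
    # Assumption: anything not listed as transient or connectivity is permanent.
    return "permanent"


def classify_web_failure(failure: str, http_status: Optional[int] = None) -> str:
    """Classify a web failure by its type first, then by its HTTP status."""
    if failure in TRANSIENT_WEB_FAILURES:
        return "transient"
    if http_status in TRANSIENT_HTTP_STATUSES:
        return "transient"
    # Assumption: any other failure type or status code is permanent.
    return "permanent"
```

For example, `classify_socket_error(10061)` (ConnectionRefused) yields `"connectivity"`, while `classify_web_failure("ProtocolError", 404)` yields `"permanent"`.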