Retry Logic Overview (WIP)
This document is intended to be the authoritative reference on retrying requests over a network, which applies in quite a few places during the replication process. It covers what should happen in the event of both transient and permanent errors. A transient error is one that is expected to pass within a relatively short period of time (such as a connection timeout or a 503). A permanent error is the opposite (such as a 401 or 404) and is not likely to recover without intervention. This document does not cover other replication logic such as "going offline."
The replication retry flow is as follows:

1. Replication attempts to start
2. Replication attempts to continue
3. A connection error occurs
   - 3a The connection error indicates lack of connectivity, go to 6
   - 3b The connection error is transient, go to 4
   - 3c The connection error is permanent, stop the replication
4. Retry according to the applied retry strategy (not customizable on all platforms)
   - 4a The retry strategy fails, go to 5
   - 4b The retry strategy succeeds, go to 2
5. At this point the request in question has failed to send and/or get a response
   - 5a The replication is continuous. Switch to idle, set last error, enter long delay (~60 sec) and go to 1
   - 5b The replication is non-continuous. Set last error, give up and stop the replication
6. The endpoint is not reachable
   - 6a The device has no network connection. Switch to offline, set last error, and wait for network connection change.
   - 6b The device has a network connection. Switch to offline, set last error, enter long delay (~60 sec) and go to 1
Example scenarios:

- Permanent error on the initial connection (3c)
  - Start non-continuous replication
  - Initial connection reports 401 (Unauthorized)
  - Stop replication, callback for error and stopped status (two notifications)
- Transient error, retry succeeds (3b, 4b)
  - Start non-continuous replication
  - Halfway through, a 503 error is encountered (Service Unavailable)
  - Error is transient, so retry
  - Retry succeeds, replication continues
- Transient error, retry fails (3b, 4a, 5b)
  - Start non-continuous replication
  - Halfway through, a connection timeout happens
  - Error is transient, so retry
  - Retry fails, replication stops
- Permanent error during continuous replication (3c)
  - Start a continuous replication
  - A 404 error is encountered on the endpoint
  - Permanent error, so stop the replication
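The flow and the scenarios above can be condensed into a single decision function. The following Python sketch is only an illustration of that flow: the names `ErrorKind`, `Outcome`, and `next_outcome` are hypothetical, and the retry result and network state are passed in as plain booleans rather than being observed from a real replicator.

```python
from enum import Enum, auto


class ErrorKind(Enum):
    CONNECTIVITY = auto()  # step 3a: endpoint is not reachable
    TRANSIENT = auto()     # step 3b: expected to pass shortly
    PERMANENT = auto()     # step 3c: needs intervention


class Outcome(Enum):
    CONTINUE = auto()              # 4b: retry succeeded, back to step 2
    STOP = auto()                  # 3c or 5b: set last error and stop
    IDLE_THEN_RESTART = auto()     # 5a: idle, long delay, back to step 1
    OFFLINE_WAIT_NETWORK = auto()  # 6a: wait for a network change
    OFFLINE_THEN_RESTART = auto()  # 6b: long delay, back to step 1


def next_outcome(kind, retry_succeeded, continuous, has_network):
    """Map one connection error to the outcome the flow prescribes."""
    if kind is ErrorKind.PERMANENT:
        return Outcome.STOP
    if kind is ErrorKind.CONNECTIVITY:
        if has_network:
            return Outcome.OFFLINE_THEN_RESTART
        return Outcome.OFFLINE_WAIT_NETWORK
    # Transient error: the retry strategy runs first (step 4).
    if retry_succeeded:
        return Outcome.CONTINUE
    if continuous:
        return Outcome.IDLE_THEN_RESTART
    return Outcome.STOP


# The four scenarios above, in order.
print(next_outcome(ErrorKind.PERMANENT, False, False, True).name)  # STOP
print(next_outcome(ErrorKind.TRANSIENT, True, False, True).name)   # CONTINUE
print(next_outcome(ErrorKind.TRANSIENT, False, False, True).name)  # STOP
print(next_outcome(ErrorKind.PERMANENT, False, True, True).name)   # STOP
```

Keeping the decision in one pure function makes it straightforward to check each branch of the flow without a live network.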
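```csharp
void ErrorEncountered(Exception e)
{
    if(IsTransient(e)) {
        // 3b: transient error, retry while the strategy allows it
        if(_strategy.CanRetry) {
            _strategy.Retry();
            return;
        }
    }

    // Connectivity or permanent error, or the retry strategy is exhausted (4a)
    HandleErrorEndgame(e);
}

void HandleErrorEndgame(Exception e)
{
    if(IsContinuous && IsTransient(e)) {
        // 5a: switch to idle, set last error, long delay, then start over
        EnterRetryLoop();
        return;
    }

    if(IsOfflineError(e)) {
        // 3a -> 6b
        EnterOfflineLoop();
    } else {
        // 3c (permanent), or 5b (non-continuous replication out of retries)
        StopReplication();
    }
}
```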
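The `_strategy` object in the pseudocode above is not specified beyond `CanRetry` and `Retry()`. As one possible shape, here is a minimal Python sketch of a bounded retry strategy with a doubling delay. The class name, the attempt limit, and the delay values are all assumptions made for illustration; this document does not prescribe a backoff policy.

```python
class ExponentialBackoffStrategy:
    """Hypothetical retry strategy: bounded attempts, doubling delay."""

    def __init__(self, max_attempts=3, base_delay=2.0):
        self.max_attempts = max_attempts  # assumed limit, not from this document
        self.base_delay = base_delay      # assumed first delay, in seconds
        self.attempts = 0

    @property
    def can_retry(self):
        """True while the retry budget is not used up (like CanRetry)."""
        return self.attempts < self.max_attempts

    def next_delay(self):
        """Consume one attempt and return how long to wait before it."""
        if not self.can_retry:
            raise RuntimeError("retry budget exhausted")
        delay = self.base_delay * (2 ** self.attempts)
        self.attempts += 1
        return delay

    def reset(self):
        """Restore the full budget, e.g. after a request succeeds."""
        self.attempts = 0


strategy = ExponentialBackoffStrategy()
delays = []
while strategy.can_retry:
    delays.append(strategy.next_delay())
print(delays)  # [2.0, 4.0, 8.0]
```

Once `can_retry` turns false, the caller would fall through to the endgame handling (step 5) rather than retrying again.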
Determining whether an error is connectivity related, permanent, or transient is a big task. This section will accumulate the rules used so far (using .NET for reference).
- `IOException`, `TimeoutException`, and `TaskCanceledException` (the last is thrown by the library during async timeouts on HTTP requests) are all considered transient and are not analyzed further.
- `SocketException` is classified by its socket error code:
  - AccessDenied = 10013 = Permanent
  - AddressAlreadyInUse = 10048 = Permanent
  - AddressFamilyNotSupported = 10047 = Permanent
  - AddressNotAvailable = 10049 = Permanent
  - AlreadyInProgress = 10037 = Transient
  - ConnectionAborted = 10053 = Transient
  - ConnectionRefused = 10061 = Connectivity
  - ConnectionReset = 10054 = Transient
  - DestinationAddressRequired = 10039 = Permanent
  - Disconnecting = 10101 = Permanent
  - Fault = 10014 = Permanent
  - HostDown = 10064 = Connectivity
  - HostNotFound = 11001 = Permanent
  - HostUnreachable = 10065 = Permanent
  - InProgress = 10036 = Transient
  - Interrupted = 10004 = Transient
  - InvalidArgument = 10022 = Permanent
  - IOPending = 997 = Transient
  - IsConnected = 10056 = Transient
  - MessageSize = 10040 = Permanent
  - NetworkDown = 10050 = Connectivity
  - NetworkReset = 10052 = Transient
  - NetworkUnreachable = 10051 = Permanent
  - NoBufferSpaceAvailable = 10055 = Permanent
  - NoData = 11004 = Permanent
  - NoRecovery = 11003 = Permanent
  - NotConnected = 10057 = Connectivity
  - NotInitialized = 10093 = Permanent
  - NotSocket = 10038 = Permanent
  - OperationAborted = 995 = Transient
  - OperationNotSupported = 10045 = Permanent
  - ProcessLimit = 10067 = Transient
  - ProtocolFamilyNotSupported = 10046 = Permanent
  - ProtocolNotSupported = 10043 = Permanent
  - ProtocolOption = 10042 = Permanent
  - ProtocolType = 10041 = Permanent
  - Shutdown = 10058 = Transient
  - SocketError = -1 = Permanent
  - SocketNotSupported = 10044 = Permanent
  - SystemNotReady = 10091 = Transient
  - TimedOut = 10060 = Transient
  - TooManyOpenSockets = 10024 = Transient
  - TryAgain = 11002 = Transient
  - TypeNotFound = 10109 = Permanent
  - VersionNotSupported = 10092 = Permanent
  - WouldBlock = 10035 = Transient
- `WebException` is classified by its type of failure first:
  - `ConnectFailure`, `Timeout`, `ConnectionClosed`, and `RequestCanceled` are transient
  - Others are considered permanent unless they have an HTTP status code
  - Transient errors are HTTP 408, 500, 502, 503, 504
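The rules above reduce to a few set lookups. The Python sketch below encodes the transient and connectivity socket codes from the list and the transient HTTP statuses. The function names are hypothetical. Treating any socket code outside the two sets as permanent is an assumption of this sketch, though it matches every entry in the list. `ProtocolError` is used here as the failure type that carries an HTTP status.

```python
# Socket error codes listed above as Transient.
TRANSIENT_SOCKET_CODES = frozenset({
    995, 997, 10004, 10024, 10035, 10036, 10037, 10052,
    10053, 10054, 10056, 10058, 10060, 10067, 10091, 11002,
})
# Socket error codes listed above as Connectivity.
CONNECTIVITY_SOCKET_CODES = frozenset({10050, 10057, 10061, 10064})

# WebException failure types that are transient regardless of status.
TRANSIENT_WEB_FAILURES = frozenset(
    {"ConnectFailure", "Timeout", "ConnectionClosed", "RequestCanceled"}
)
TRANSIENT_HTTP_STATUSES = frozenset({408, 500, 502, 503, 504})


def classify_socket_error(code):
    """Classify a numeric socket error code per the list above."""
    if code in CONNECTIVITY_SOCKET_CODES:
        return "connectivity"
    if code in TRANSIENT_SOCKET_CODES:
        return "transient"
    # Assumption: anything else is permanent, as in the list above.
    return "permanent"


def classify_web_error(failure, http_status=None):
    """Classify by failure type first, then by HTTP status if present."""
    if failure in TRANSIENT_WEB_FAILURES:
        return "transient"
    if http_status in TRANSIENT_HTTP_STATUSES:
        return "transient"
    return "permanent"


print(classify_socket_error(10060))              # transient
print(classify_socket_error(10061))              # connectivity
print(classify_socket_error(11001))              # permanent
print(classify_web_error("Timeout"))             # transient
print(classify_web_error("ProtocolError", 503))  # transient
print(classify_web_error("ProtocolError", 404))  # permanent
```

Keeping the codes in sets rather than a chain of conditionals makes it easier to compare the implementation against the list when a rule changes.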