RFC: Retry / Timeout #317

@dotansimha


Note about Traffic Shaping category

Apollo Router lets you configure all of these, together with Compression, under the Traffic Shaping category. We'll treat them separately, as Hive Gateway JS and Cosmo Router do.

Let's separate Retry and Timeout first, then combine them later.

Retry

When an executor returns an unexpected error (for HTTP, a 5xx status code), the gateway is usually expected to retry the request. There are a few parameters for this feature:

Reference from JS GW

  • Max Retries -> In Hive Gateway JS, we expect users to set a maximum number of retries before giving up on the request. The executor is called with the same request again and again until the total number of attempts exceeds the maximum number of retries.
  • Retry Delay -> This parameter is a bit different, because it is a fallback used when the Retry-After HTTP header is not present. If the subgraph provides this header, its value is used as the wait before the next attempt; otherwise this configured value is used for back-off.
  • Exponential Backoff -> To avoid putting pressure on subgraphs, the delay before each attempt is increased by a factor such as 1.25.
    For example, if the first attempt fails and the retry delay is set to 1 second, the gateway sends the next request after 1 second; if that also fails, it waits 1.25 * 1 = 1.25 seconds.
  • Conditional Retry Logic -> In Hive Gateway JS, the values above can be set based on request and/or response parameters, so users can implement custom logic for specific needs. In JS, we allow users to set a lambda function such as shouldRetry: ({ response }) => response?.status >= 400 (always retry on any HTTP error), so it is not yet clear how dynamic expressions like this would be expressed in the router.
  • Subgraph-based Logic -> In Hive Gateway JS, all of the above can also be configured for a specific subgraph:
ret:
  global:
    maxRetries: 3
    retryDelay: 1000
    retryFactor: 1.25
  subgraphs:
     products:
       retryDelay: 300
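
To make the parameters above concrete, here is a minimal TypeScript sketch of how the delay before each retry could be derived. The names (RetryOptions, resolveRetryOptions, parseRetryAfterMs, nextDelayMs) are hypothetical and not taken from Hive Gateway JS. It also assumes that the factor compounds on every retry and that subgraph values override global ones key by key.

```typescript
// Hypothetical config shape mirroring the YAML keys above (not the Hive Gateway JS API).
interface RetryOptions {
  maxRetries: number;
  retryDelay: number; // milliseconds
  retryFactor: number;
}

interface RetryConfig {
  global: RetryOptions;
  subgraphs?: Record<string, Partial<RetryOptions>>;
}

// Assumption: subgraph-level values override the global ones key by key.
function resolveRetryOptions(config: RetryConfig, subgraph: string): RetryOptions {
  return { ...config.global, ...(config.subgraphs?.[subgraph] ?? {}) };
}

// Retry-After carries either a number of seconds or an HTTP-date.
// Returns milliseconds to wait, or undefined if the header is absent or unparsable.
function parseRetryAfterMs(
  header: string | null | undefined,
  nowMs: number,
): number | undefined {
  if (header == null) return undefined;
  const trimmed = header.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const dateMs = Date.parse(trimmed);
  return Number.isNaN(dateMs) ? undefined : Math.max(0, dateMs - nowMs);
}

// Delay before retry number `retry` (1-based): Retry-After wins when present,
// otherwise retryDelay * retryFactor^(retry - 1).
function nextDelayMs(
  options: RetryOptions,
  retry: number,
  retryAfter?: string | null,
  nowMs: number = Date.now(),
): number {
  const fromHeader = parseRetryAfterMs(retryAfter, nowMs);
  if (fromHeader !== undefined) return fromHeader;
  return options.retryDelay * Math.pow(options.retryFactor, retry - 1);
}
```

With the config above, the first retry to a subgraph without overrides waits 1000 ms and the second 1250 ms, while products starts from 300 ms.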

The comparison and proposals for retry above are based on Hive Gateway JS because it is more detailed than Apollo Router. In Apollo Router, retries are an experimental feature that works in a similar way but is not well documented in the official docs: https://github.com/apollographql/router/releases/tag/v1.5.0

Cosmo does it in a very similar way when it comes to back-off: https://cosmo-docs.wundergraph.com/router/traffic-shaping/retry

We can add hooks, similar to the ones in JS GW, that wrap the subgraph execution to apply the timeout logic. Reference impl

Timeout

The Timeout feature is less complex than Retry, but it gets more complicated once you combine the two.
In both Apollo Router and Hive Gateway JS, you can configure a default timeout after which the execution gives up on any subgraph request, or set a timeout for a specific subgraph:

upstreamTimeout:
   global: 30_000
   subgraphs:
      products: 10_000
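
As a small illustration of how such a config could be resolved, the sketch below picks the subgraph-specific timeout when one exists and falls back to the global value otherwise. TimeoutConfig and resolveTimeoutMs are hypothetical names used only for this example.

```typescript
// Hypothetical shape mirroring the upstreamTimeout YAML above; values are milliseconds.
interface TimeoutConfig {
  global: number;
  subgraphs?: Record<string, number>;
}

// A subgraph-specific timeout takes precedence over the global default.
function resolveTimeoutMs(config: TimeoutConfig, subgraph: string): number {
  return config.subgraphs?.[subgraph] ?? config.global;
}
```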

Cosmo has more detailed options, which you probably don't need in most cases. Not sure if these are needed at first:
https://cosmo-docs.wundergraph.com/router/traffic-shaping/timeout

Combination of these two

In Hive Gateway JS:
If Retry and Timeout are both enabled at the same time (which is recommended), the timeout is applied to each try. For an overall timeout across all retries, you have to add the plugins manually to change the order in which they are applied.

upstreamRetry:
   maxRetries: 3
   retryDelay: 1_000
   retryFactor: 1.25
upstreamTimeout:
   global: 10_000

So in that case, after 10 seconds without a response, the gateway waits 1 second and tries again; after another 10 seconds, it waits 1.25 seconds and tries again, and so on. Apollo Router seems to behave the same way with its experimental retry and timeout combination.
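
The worst case for this combination can be worked out with a short sketch. It assumes that every attempt times out, that no Retry-After header is sent, that maxRetries counts the retries made after the initial attempt, and that the factor compounds on each retry. worstCaseTotalMs is a hypothetical helper, not part of any gateway.

```typescript
// Upper bound on the time spent on one subgraph request when the timeout
// applies to each try and every try runs into that timeout.
function worstCaseTotalMs(
  maxRetries: number,
  retryDelayMs: number,
  retryFactor: number,
  timeoutMs: number,
): number {
  // One initial attempt plus maxRetries retries, each bounded by the timeout.
  let total = (maxRetries + 1) * timeoutMs;
  // Back-off waits between attempts: retryDelay * retryFactor^(retry - 1).
  for (let retry = 1; retry <= maxRetries; retry++) {
    total += retryDelayMs * Math.pow(retryFactor, retry - 1);
  }
  return total;
}

// Values from the config above: 4 * 10000 + (1000 + 1250 + 1562.5) = 43812.5 ms.
const worstCase = worstCaseTotalMs(3, 1000, 1.25, 10000);
```

Under these assumptions a client could wait almost 44 seconds, which is the motivation for an optional overall timeout across all retries.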

Per operation configuration

Sometimes you might have slow mutations that need a higher timeout than the rest of your operations.
In that case you need specific configuration for specific operations.
We can provide a VRL expression to decide on the timeout:

timeout:
   expression: |
       if .operation.type == "mutation" {
          2_000
       } else {
          1_000
       }
