RFC: Retry / Timeout #317

### Note about Traffic Shaping category
Apollo Router lets you configure all of these, together with [Compression](https://github.com/graphql-hive/router/issues/315), under the Traffic Shaping category. We'll treat them separately, as Hive Gateway JS and Cosmo Router do.

Let's separate Retry and Timeout first, then combine them later.

### Retry
When an executor returns an unexpected error (for HTTP, a 5xx status code), the gateway is usually expected to retry the request. This feature has a few parameters:

[Reference from JS GW](https://the-guild.dev/graphql/hive/docs/gateway/other-features/upstream-reliability#retry-mechanism)

- **Max Retries** -> In Hive Gateway JS, users set a maximum number of retries before the gateway gives up on the request. The executor is called with the same request again and again until the total number of attempts exceeds `max retries`.
- **Retry Delay** -> This parameter is a bit different because it is only a fallback for when the `Retry-After` HTTP header is not present. If the subgraph provides this header, its value is used as the wait before the next attempt; otherwise the configured value is used for back-off.
- **Exponential Backoff** -> To reduce pressure on subgraphs, the delay before each attempt is increased by a factor such as 1.25. For example, with a retry delay of 1 second, if the first attempt fails the gateway sends the next request after 1 second; if that also fails, it waits 1.25 * 1 = 1.25 seconds.
- **Conditional Retry Logic** -> Hive Gateway allows setting the values above based on request and/or response parameters, so users can implement custom logic for specific needs. In JS, users can pass a lambda such as `shouldRetry: ({ response }) => response?.status >= 400 // Always retry any HTTP errors`. It is not yet clear how dynamic expressions like this would work in the router.
- **Subgraph-based Logic** -> In Hive Gateway JS, all of the above can also be configured for a specific subgraph.
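
The delay rules in the list above can be sketched as a small pure function. This is only an illustration under stated assumptions: the names `RetryOptions`, `parseRetryAfterMs`, `retryDelayMs`, and `shouldRetryStatus` are hypothetical and are not the Hive Gateway JS or router API, delays are assumed to be in milliseconds, and `Retry-After` is assumed to carry either a number of seconds or an HTTP date.

```typescript
// Hypothetical option names mirroring the bullets above; not an actual gateway API.
interface RetryOptions {
  maxRetries: number; // assumed to count retries after the first attempt
  retryDelay: number; // fallback delay in milliseconds
  retryFactor: number; // multiplier applied to the delay per retry, e.g. 1.25
}

// Parses a Retry-After value (delay in seconds, or an HTTP date) into milliseconds.
// Returns undefined when the header is missing or cannot be parsed.
function parseRetryAfterMs(header: string | null, nowMs: number): number | undefined {
  if (header === null) return undefined;
  const trimmed = header.trim();
  if (trimmed === "") return undefined;
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const dateMs = Date.parse(trimmed);
  return Number.isNaN(dateMs) ? undefined : Math.max(0, dateMs - nowMs);
}

// Delay before retry number `retryIndex` (0 = first retry).
// A usable Retry-After header wins; otherwise the fallback delay grows exponentially.
function retryDelayMs(
  opts: RetryOptions,
  retryIndex: number,
  retryAfter: string | null,
  nowMs: number = Date.now(),
): number {
  const fromHeader = parseRetryAfterMs(retryAfter, nowMs);
  if (fromHeader !== undefined) return fromHeader;
  return opts.retryDelay * Math.pow(opts.retryFactor, retryIndex);
}

// Default predicate described in the text: retry only on 5xx responses.
function shouldRetryStatus(status: number): boolean {
  return status >= 500 && status <= 599;
}
```

With `retryDelay: 1000` and `retryFactor: 1.25`, this gives 1000 ms before the first retry and 1250 ms before the second, which matches the example in the list.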

```yaml
ret:
  global:
    maxRetries: 3
    retryDelay: 1000
    retryFactor: 1.25
  subgraphs:
     products:
       retryDelay: 300
```
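
One way the per-subgraph override in the configuration above could be resolved is to start from the global values and overlay whatever the subgraph entry sets. The sketch below assumes exactly that merge rule; the type and function names (`RetrySettings`, `RetryConfig`, `resolveRetrySettings`) are hypothetical and only mirror the YAML keys.

```typescript
// Hypothetical types mirroring the YAML above; delays are assumed to be milliseconds.
interface RetrySettings {
  maxRetries: number;
  retryDelay: number;
  retryFactor: number;
}

interface RetryConfig {
  global: RetrySettings;
  // Each subgraph may override any subset of the global settings.
  subgraphs?: Record<string, Partial<RetrySettings>>;
}

// Returns the effective settings for a subgraph: global values,
// with any subgraph-specific values taking precedence.
function resolveRetrySettings(config: RetryConfig, subgraphName: string): RetrySettings {
  const overrides = config.subgraphs?.[subgraphName] ?? {};
  return { ...config.global, ...overrides };
}
```

For the configuration above, `products` would end up with `retryDelay: 300` while keeping `maxRetries: 3` and `retryFactor: 1.25`; every other subgraph would use the global values unchanged.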

> The comparison and proposals for retry above are based on Hive Gateway JS because it is more detailed than Apollo Router. In Apollo Router, retries are an experimental feature that works in a similar way but is not well documented in the official docs; https://github.com/apollographql/router/releases/tag/v1.5.0

> Cosmo does it in a very similar way when it comes to back-off: https://cosmo-docs.wundergraph.com/router/traffic-shaping/retry

We can have hooks, similar to the ones in JS GW, that wrap the subgraph execution to apply the timeout logic. [Reference impl](https://github.com/graphql-hive/gateway/blob/main/packages/runtime/src/plugins/useUpstreamTimeout.ts)

### Timeout
The timeout feature is less complex than retry, but it gets more complicated when you combine the two.
In both Apollo Router and Hive Gateway JS, you can configure a default timeout after which execution gives up on any subgraph request, or set a timeout for a specific subgraph:

```yaml
upstreamTimeout:
   global: 30_000
   subgraphs:
      products: 10_000
```
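
A minimal sketch of how such a timeout could wrap a subgraph call is shown below. It is not the referenced plugin's implementation: `TimeoutConfig`, `resolveTimeoutMs`, and `withTimeout` are hypothetical names, values are assumed to be milliseconds, and the wrapped function is assumed to accept an `AbortSignal` so the in-flight request can be cancelled.

```typescript
// Hypothetical shape mirroring the `upstreamTimeout` YAML above (milliseconds).
interface TimeoutConfig {
  global: number;
  subgraphs?: Record<string, number>;
}

// A subgraph-specific timeout wins; otherwise the global default applies.
function resolveTimeoutMs(config: TimeoutConfig, subgraphName: string): number {
  return config.subgraphs?.[subgraphName] ?? config.global;
}

// Rejects if `run` does not settle within `timeoutMs`, and signals the
// underlying work to abort so it does not keep running in the background.
async function withTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timedOut = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort();
      reject(new Error(`Upstream request timed out after ${timeoutMs}ms`));
    }, timeoutMs);
  });
  try {
    return await Promise.race([run(controller.signal), timedOut]);
  } finally {
    // Always clear the timer so a fast response does not leave it pending.
    clearTimeout(timer);
  }
}
```

A caller would pass the signal on to its HTTP client, for example `withTimeout((signal) => fetch(url, { signal }), resolveTimeoutMs(config, "products"))`, where `url` and `config` are placeholders.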

> Cosmo has more detailed options, which you probably don't need in most cases. Not sure if these are needed at first:
https://cosmo-docs.wundergraph.com/router/traffic-shaping/timeout

### Combination of these two
In Hive Gateway JS:
If Retry and Timeout are both enabled at the same time (which is recommended), the timeout is applied to each attempt. For an overall timeout across all retries, you have to add the plugins manually to change the order in which they are applied.
```yaml
upstreamRetry:
   maxRetries: 3
   retryDelay: 1_000
   retryFactor: 1.25
upstreamTimeout:
   global: 10_000
```
So in that case, after 10 seconds without a response, it waits 1 second and tries again; after another 10 seconds, it waits 1.25 seconds and tries again, and so on. Apollo Router seems to behave the same way with its experimental retry and timeout combination.
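
Because the timeout applies per attempt, the worst-case time a client waits is noticeably larger than the configured timeout. The sketch below adds it up under two assumptions that the text does not pin down: `maxRetries` counts retries after the initial attempt, and no `Retry-After` header overrides the back-off. The function name `worstCaseTotalMs` is hypothetical.

```typescript
// Worst-case total wait when every attempt runs into the per-attempt timeout.
// Assumption: attempts = 1 initial try + maxRetries retries.
function worstCaseTotalMs(
  maxRetries: number,
  retryDelay: number,
  retryFactor: number,
  timeoutMs: number,
): number {
  // Time spent waiting on attempts that all time out.
  let total = (maxRetries + 1) * timeoutMs;
  // Back-off waits between attempts, growing by `retryFactor` each time.
  let delay = retryDelay;
  for (let i = 0; i < maxRetries; i++) {
    total += delay;
    delay *= retryFactor;
  }
  return total;
}

// Values from the configuration above:
// 4 attempts * 10000 ms + (1000 + 1250 + 1562.5) ms of back-off.
console.log(worstCaseTotalMs(3, 1000, 1.25, 10000)); // 43812.5
```

Under these assumptions a single subgraph call can take almost 44 seconds before it finally fails, which is the motivation for an optional overall timeout across all retries.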

### Per operation configuration
Sometimes you have slow mutations that need a higher timeout than the rest of your operations.
In that case you need specific configuration for specific operations.
We can then provide a VRL expression to decide on the timeout:

```yaml
timeout:
   expression: |
       if .operation.type == "mutation" {
          2_000
       } else {
          1_000
       }
```
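
For comparison with the lambda-based approach in Hive Gateway JS, the same decision could be written as a plain function. This is a hypothetical sketch: `RequestContext` and `timeoutForRequest` are made-up names, and the context shape simply mirrors the `.operation.type` path used in the expression above.

```typescript
type OperationType = "query" | "mutation" | "subscription";

// Hypothetical context shape mirroring `.operation.type` in the expression.
interface RequestContext {
  operation: { type: OperationType };
}

// Returns the timeout in milliseconds: mutations get the higher value.
function timeoutForRequest(ctx: RequestContext): number {
  return ctx.operation.type === "mutation" ? 2000 : 1000;
}
```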
