Metrics by error type when distributor fails to call ingester #4441

**Is your feature request related to a problem? Please describe.**
When running Cortex, calls from the distributor to the ingesters sometimes fail. We have [metrics](https://github.com/cortexproject/cortex/blob/master/pkg/distributor/distributor.go#L820) for calls made to the ingesters as well as [metrics](https://github.com/cortexproject/cortex/blob/master/pkg/distributor/distributor.go#L822) for failed responses. The failure metric counts every type of error. Some of these errors, such as 4xx, do not indicate a problem with the system, while errors like 5xx do show that something unexpected is happening with an ingester. As a result, we can have a bad ingester without knowing it, because an increase in the ingesterAppendFailures metric does not necessarily mean there is an issue. This becomes a bigger problem when a second ingester is restarted or goes down. We thought we had redundancy by running 3 ingesters, but we were actually already running at limited capacity, with 2 out of 3.
It would be valuable to be able to differentiate the kinds of errors we receive from the ingesters.


**Describe the solution you'd like**
The proposed solution is to add a new label to the existing [metric](https://github.com/cortexproject/cortex/blob/master/pkg/distributor/distributor.go#L299) that carries the error family code (e.g. 5xx or 4xx). This would allow us to filter the metric by error family.
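
To make the idea concrete, here is a minimal sketch in plain Go of how a status code could be collapsed into a family label. It is not the Cortex implementation: `statusFamily`, `failureKey`, and `failureCounter` are hypothetical names, the map is only a stand-in for a labelled counter, and the sketch assumes the status code has already been extracted from the ingester error and that the failures are tracked per ingester.

```go
package main

// statusFamily collapses a status code into a coarse family so that the
// label stays low-cardinality. Anything outside 4xx/5xx is "other".
// (Hypothetical helper, not part of Cortex.)
func statusFamily(code int) string {
	switch {
	case code >= 400 && code < 500:
		return "4xx"
	case code >= 500 && code < 600:
		return "5xx"
	default:
		return "other"
	}
}

// failureKey is an assumed label set: the ingester plus the status family.
type failureKey struct {
	ingester string
	family   string
}

// failureCounter is a stand-in for a labelled counter metric.
type failureCounter map[failureKey]int

// observe records one failed append for the given ingester and status code.
func (c failureCounter) observe(ingester string, code int) {
	c[failureKey{ingester: ingester, family: statusFamily(code)}]++
}
```

With something like this, an alert could watch only the `5xx` family for a given ingester, while `4xx` failures (for example, rejected samples) remain visible without triggering it.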

**Describe alternatives you've considered**
We could also introduce a new metric for 5xx errors instead of adding a new label to an existing metric. 5xx errors are mostly what we are looking for to get better visibility into the health of the system. We could add another metric called ingesterAppendFatalFailures that only counts errors whose responses are not 4xx.
This is still a viable option, but it just creates another metric for information that we are already generating.
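
For comparison, the alternative could look roughly like the sketch below. Again, this is only an illustration under assumptions: `isFatalFailure` and `appendFailures` are hypothetical names, the two integer fields stand in for the existing failure counter and the proposed ingesterAppendFatalFailures counter, and the status code is assumed to be already available.

```go
package main

// isFatalFailure reports whether a failed append should count toward the
// proposed fatal-failures metric: any status outside the 4xx range.
// (Hypothetical helper, not part of Cortex.)
func isFatalFailure(code int) bool {
	return code < 400 || code >= 500
}

// appendFailures stands in for two separate counters: the existing
// all-failures counter and the proposed fatal-only counter.
type appendFailures struct {
	total int
	fatal int
}

// record counts one failed append, incrementing fatal only for non-4xx codes.
func (f *appendFailures) record(code int) {
	f.total++
	if isFatalFailure(code) {
		f.fatal++
	}
}
```

The sketch shows the duplication mentioned above: every fatal failure is counted twice, once in each counter.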

