Description
Is your feature request related to a problem? Please describe.
When using Cortex, calls between the distributor and the ingester sometimes fail. We have metrics for calls made to the ingester as well as for failed responses, but the failed-response metric collects every type of error. Some of these errors, such as 4xx, do not indicate a problem with the system, while errors like 5xx do show that something unexpected is happening with the ingester. As a result, we could have a bad ingester without knowing it, because an increase in the ingesterAppendFailures metric does not necessarily mean we have an issue. This becomes a bigger problem when a second ingester is restarted or goes down. We thought we had redundancy by running 3 ingesters, but we were actually already running at limited capacity, effectively 2 out of 3.
It would be valuable to differentiate between the kinds of errors we receive from the ingester.
Describe the solution you'd like
The proposed solution is to add a new label to the existing metric that holds the error family code (e.g. 5xx or 4xx). This would allow us to filter the metric by error family.
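As a rough illustration of the idea, the sketch below maps a status code to a family label and counts failures per family. It assumes an HTTP-style status code can be extracted from the ingester error; statusFamily and failureCounter are hypothetical names, and the map is only a stand-in for a labeled counter, not the actual Cortex metric code.

```go
package main

import "fmt"

// statusFamily maps an HTTP-style status code to a coarse family label
// such as "4xx" or "5xx". Codes outside 100-599 map to "unknown".
// Hypothetical helper: it assumes a status code is available on the error.
func statusFamily(code int) string {
	if code < 100 || code > 599 {
		return "unknown"
	}
	return fmt.Sprintf("%dxx", code/100)
}

// failureCounter is a hypothetical stand-in for a counter with one extra
// label: it keeps one count per status family.
type failureCounter map[string]int

// observe increments the count for the family the status code belongs to.
func (c failureCounter) observe(code int) {
	c[statusFamily(code)]++
}

func main() {
	failures := failureCounter{}
	for _, code := range []int{400, 429, 500, 503} {
		failures.observe(code)
	}
	// Client-side and server-side failures can now be read separately.
	fmt.Println(failures["4xx"], failures["5xx"]) // prints: 2 2
}
```

With a split like this, an alert could watch only the 5xx family while the total across families still matches what the metric reports today.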
Describe alternatives you've considered
We could also introduce a new metric for 5xx errors instead of adding a new label to the existing metric. 5xx errors are mostly what we are looking for to get better visibility into the health of the system. We could add another metric called ingesterAppendFatalFailures that only counts errors whose responses are not 4xx.
This is still a viable option, but it just creates another metric for information that we are already generating.