-
Notifications
You must be signed in to change notification settings - Fork 820
Retrying AlertManager UnaryPath Requests on next replica if one fail #4422
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Retrying AlertManager UnaryPath Requests on next replica if one fail #4422
Conversation
849ab96
to
cadcb5a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks; looks broadly fine but I made a couple of suggestions.
Signed-off-by: Alan Protasio <approtas@amazon.com>
Signed-off-by: Alan Protasio <approtas@amazon.com>
Signed-off-by: Alan Protasio <approtas@amazon.com>
Co-authored-by: Bryan Boreham <bjboreham@gmail.com> Signed-off-by: Alan Protasio <approtas@amazon.com>
Signed-off-by: Alan Protasio <approtas@amazon.com>
7cfe8e5
to
4af5b0b
Compare
Thanks for working on this. I can add some context: The problem with creating silences is that the operation is not idempotent in [upstream] alertmanager. Each API call creates a new silence with a new UUID, which are then replicated to the peers. Adding retries mean you could end up with multiple silences being created from a single API call. e.g. The API call to the first replica is started but doesn't return (e.g. maybe a timeout is hit). That silence is created and the replicated around. If we retry, then the calls to the second/third replica will continue and create duplicates. It's arguable what behavior is better, ideally, we would add an API to upstream alertmanager allowing idempotent silence creation, but whilst this is certainly achievable, it's non-trivial. The same problem exists with deletion, it is not idempotent (it returns 4xx if the ID is not found), so you could delete from one replica, try the second, and it has already been deleted due to replication, so you get a 4xx error. This is not such a big problem as for creation. With regards to |
Hi @stevesg Yeah.. i was indeed thinking about the Put Silence... I will see if there is something we can do there. About the delete i think is fine right? If i get a 4xx on the first and retry in the second, it should be fine. For |
My over-arching point/question was whether we should aim to use quorum operations for everything, so the distributor is logically as transparent as possible (modulo result merging).
Agreed we can probably make delete work with upstream as-is, preserving the alertmanager API semantic of returning 404 is the annoying bit. Wondering if we should just do quorum write here too? This would make deletes be synchronously replicated. The resulting merging could be along the lines of: if all 404, then 404 else if all 200 or 404 then 200 else error.
Absolutely, I was suggesting that you could use the existing quorum mechanism to save adding complexity. |
For adding a silence, could we change the API so that the caller makes up a UUID for the request and passes that in, rather than the alertmanager making one and returning it? |
Hi @bboreham and @stevesg .. Thanks for the inputs.
Will it be tricky to do this using @bboreham The upstream alertmanager API? I like the idea of making the caller send the UUID, but this seems to be a breaking change, no? Besides that, can we change this PR to address the read/delete path and create a new one/issue to address the Put Silence? |
Yes I did mean change the API, but we could have it fall back to the existing behaviour if no UUID supplied, for backwards-compatibility.
Fine. |
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions. |
Still relevant.. |
… AlertManager UnaryPath GET Requests on next replica if one fail. Signed-off-by: Krishna Teja Puttagunta <krishtez@amazon.com>
Closing as this is being worked on #4840 |
… AlertManager UnaryPath GET Requests on next replica if one fail. Signed-off-by: Krishna Teja Puttagunta <krishtez@amazon.com>
… AlertManager UnaryPath GET Requests on next replica if one fail. Signed-off-by: Krishna Teja Puttagunta <krishtez@amazon.com>
… AlertManager UnaryPath GET Requests on next replica if one fail. Signed-off-by: Krishna Teja Puttagunta <krishtez@amazon.com>
… AlertManager UnaryPath GET Requests on next replica if one fail. Signed-off-by: Krishna Teja Puttagunta <krishtez@amazon.com>
… AlertManager UnaryPath GET Requests on next replica if one fail. Signed-off-by: Krishna Teja Puttagunta <krishtez@amazon.com>
… AlertManager UnaryPath GET Requests on next replica if one fail. Signed-off-by: Krishna Teja Puttagunta <krishtez@amazon.com>
… AlertManager UnaryPath GET Requests on next replica if one fail. Signed-off-by: Krishna Teja Puttagunta <krishtez@amazon.com>
… AlertManager UnaryPath GET Requests on next replica if one fail. Signed-off-by: Krishna Teja Puttagunta <krishtez@amazon.com>
… UnaryPath GET Requests on next replica if one fail. (#4840) Signed-off-by: Krishna Teja Puttagunta <krishtez@amazon.com> Signed-off-by: Krishna Teja Puttagunta <krishtez@amazon.com> Signed-off-by: Krishna Teja Puttagunta <krishnateja325@gmail.com>
What this PR does:
AlertManager unary APIs are failing due a single alert manager fail as distributor picks a random one to forward the request.
This change only pick the replicas alert managers in case of a error.
Which issue(s) this PR fixes:
Fixes #
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]