Add experimental stage that will catch race conditions between nflog and pipeline #268

Merged
yuri-tceretian merged 6 commits into main from yuri-tceretian/gma-dedup-stage on Jan 29, 2025

@yuri-tceretian
Collaborator

This PR adds a new pipeline stage that is similar to DedupStage, but it also compares the timestamp at which the pipeline was flushed with the timestamp of the state. If the latter is greater than the former, that indicates the pipeline has already been executed.

The stage can be configured to react to this case in one of two ways: only log it, or stop the pipeline. This will be controlled by a feature flag in Grafana.

This is a non-intrusive way to estimate how often the situation described in
https://github.com/prometheus/alertmanager/pull/3283/files
happens in practice, and it adds the ability to stop the notification pipeline in that case.

@yuri-tceretian yuri-tceretian requested a review from a team as a code owner January 22, 2025 22:28
Comment thread notify/grafana_alertmanager.go Outdated
Comment thread notify/grafana_alertmanager.go Outdated
Comment thread notify/grafana_alertmanager.go Outdated
Comment thread notify/grafana_alertmanager.go Outdated
Comment thread notify/pipeline/coordination_stage.go Outdated
Contributor

@titolins titolins left a comment

🚀

Comment thread notify/grafana_alertmanager.go Outdated
@yuri-tceretian yuri-tceretian force-pushed the yuri-tceretian/gma-dedup-stage branch from b8cca87 to 670638a Compare January 24, 2025 15:54
@yuri-tceretian yuri-tceretian force-pushed the yuri-tceretian/gma-dedup-stage branch from 670638a to f101d43 Compare January 24, 2025 16:00
@yuri-tceretian yuri-tceretian merged commit 45781ef into main Jan 29, 2025
@yuri-tceretian yuri-tceretian deleted the yuri-tceretian/gma-dedup-stage branch January 29, 2025 21:57
rwwiv pushed a commit that referenced this pull request Jul 29, 2025
rwwiv added a commit that referenced this pull request Jul 29, 2025
* Alerting: Adding color option for slack receiver (#270)

Adding color option for slack receiver

Closes #251
Related to grafana/grafana#99615

* Update victorops to support loading url from secrets (#272)

* Add experimental stage that will catch race conditions between nflog and pipeline (#268)

* Alerting: Refactor extendAlert function (#267)

* Alerting: Refactor extendAlert function

Clean it up to require less mutation of a single url.URL struct.

Now uses a baseURL + shallow copying in distinct generator functions.

* Reorder field calculations to prevent early returns from unrelated parse errors

* Debug -> Warn

* Move orgID setQueryParam to baseURL

* Add EmbeddedContents and use in email notifier (#275)

* Add EmbeddedContents to email sender and use in email notifier

* Revert "Validate retry and expire in Pushover receiver (#233)" (#244)

This reverts commit 70248a7.

* Add more context to log in PipelineAndStateTimestampCoordinationStage (#277)

* add more context to warning log

* add integration and receiver names

* Update notify/stages/coordination_stage.go

Co-authored-by: Tito Lins <tito.linsesilva@grafana.com>


---------

Co-authored-by: Tito Lins <tito.linsesilva@grafana.com>

* Update Alertmanager fork to latest commit (#279)

Add upstream Jira integration

* Copy http client from Grafana (#281)

* copy from Grafana

* make client implement interface

* move comment

* make user agent configurable

to be able to use in mimir

* copy tests from mimir

* refactor BuildReceiverIntegrations to accept config instead of factory

* remove useless tests

* lint

* Alerting: Fix token-based Slack image upload to work with channel names (#284)

* Alerting: Fix slack image upload to work with channel names

Previous change broke slack integrations defined with recipient
as channel name. This is because files_upload_v2 doesn't support
channel names anymore and requires channel ID. Now we obtain the
channel ID from the chat.postMessage response for bot token
image uploads.

* Fix flaky test caused by non-deterministic InitialComment

* Replace drone with GH actions (#282)

* [SKIP CI] Replace drone with GH actions

* delete drone

* Alerting: Sanitize Slack image upload comment labels (#286)

* Add failing regression test

* Construct Slack image upload initial comment using sanitized labels

* Remove redundant alert name label in comment

* New image Provider abstraction to allow for remote AM screenshots (#276)

* Add annotation to hold image url

* Simplify image Provider interface

Previous abstraction change introduced GetImageURL and GetRawImage.
These methods were meant to replace GetImage, however only Discord was
using it.

Switching to these methods would require significant rewrites in
integrations to not use WithStoredImages. So, instead I changed the
abstraction to one that is more compatible with WithStoredImages.

This new abstraction satisfies the same requirements as the previous
in that it hides Path and File information as implementation details
so that Providers can control access to raw data.

* Implement simple URLProvider that retrieves url from annotation

Provider for remote/external AMs, does not allow retrieving raw data and
requires no datastore access.

* Implement TokenProvider that uses token from annotation to query TokenStore

Provider for local AMs, TokenStore results should be considered trusted and
allow retrieving RawData.

* Convert discord to new abstraction

- Goes back to using WithStoredImages.
- Flattens conditionals.
- Less reliance on errors for control flow.

* Convert email to new abstraction

* Convert telegram to new abstraction

- Fairly simple drop-in replacement of os.Open with image.RawData
- Attached form filename is now image Name (basename) instead of full path

* Convert pushover to new abstraction

- Fairly simple drop-in replacement of os.Open with image.RawData
- Attached form filename is now image Name (basename) instead of full path

* Convert slack to new abstraction

- Fairly simple drop-in replacement of os.Open with image.RawData
- Attached form filename is now image Name (basename) instead of full path
- Previous implementation used os.Stat to get file size for creating
the upload URL before reading the data for upload. Now, since we
don't know the size before reading, I first read the
raw data and then create the upload URL once we know the size.

* Fix tests and reimplement FakeProvider

Previous FakeProvider relied on Path and actually creating
files on disk for some tests since integrations had access
to and used these values directly. New abstraction hides this,
so the FakeProvider is much simpler. Testing disk reads should
be left to the Grafana-side provider implementation.

* Remove unused errors

* Address review comments

* Fix over-specific test assert

* Use correct image size in slack multipart upload

* Add Jira integration (#280)

Co-authored-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>

* Use correct template function in sns receiver (#289)

* Support HMAC SHA256 request signing in the http client (#285)

Adds optional HMAC SHA256 signing to the HTTP client.

A new HMACRoundTripper (in http/hmac.go) computes and adds a signature for outgoing webhooks. The signature is computed from the request body.

Optionally, a timestamp header name can be specified, and if it's set, the signature is computed from {timestamp}:{body}, and the timestamp is added to the headers for the client to verify.
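
The signing scheme described for #285 can be illustrated with the Go standard library. This is a sketch under assumptions, not the contents of http/hmac.go: the hex encoding of the signature and the function names `sign` and `verify` are hypothetical, and the header handling done by the round tripper is left out.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// sign returns a hex-encoded HMAC SHA256 signature (the encoding is an
// assumption). With an empty timestamp the signed payload is the body alone;
// otherwise it is "{timestamp}:{body}".
func sign(secret, body []byte, timestamp string) string {
	mac := hmac.New(sha256.New, secret)
	if timestamp != "" {
		mac.Write([]byte(timestamp))
		mac.Write([]byte(":"))
	}
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

// verify recomputes the signature on the receiving side and compares it in
// constant time.
func verify(secret, body []byte, timestamp, signature string) bool {
	expected := sign(secret, body, timestamp)
	return hmac.Equal([]byte(expected), []byte(signature))
}

func main() {
	secret := []byte("example-secret")
	body := []byte(`{"status":"firing"}`)

	sig := sign(secret, body, "1738187820")
	fmt.Println("valid:", verify(secret, body, "1738187820", sig))
	fmt.Println("wrong timestamp:", verify(secret, body, "1738187821", sig))
}
```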

* Try additional scopes when testing templates with root scope fails (#290)

Expand template testing to try additional scopes if the root scope fails.
This mitigates errors for definitions like pagerduty.default.instances,
which require the .Alerts scope. Added support for .Alerts and .Alert
scopes.

* Fix UTF-8 not allowed in Equal field for inhibition rules (#271)

Add tests for inhibit rules Equals and EqualsStr

This commit adds tests for decoding inhibition rules as JSON and YAML,
to make sure that the new EqualsStr field is decoded into Equals.
See prometheus/alertmanager#4177.

* WaitStage for Grafana Alertmanager (#278)

* Revert "Fix UTF-8 not allowed in Equal field for inhibition rules" (#298)

Revert "Fix UTF-8 not allowed in Equal field for inhibition rules (#271)"

This reverts commit 6fd0c49.

* Include time range in template dashboard and panel urls (#296)

Add from=&to= to dashboard and panel url annotations on ExtendedAlert

Improve context of linked dashboards and panels to more specifically
target the range of firing alert.

Firing: from=StartsAt-1hr&to=Current time
Resolved: from=StartsAt-1hr&to=EndsAt

* Disable keepalives in NewTLSClient (#359)

This client is only used in short lived contexts so we don't want to
keep the connections it opens alive since they can't be reused.

* Update moved NewTLSClient with changes from #359

---------

Co-authored-by: Garret Wyman <garret_wyman@hotmail.com>
Co-authored-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>
Co-authored-by: Matthew Jacobson <matthew.jacobson@grafana.com>
Co-authored-by: Alexander Akhmetov <me@alx.cx>
Co-authored-by: Tito Lins <tito.linsesilva@grafana.com>
Co-authored-by: Santiago <santiagohernandez.1997@gmail.com>
Co-authored-by: George Robinson <george.robinson@grafana.com>