bugfix: fix leaking of Silences matcherCache entries #3930
Hello! 👋 Thank you for opening this PR. I haven't had time to do an in-depth review, but my initial impressions are fantastic work! Thank you for tracking this down and also creating a fix. I agree with the decision to change the key from a pointer to the UUID, and it is my understanding that you cannot change the matchers of a silence without creating a new UUID, therefore it should never be possible to read stale matchers from the cache. I'll take an in-depth look later this week.
silence/silence_test.go
Outdated
}} | ||
} | ||
|
||
cases := map[string]testCase{ |
It's more idiomatic to write test cases like this
diff --git a/silence/silence_test.go b/silence/silence_test.go
index c290522a..7a1bf335 100644
--- a/silence/silence_test.go
+++ b/silence/silence_test.go
@@ -114,11 +114,6 @@ func TestSilenceGCOverTime(t *testing.T) {
s *pb.Silence
expectPresentAfterGc bool
}
- type testCase struct {
- initialState []silenceEntry
- updates []silenceEntry
- expectedGCCount int
- }
c := clock.NewMock()
now := c.Now().UTC()
@@ -133,35 +128,39 @@ func TestSilenceGCOverTime(t *testing.T) {
}}
}
- cases := map[string]testCase{
- "gc does not clean active silences": {
- initialState: []silenceEntry{
- {s: newSilence("1", now), expectPresentAfterGc: false},
- {s: newSilence("2", now.Add(-time.Second)), expectPresentAfterGc: false},
- {s: newSilence("3", now.Add(time.Second)), expectPresentAfterGc: true},
- },
+ cases := []struct {
+ name string
+ initialState []silenceEntry
+ updates []silenceEntry
+ expectedGCCount int
+ }{{
+ name: "gc does not clean active silences",
+ initialState: []silenceEntry{
+ {s: newSilence("1", now), expectPresentAfterGc: false},
+ {s: newSilence("2", now.Add(-time.Second)), expectPresentAfterGc: false},
+ {s: newSilence("3", now.Add(time.Second)), expectPresentAfterGc: true},
},
- "silences added with Set are handled correctly": {
- initialState: []silenceEntry{
- {s: newSilence("1", now), expectPresentAfterGc: false},
- },
- updates: []silenceEntry{
- {s: newSilence("", now.Add(time.Second)), expectPresentAfterGc: true},
- {s: newSilence("", now.Add(-time.Second)), expectPresentAfterGc: false},
- },
+ }, {
+ name: "silences added with Set are handled correctly",
+ initialState: []silenceEntry{
+ {s: newSilence("1", now), expectPresentAfterGc: false},
},
- "silence update does not leak state": {
- initialState: []silenceEntry{
- {s: newSilence("1", now), expectPresentAfterGc: false},
- },
- updates: []silenceEntry{
- {s: newSilence("1", now.Add(time.Second)), expectPresentAfterGc: true},
- },
+ updates: []silenceEntry{
+ {s: newSilence("", now.Add(time.Second)), expectPresentAfterGc: true},
+ {s: newSilence("", now.Add(-time.Second)), expectPresentAfterGc: false},
},
- }
+ }, {
+ name: "silence update does not leak state",
+ initialState: []silenceEntry{
+ {s: newSilence("1", now), expectPresentAfterGc: false},
+ },
+ updates: []silenceEntry{
+ {s: newSilence("1", now.Add(time.Second)), expectPresentAfterGc: true},
+ },
+ }}
- for name, tc := range cases {
- t.Run(name, func(t *testing.T) {
+ for _, tc := range cases {
+ t.Run(tc.name, func(t *testing.T) {
silences, err := New(Options{})
silClock := clock.NewMock()
silences.clock = silClock
silence/silence_test.go
Outdated
// simulate this silences being seen in a query | ||
silences.mc.Get(silences.st[sil.s.Id].Silence) | ||
} | ||
silClock.Add(-time.Second) |
I'm not sure I understand the significance of rewinding the clock, although I can see that if I comment this out the test fails. How does rewinding the clock help us test GC behavior?
Thanks for pointing this out! I did this because adding a silence with `Set` that expires before the clock's `now` is a no-op. Typically, this case is handled before we reach `Set`. However, rewinding the clock is a really unclear way to handle this.
Instead, I should've started with the clock 2 seconds behind and then incremented the clock forward by one second after the `initialState` is applied and then again after `updates` are applied. That has the exact same behavior, but is much more expressive to the reader.
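The clock ordering proposed above can be sketched in a few lines. This is only an illustration under stated assumptions: `fakeClock`, `store`, `set`, `gc`, and `runScenario` are hypothetical stand-ins, not Alertmanager's `Silences` API or the mock clock used in the real test, and `set` rejecting an already-expired entry is a simplified model of the no-op described in the comment.

```go
package main

import (
	"fmt"
	"time"
)

// fakeClock is a hypothetical minimal mock clock (not the library used in the PR).
type fakeClock struct{ now time.Time }

func (c *fakeClock) Now() time.Time      { return c.now }
func (c *fakeClock) Add(d time.Duration) { c.now = c.now.Add(d) }

// store is a hypothetical stand-in for the silences state.
type store struct {
	clock  *fakeClock
	endsAt map[string]time.Time
}

// set models the assumption that adding an entry which has already
// expired relative to the clock is a no-op.
func (s *store) set(id string, endsAt time.Time) bool {
	if endsAt.Before(s.clock.Now()) {
		return false
	}
	s.endsAt[id] = endsAt
	return true
}

// gc removes every entry whose end time is not after the current time
// and reports how many entries it removed.
func (s *store) gc() int {
	n := 0
	for id, end := range s.endsAt {
		if !end.After(s.clock.Now()) {
			delete(s.endsAt, id)
			n++
		}
	}
	return n
}

// runScenario starts the clock 2 seconds behind, then steps it forward
// once after the initial state and once after the updates.
func runScenario() (collected, remaining int) {
	base := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	c := &fakeClock{now: base.Add(-2 * time.Second)}
	s := &store{clock: c, endsAt: map[string]time.Time{}}

	s.set("1", base) // initial state, applied at base-2s
	c.Add(time.Second)

	s.set("2", base.Add(time.Second))  // update, applied at base-1s
	s.set("3", base.Add(-time.Second)) // not yet expired at base-1s, so accepted
	c.Add(time.Second)

	collected = s.gc() // runs at base
	return collected, len(s.endsAt)
}

func main() {
	collected, remaining := runScenario()
	fmt.Println("collected:", collected, "remaining:", remaining)
}
```

Because the clock only ever moves forward, each entry is still valid when it is written, and the reader can see directly which entries have expired by the time the collection runs.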
Thanks for the review! I will make all the requested stylistic changes and the change to how the test clock is handled.
I think the code is fine, but the tests are still very hard to follow and understand. I'm going to make an attempt at refactoring them further in a local branch.
silence/silence_test.go
Outdated
func TestSilenceGCOverTime(t *testing.T) { | ||
type silenceEntry struct { | ||
s *pb.Silence | ||
expectPresentAfterGc bool |
I think this is more complicated than it needs to be. Much easier if we invert the bool.
expectPresentAfterGc bool | |
expectGC bool |
silence/silence_test.go
Outdated
silences.clock = silClock | ||
|
||
// Set time into the past so that silences will be updated | ||
// before they're endsAt |
// before they're endsAt | |
// before their endsAt |
silence/silence_test.go
Outdated
name string | ||
initialState []silenceEntry | ||
updates []silenceEntry | ||
expectedGCCount int |
`expectedGCCount` is not used?
it is used on line 194
No that's a different variable with the same name.
silence/silence_test.go
Outdated
}, | ||
}, | ||
{ | ||
name: "silences added with Set are handled correctly", |
handled correctly
What does this mean?
Ah this case name isn't great - in this case "handled correctly" just means "all the invariants we're testing remain satisfied"
"all the invariants we're testing remain satisfied"
Yeah but what are the invariants being tested? I'm sort of coming at this from the perspective of someone who hasn't reviewed this PR but needs to understand the tests. It's so hard to understand what is being tested here from looking at the test case:
{
name: "silences added with Set are handled correctly",
initialState: []silenceEntry{
{s: newSilence("1", now), expectPresentAfterGc: false},
},
updates: []silenceEntry{
{s: newSilence("", now.Add(time.Second)), expectPresentAfterGc: true},
{s: newSilence("", now.Add(-time.Second)), expectPresentAfterGc: false},
},
},
This is how I would make the tests simpler. Let me know if I'm missing a test case.
func TestSilenceGCOverTime(t *testing.T) {
t.Run("GC does not remove active silences", func(t *testing.T) {
s, err := New(Options{})
require.NoError(t, err)
s.clock = clock.NewMock()
now := s.nowUTC()
s.st = state{
"1": &pb.MeshSilence{Silence: &pb.Silence{Id: "1"}, ExpiresAt: now},
"2": &pb.MeshSilence{Silence: &pb.Silence{Id: "2"}, ExpiresAt: now.Add(-time.Second)},
"3": &pb.MeshSilence{Silence: &pb.Silence{Id: "3"}, ExpiresAt: now.Add(time.Second)},
}
want := state{
"3": &pb.MeshSilence{Silence: &pb.Silence{Id: "3"}, ExpiresAt: now.Add(time.Second)},
}
n, err := s.GC()
require.NoError(t, err)
require.Equal(t, 2, n)
require.Equal(t, want, s.st)
})
// This test checks for a memory leak that occurred in the matcher cache when
// updating an existing silence.
t.Run("Updating an existing silences does not leak cache entries", func(t *testing.T) {
s, err := New(Options{})
require.NoError(t, err)
clock := clock.NewMock()
s.clock = clock
sil1 := &pb.Silence{
Id: "1",
Matchers: []*pb.Matcher{{
Type: pb.Matcher_EQUAL,
Name: "foo",
Pattern: "bar",
}},
StartsAt: clock.Now(),
EndsAt: clock.Now().Add(time.Minute),
}
s.st["1"] = &pb.MeshSilence{Silence: sil1, ExpiresAt: clock.Now().Add(time.Minute)}
// Need to query the silence to populate the matcher cache.
s.Query(QMatches(model.LabelSet{"foo": "bar"}))
require.Len(t, s.mc, 1)
// must clone sil1 before updating it.
sil2 := cloneSilence(sil1)
require.NoError(t, s.Set(sil2))
// The memory leak occurred because updating a silence would add a new
// entry in the matcher cache even though no new silence was created.
// This check asserts that this no longer happens.
require.Len(t, s.st, 1)
require.Len(t, s.mc, 1)
// Move time forward and both silence and cache entry should be garbage
// collected.
clock.Add(time.Minute)
n, err := s.GC()
require.NoError(t, err)
require.Equal(t, 1, n)
require.Len(t, s.st, 0)
require.Len(t, s.mc, 0)
})
} |
This looks good to me, but it does make it a bit harder to add new cases in the future. There's no test here which actually validates that the GC runs as expected with silences added via the normal The Would you like me to replace my test in this PR with this new implementation? |
I think writing these specific tests as table-driven tests make them more difficult to understand. There are lots of cases where table-driven tests do make a lot of sense, but I don't think this is one of those. For example, my comment here highlights what I mean.
We can fix that 👍
👍
Yes please! Let me first add a new comment that has the updated tests including your feedback 👍 |
Sure, I think that's fair enough. The way you've rewritten does seem more readable to me as well. Regardless, I'm very happy to conform to whatever is conventional for Alertmanager.
Alright, great. Thanks! |
func TestSilenceGCOverTime(t *testing.T) {
t.Run("GC does not remove active silences", func(t *testing.T) {
s, err := New(Options{})
require.NoError(t, err)
s.clock = clock.NewMock()
now := s.nowUTC()
s.st = state{
"1": &pb.MeshSilence{Silence: &pb.Silence{Id: "1"}, ExpiresAt: now},
"2": &pb.MeshSilence{Silence: &pb.Silence{Id: "2"}, ExpiresAt: now.Add(-time.Second)},
"3": &pb.MeshSilence{Silence: &pb.Silence{Id: "3"}, ExpiresAt: now.Add(time.Second)},
}
want := state{
"3": &pb.MeshSilence{Silence: &pb.Silence{Id: "3"}, ExpiresAt: now.Add(time.Second)},
}
n, err := s.GC()
require.NoError(t, err)
require.Equal(t, 2, n)
require.Equal(t, want, s.st)
})
t.Run("GC does not leak cache entries", func(t *testing.T) {
s, err := New(Options{})
require.NoError(t, err)
clock := clock.NewMock()
s.clock = clock
sil1 := &pb.Silence{
Matchers: []*pb.Matcher{{
Type: pb.Matcher_EQUAL,
Name: "foo",
Pattern: "bar",
}},
StartsAt: clock.Now(),
EndsAt: clock.Now().Add(time.Minute),
}
require.NoError(t, s.Set(sil1))
// Need to query the silence to populate the matcher cache.
s.Query(QMatches(model.LabelSet{"foo": "bar"}))
require.Len(t, s.st, 1)
require.Len(t, s.mc, 1)
// Move time forward and both silence and cache entry should be garbage
// collected.
clock.Add(time.Minute)
n, err := s.GC()
require.NoError(t, err)
require.Equal(t, 1, n)
require.Len(t, s.st, 0)
require.Len(t, s.mc, 0)
})
t.Run("replacing a silences does not leak cache entries", func(t *testing.T) {
s, err := New(Options{})
require.NoError(t, err)
clock := clock.NewMock()
s.clock = clock
sil1 := &pb.Silence{
Matchers: []*pb.Matcher{{
Type: pb.Matcher_EQUAL,
Name: "foo",
Pattern: "bar",
}},
StartsAt: clock.Now(),
EndsAt: clock.Now().Add(time.Minute),
}
require.NoError(t, s.Set(sil1))
// Need to query the silence to populate the matcher cache.
s.Query(QMatches(model.LabelSet{"foo": "bar"}))
require.Len(t, s.st, 1)
require.Len(t, s.mc, 1)
// must clone sil1 before replacing it.
sil2 := cloneSilence(sil1)
sil2.Matchers = []*pb.Matcher{{
Type: pb.Matcher_EQUAL,
Name: "bar",
Pattern: "baz",
}}
require.NoError(t, s.Set(sil2))
// Need to query the silence to populate the matcher cache.
s.Query(QMatches(model.LabelSet{"bar": "baz"}))
require.Len(t, s.st, 2)
require.Len(t, s.mc, 2)
// Move time forward and both silence and cache entry should be garbage
// collected.
clock.Add(time.Minute)
n, err := s.GC()
require.NoError(t, err)
require.Equal(t, 2, n)
require.Len(t, s.st, 0)
require.Len(t, s.mc, 0)
})
// This test checks for a memory leak that occurred in the matcher cache when
// updating an existing silence.
t.Run("updating a silences does not leak cache entries", func(t *testing.T) {
s, err := New(Options{})
require.NoError(t, err)
clock := clock.NewMock()
s.clock = clock
sil1 := &pb.Silence{
Id: "1",
Matchers: []*pb.Matcher{{
Type: pb.Matcher_EQUAL,
Name: "foo",
Pattern: "bar",
}},
StartsAt: clock.Now(),
EndsAt: clock.Now().Add(time.Minute),
}
s.st["1"] = &pb.MeshSilence{Silence: sil1, ExpiresAt: clock.Now().Add(time.Minute)}
// Need to query the silence to populate the matcher cache.
s.Query(QMatches(model.LabelSet{"foo": "bar"}))
require.Len(t, s.mc, 1)
// must clone sil1 before updating it.
sil2 := cloneSilence(sil1)
require.NoError(t, s.Set(sil2))
// The memory leak occurred because updating a silence would add a new
// entry in the matcher cache even though no new silence was created.
// This check asserts that this no longer happens.
require.Len(t, s.st, 1)
require.Len(t, s.mc, 1)
// Move time forward and both silence and cache entry should be garbage
// collected.
clock.Add(time.Minute)
n, err := s.GC()
require.NoError(t, err)
require.Equal(t, 1, n)
require.Len(t, s.st, 0)
require.Len(t, s.mc, 0)
})
} |
// The memory leak occurred because updating a silence would add a new | ||
// entry in the matcher cache even though no new silence was created. | ||
// This check asserts that this no longer happens. | ||
s.Query(QMatches(model.LabelSet{"foo": "bar"})) |
I added this line because the memory leak only occurs if the matcher cache is populated because of a query.
@gotjosh @simonpasquier I approve this fix. Please take a look so we can get it into the next release! Thanks 💯
Sorry, there are a couple of lint failures where we need to use |
Can you also rebase |
Signed-off-by: Ethan Hunter <ehunter@hudson-trading.com>
It looks like that test is still broken - is it possible that it's broken on main? As far as I can tell, the diff between this branch and main doesn't include any frontend files.
Looks like all PRs are broken due to frontend tests. We will fix it 👍 You'll need to rebase a second time once it's fixed.
Thank you very much for your contribution and @grobinson-grafana for reviewing.
* [CHANGE] Deprecate and remove api/v1/ #2970 * [CHANGE] Remove unused feature flags #3676 * [CHANGE] Newlines in smtp password file are now ignored #3681 * [CHANGE] Change compat metrics to counters #3686 * [CHANGE] Do not register compat metrics in amtool #3713 * [CHANGE] Remove metrics from compat package #3714 * [CHANGE] Mark muted alerts #3793 * [FEATURE] Add metric for inhibit rules #3681 * [FEATURE] Support UTF-8 label matchers #3453, #3507, #3523, #3483, #3567, #3568, #3569, #3571, #3595, #3604, #3619, #3658, #3659, #3662, #3668, 3572 * [FEATURE] Add counter to track alerts dropped outside of time_intervals #3565 * [FEATURE] Add date and tz functions to templates #3812 * [FEATURE] Add limits for silences #3852 * [FEATURE] Add time helpers for templates #3863 * [FEATURE] Add auto GOMAXPROCS #3837 * [FEATURE] Add auto GOMEMLIMIT #3895 * [FEATURE] Add Jira receiver integration #3590 * [ENHANCEMENT] Add the receiver name to notification metrics #3045 * [ENHANCEMENT] Add the route ID to uuid #3372 * [ENHANCEMENT] Add duration to the notify success message #3559 * [ENHANCEMENT] Implement webhook_url_file for discord and msteams #3555 * [ENHANCEMENT] Add debug logs for muted alerts #3558 * [ENHANCEMENT] API: Allow the Silences API to use their own 400 response #3610 * [ENHANCEMENT] Add summary to msteams notification #3616 * [ENHANCEMENT] Add context reasons to notifications failed counter #3631 * [ENHANCEMENT] Add optional native histogram support to latency metrics #3737 * [ENHANCEMENT] Enable setting ThreadId for Telegram notifications #3638 * [ENHANCEMENT] Allow webex roomID from template #3801 * [BUGFIX] Add missing integrations to notify metrics #3480 * [BUGFIX] Add missing ttl in pushhover #3474 * [BUGFIX] Fix scheme required for webhook url in amtool #3409 * [BUGFIX] Remove duplicate integration from metrics #3516 * [BUGFIX] Reflect Discord's max length message limits #3597 * [BUGFIX] Fix nil error in warn logs about incompatible matchers #3683 * [BUGFIX] 
Fix a small number of inconsistencies in compat package logging #3718 * [BUGFIX] Fix log line in featurecontrol #3719 * [BUGFIX] Fix panic in acceptance tests #3592 * [BUGFIX] Fix flaky test TestClusterJoinAndReconnect/TestTLSConnection #3722 * [BUGFIX] Fix crash on errors when url_file is used #3800 * [BUGFIX] Fix race condition in dispatch.go #3826 * [BUGFIX] Fix race conditions in the memory alerts store #3648 * [BUGFIX] Hide config.SecretURL when the URL is incorrect. #3887 * [BUGFIX] Fix invalid silence causes incomplete updates #3898 * [BUGFIX] Fix leaking of Silences matcherCache entries #3930 * [BUGFIX] Close SMTP submission correctly to handle errors #4006 Signed-off-by: SuperQ <superq@gmail.com>
* Release v0.28.0-rc.0 * [CHANGE] Templating errors in the SNS integration now return an error. #3531 #3879 * [FEATURE] Add a new Microsoft Teams integration based on Flows #4024 * [FEATURE] Add a new Rocket.Chat integration #3600 * [FEATURE] Add a new Jira integration #3590 #3931 * [FEATURE] Add support for `GOMEMLIMIT`, enable it via the feature flag `--enable-feature=auto-gomemlimit`. #3895 * [FEATURE] Add support for `GOMAXPROCS`, enable it via the feature flag `--enable-feature=auto-gomaxprocs`. #3837 * [FEATURE] Add support for limits of silences including the maximum number of active and pending silences, and the maximum size per silence (in bytes). You can use the flags `--silences.max-silences` and `--silences.max-silence-size-bytes` to set them accordingly #3852 #3862 #3866 #3885 #3886 #3877 * [FEATURE] Muted alerts now show whether they are suppressed or not in both the `/api/v2/alerts` endpoint and the Alertmanager UI. #3793 #3797 #3792 * [ENHANCEMENT] Add support for `content`, `username` and `avatar_url` in the Discord integration. `content` and `username` also support templating. #4007 * [ENHANCEMENT] Only invalidate the silences cache if a new silence is created or an existing silence replaced - should improve latency on both `GET api/v2/alerts` and `POST api/v2/alerts` API endpoint. #3961 * [ENHANCEMENT] Add image source label to Dockerfile. To get changelogs shown when using Renovate #4062 * [ENHANCEMENT] Build using go 1.23 #4071 * [ENHANCEMENT] Support setting a global SMTP TLS configuration. #3732 * [ENHANCEMENT] The setting `room_id` in the WebEx integration can now be templated to allow for dynamic room IDs. #3801 * [ENHANCEMENT] Enable setting `message_thread_id` for the Telegram integration. #3638 * [ENHANCEMENT] Support the `since` and `humanizeDuration` functions to templates. This means users can now format time to more human-readable text. #3863 * [ENHANCEMENT] Support the `date` and `tz` functions to templates. 
This means users can now format time in a specified format and also change the timezone to their specific locale. #3812 * [ENHANCEMENT] Latency metrics now support native histograms. #3737 * [BUGFIX] Fix the SMTP integration not correctly closing an SMTP submission, which may lead to unsuccessful dispatches being marked as successful. #4006 * [BUGFIX] The `ParseMode` option is now set explicitly in the Telegram integration. If we don't HTML tags had not been parsed by default. #4027 * [BUGFIX] Fix a memory leak that was caused by updates silences continuously. #3930 * [BUGFIX] Fix hiding secret URLs when the URL is incorrect. #3887 * [BUGFIX] Fix a race condition in the alerts - it was more of a hypothetical race condition that could have occurred in the alert reception pipeline. #3648 * [BUGFIX] Fix a race condition in the alert delivery pipeline that would cause a firing alert that was delivered earlier to be deleted from the aggregation group when instead it should have been delivered again. #3826 * [BUGFIX] Fix version in APIv1 deprecation notice. #3815 * [BUGFIX] Fix crash errors when using `url_file` in the Webhook integration. #3800 * [BUGFIX] fix `Route.ID()` returns conflicting IDs. #3803 * [BUGFIX] Fix deadlock on the alerts memory store. #3715 * [BUGFIX] Fix `amtool template render` when using the default values. #3725 * [BUGFIX] Fix `webhook_url_file` for both the Discord and Microsoft Teams integrations. #3728 #3745 --------- Signed-off-by: SuperQ <superq@gmail.com> Signed-off-by: gotjosh <josue.abreu@gmail.com> Co-authored-by: gotjosh <josue.abreu@gmail.com>
* [CHANGE] Templating errors in the SNS integration now return an error. #3531 #3879 * [CHANGE] Adopt log/slog, drop go-kit/log #4089 * [FEATURE] Add a new Microsoft Teams integration based on Flows #4024 * [FEATURE] Add a new Rocket.Chat integration #3600 * [FEATURE] Add a new Jira integration #3590 #3931 * [FEATURE] Add support for `GOMEMLIMIT`, enable it via the feature flag `--enable-feature=auto-gomemlimit`. #3895 * [FEATURE] Add support for `GOMAXPROCS`, enable it via the feature flag `--enable-feature=auto-gomaxprocs`. #3837 * [FEATURE] Add support for limits of silences including the maximum number of active and pending silences, and the maximum size per silence (in bytes). You can use the flags `--silences.max-silences` and `--silences.max-silence-size-bytes` to set them accordingly #3852 #3862 #3866 #3885 #3886 #3877 * [FEATURE] Muted alerts now show whether they are suppressed or not in both the `/api/v2/alerts` endpoint and the Alertmanager UI. #3793 #3797 #3792 * [ENHANCEMENT] Add support for `content`, `username` and `avatar_url` in the Discord integration. `content` and `username` also support templating. #4007 * [ENHANCEMENT] Only invalidate the silences cache if a new silence is created or an existing silence replaced - should improve latency on both `GET api/v2/alerts` and `POST api/v2/alerts` API endpoint. #3961 * [ENHANCEMENT] Add image source label to Dockerfile. To get changelogs shown when using Renovate #4062 * [ENHANCEMENT] Build using go 1.23 #4071 * [ENHANCEMENT] Support setting a global SMTP TLS configuration. #3732 * [ENHANCEMENT] The setting `room_id` in the WebEx integration can now be templated to allow for dynamic room IDs. #3801 * [ENHANCEMENT] Enable setting `message_thread_id` for the Telegram integration. #3638 * [ENHANCEMENT] Support the `since` and `humanizeDuration` functions to templates. This means users can now format time to more human-readable text. 
#3863 * [ENHANCEMENT] Support the `date` and `tz` functions to templates. This means users can now format time in a specified format and also change the timezone to their specific locale. #3812 * [ENHANCEMENT] Latency metrics now support native histograms. #3737 * [ENHANCEMENT] Add timeout option for webhook notifier. #4137 * [BUGFIX] Fix the SMTP integration not correctly closing an SMTP submission, which may lead to unsuccessful dispatches being marked as successful. #4006 * [BUGFIX] The `ParseMode` option is now set explicitly in the Telegram integration. If we don't HTML tags had not been parsed by default. #4027 * [BUGFIX] Fix a memory leak that was caused by updates silences continuously. #3930 * [BUGFIX] Fix hiding secret URLs when the URL is incorrect. #3887 * [BUGFIX] Fix a race condition in the alerts - it was more of a hypothetical race condition that could have occurred in the alert reception pipeline. #3648 * [BUGFIX] Fix a race condition in the alert delivery pipeline that would cause a firing alert that was delivered earlier to be deleted from the aggregation group when instead it should have been delivered again. #3826 * [BUGFIX] Fix version in APIv1 deprecation notice. #3815 * [BUGFIX] Fix crash errors when using `url_file` in the Webhook integration. #3800 * [BUGFIX] fix `Route.ID()` returns conflicting IDs. #3803 * [BUGFIX] Fix deadlock on the alerts memory store. #3715 * [BUGFIX] Fix `amtool template render` when using the default values. #3725 * [BUGFIX] Fix `webhook_url_file` for both the Discord and Microsoft Teams integrations. #3728 #3745 * [BUGFIX] Fix wechat api link #4084 * [BUGFIX] Fix build info metric #4166 Signed-off-by: SuperQ <superq@gmail.com>
There's a small memory leak from the `matcherCache` when a silence is updated in place by this branch of `Silences.Set`:

`Silences.Set` will always create a new silence instance. If `canUpdate` is true, the new instance will replace the old one in the silences state. However, the `matcherCache` is keyed by the pointer to the instance, so the old entry in the `matcherCache` ends up dangling. This means that both the compiled matchers and the silence itself are leaked. The `Silences.GC` run doesn't take care of this because it never searches for dangling references in the `matcherCache`.

We've observed this issue in the real world running a slightly modified version of 0.26.0. In this PR, I've added a new test (`TestSilenceGCOverTime`) which fails when the `matcherCache` leaks entries. This test still fails when run against the latest code on main:

There are a few ways to fix this, but I've chosen to modify `matcherCache` to use the silence UUID as the cache key instead of the pointer to the silence instance. I think this is the best fix because it removes the fragile assumption that the `pb.Silence` pointer will never change over the lifecycle of a silence and replaces it with the existing assumption that the silence's matchers will not change over the lifecycle of a silence. To state this a different way: this fix removes a required invariant and does not add any new ones.

This fix causes the new test cases to pass and has been running in our environment for a while without any problems.
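To make the difference between the two cache keys concrete, here is a minimal, self-contained sketch. It deliberately does not use Alertmanager's real types: `silence`, `ptrCache`, `idCache`, `updateInPlace`, `gcPtr`, and `gcID` are hypothetical stand-ins, and the GC helpers assume that collection only looks up the cache entry for the instance currently held in state.

```go
package main

import "fmt"

// silence is a hypothetical stand-in for pb.Silence.
type silence struct {
	ID       string
	Matchers []string
}

// ptrCache models the old design: cached matchers keyed by instance pointer.
type ptrCache map[*silence][]string

// idCache models the fix: cached matchers keyed by the silence ID.
type idCache map[string][]string

// updateInPlace models an update that stores a fresh instance under the same ID.
func updateInPlace(st map[string]*silence, id string) {
	old := st[id]
	st[id] = &silence{ID: old.ID, Matchers: old.Matchers}
}

// gcPtr removes the silence and the cache entry of the instance currently
// in state (assumption: nothing scans the cache for dangling pointers).
func gcPtr(st map[string]*silence, mc ptrCache, id string) {
	delete(mc, st[id])
	delete(st, id)
}

// gcID removes the silence and its cache entry by ID.
func gcID(st map[string]*silence, mc idCache, id string) {
	delete(mc, id)
	delete(st, id)
}

// leakedPtrEntries runs query, update, query, GC with pointer keys and
// returns the number of cache entries left behind.
func leakedPtrEntries() int {
	st := map[string]*silence{"1": {ID: "1", Matchers: []string{"foo=bar"}}}
	mc := ptrCache{}
	mc[st["1"]] = st["1"].Matchers // a query populates the cache
	updateInPlace(st, "1")
	mc[st["1"]] = st["1"].Matchers // the next query caches the new instance
	gcPtr(st, mc, "1")
	return len(mc)
}

// leakedIDEntries runs the same sequence with ID keys.
func leakedIDEntries() int {
	st := map[string]*silence{"1": {ID: "1", Matchers: []string{"foo=bar"}}}
	mc := idCache{}
	mc["1"] = st["1"].Matchers
	updateInPlace(st, "1")
	mc["1"] = st["1"].Matchers // overwrites the existing entry
	gcID(st, mc, "1")
	return len(mc)
}

func main() {
	fmt.Println("pointer-keyed leftovers:", leakedPtrEntries())
	fmt.Println("ID-keyed leftovers:", leakedIDEntries())
}
```

In this sketch the pointer-keyed cache keeps one entry for the replaced instance after collection, while the ID-keyed cache ends up empty, because the update reuses the key instead of creating a second one.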
I'm not 100% sure, but I suspect this is the root cause of #2659