feat(alertmanager): add service for alertmanager #7136

grandwizard28 · 2025-02-17T12:58:42Z

Summary

adds an alertmanager service

Related Issues / PR's

Screenshots

NA

Affected Areas and Manually Tested Areas

Important

Add Alertmanager service with configuration and state management, and API endpoints for alert handling.

Alertmanager Service:
- Introduces Alertmanager interface in alertmanager.go with methods GetAlerts, PutAlerts, and TestReceiver.
- Implements Service struct in service.go to manage alertmanager servers per organization.
- Adds SyncServers to synchronize alertmanager servers for all organizations.
Configuration Management:
- Adds ConfigStore interface and config struct in sqlalertmanagerstore/config.go for managing alertmanager configurations.
- Implements Get, Set, and ListOrgs methods for configuration management.
State Management:
- Adds StateStore interface and state struct in sqlalertmanagerstore/state.go for managing alertmanager states.
- Implements Get and Set methods for state management.
API Endpoints:
- Introduces API struct in api.go with methods GetAlerts, TestReceiver, GetChannels, GetChannelByID, UpdateChannelByID, and DeleteChannelByID.
- Handles authentication and error rendering for API requests.
Miscellaneous:
- Adds error codes in errors/code.go for various error scenarios.
- Updates server.go to use new state management methods for silences and notification logs.

^{This description was created by}^{for 1c69201. It will automatically update as commits are pushed.}

ellipsis-dev

❌ Changes requested. Reviewed everything up to 7fd4e18 in 3 minutes and 48 seconds

More details

Looked at 1012 lines of code in 14 files
Skipped 0 files when reviewing.
Skipped posting 6 drafted comments based on config settings.

1. pkg/errors/code.go:9

Draft comment:
These error scenarios are already defined using typ struct. Consider consolidating to avoid maintaining parallel error definitions.
Error types constants block (type.go)
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50%
The comment raises a valid point about parallel error definitions. However, there may be a deliberate design choice to have both 'type' and 'code' representations for errors. Without understanding the broader error handling system and how these two concepts differ in usage, suggesting consolidation could be premature. The comment also doesn't provide clear guidance on how to consolidate.
I might be missing important architectural context about why these parallel structures exist. Perhaps there's a good reason to have both type and code representations.
While the duplication is real, suggesting consolidation without understanding the architectural need for separation could lead to incorrect advice.
The comment should be deleted because it requires broader architectural context to determine if this is actually a problem that needs fixing.

2. pkg/alertmanager/internalalertmanager/provider.go:38

Draft comment:
Consider buffering the syncC channel (e.g. make(chan struct{}, 1)) to prevent potential deadlock if no receiver is ready when ticker fires.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 30% vs. threshold = 50%
The code shows a goroutine writing to syncC when ticker fires, but we don't see the consumer side. The unbuffered channel could potentially block if no receiver is ready, but the goroutine pattern used here is common and safe - the channel is returned before the goroutine writes to it, so a consumer should be ready. The Stop() method also properly cleans up by stopping the ticker and closing the channel.
I might be underestimating the risk - if the consumer isn't listening continuously, messages could be missed. Also, what if multiple ticks occur before consumption?
The current design appears intentional - using an unbuffered channel ensures synchronization between producer and consumer. If messages need to be buffered, that should be a higher-level architectural decision, not a code review comment.
While the suggestion isn't wrong, the current unbuffered channel implementation appears to be a valid design choice with proper cleanup. The comment is more of a design suggestion than a clear issue.

3. pkg/types/alertmanagertypes/channel.go:55

Draft comment:
The reflection loop iterates over the receiver struct fields on every call. Consider caching reflection metadata to improve performance.
Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.

4. pkg/types/alertmanagertypes/config.go:161

Draft comment:
MD5 is being used for config hashing. Ensure this is acceptable given its use case (detecting changes) since MD5 is weak for security purposes.
Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.

5. pkg/alertmanager/service.go:79

Draft comment:
Empty server.Config is passed to server.New. Clarify if this is intentional or if the service-level config should be used to initialize the server.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50%
The code appears to follow a two-step initialization pattern: 1) Create server with empty config 2) Set the actual config via SetConfig(). This seems intentional since the config is retrieved separately and error handling is done for both steps. The service-level config is used differently - for creating default configs when needed.
I could be wrong about the two-step initialization being intentional - maybe it's an oversight. The service.config field's purpose isn't completely clear from this file alone.
The consistent error handling and separation of server creation from config setting strongly suggests this is an intentional pattern. If it wasn't intended, the code would likely try to set the config in one step.
The comment should be deleted as the empty config appears to be intentional design, with config being set in a separate step right after server creation.

6. pkg/types/alertmanagertypes/channel.go:56

Draft comment:
Reflection is used in NewChannelFromReceiver to inspect receiver fields. Consider potential performance implications if receivers have many fields or if this function is called frequently.
Reason this comment was not posted:
Confidence changes required: 50% <= threshold 50%
None

Workflow ID: wflow_U8qoyKc3fag4A4C6

Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

pkg/alertmanager/internalalertmanager/provider.go

ellipsis-dev

👍 Looks good to me! Incremental review on 00ddc81 in 1 minute and 18 seconds

More details

Looked at 30 lines of code in 1 files
Skipped 0 files when reviewing.
Skipped posting 4 drafted comments based on config settings.

1. pkg/sqlstore/sqlitesqlstore/provider.go:40

Draft comment:
Removed inline log query hook addition. Confirm that sqlstore.NewBunDB with hooks supplies equivalent logging functionality. This removal may affect query-level logging if not properly handled.
Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.

2. pkg/sqlstore/sqlitesqlstore/provider.go:3

Draft comment:
Removed unused import for slog. Ensure that any required logging functionality is provided through the settings.Logger instance passed in via configuration.
Reason this comment was not posted:
Comment looked like it was already resolved.

3. pkg/sqlstore/sqlitesqlstore/provider.go:44

Draft comment:
Removed explicit creation of bun.DB and the addition of a logging query hook (via sqlstorehook.NewLogging). Please ensure that sqlstore.NewBunDB properly registers all necessary hooks (including any logging/tracing) so that query diagnostics aren’t lost.
Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.

4. pkg/sqlstore/sqlitesqlstore/provider.go:3

Draft comment:
The imports for 'log/slog' and 'sqlstorehook' have been removed. Confirm that these were solely used for the logging hook functionality, and that the functionality is now fully handled via the hooks passed to sqlstore.NewBunDB.
Reason this comment was not posted:
Marked as duplicate.

Workflow ID: wflow_JyFOCkCbDFJDfur6

You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

ellipsis-dev

👍 Looks good to me! Incremental review on 831d81e in 2 minutes and 4 seconds

More details

Looked at 415 lines of code in 5 files
Skipped 0 files when reviewing.
Skipped posting 8 drafted comments based on config settings.

1. pkg/alertmanager/alertmanagerstore/sqlalertmanagerstore/state.go:49

Draft comment:
Consider logging or handling the error from tx.Rollback (even if using nolint) for completeness.
Reason this comment was not posted:
Confidence changes required: 33% <= threshold 50%
None

2. pkg/alertmanager/server/server.go:140

Draft comment:
Similar state persistence logic is duplicated in both silences and nflog maintenance routines. Consider refactoring common code to improve maintainability.
Reason this comment was not posted:
Comment was on unchanged code.

3. pkg/types/alertmanagertypes/state.go:52

Draft comment:
The switch in Set() uses base64.StdEncoding (and Get() too) while the previous version used RawStdEncoding. Ensure this encoding change is intentional and consistent across components.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50%
While encoding changes could be important, this comment is speculative without showing the previous code. It's asking for verification rather than pointing out a clear issue. It starts with "Ensure this..." which violates our rules. We can't verify if there's an actual problem without seeing the old code.
The encoding change could potentially cause compatibility issues with existing stored data. This might be a legitimate concern worth investigating.
However, we don't have evidence in the diff of the previous encoding type, and the comment is phrased as a verification request rather than pointing out a concrete issue. The author likely considered encoding compatibility when making these changes.
Delete the comment because it's speculative, asks for verification, and requires knowledge of code outside the diff to evaluate.

4. pkg/alertmanager/alertmanagerstore/sqlalertmanagerstore/state.go:51

Draft comment:
Using an upsert with a WHERE clause and updating both ‘silences’ and ‘nflog’ may lead to race conditions when concurrent maintenance routines update only one field. Consider splitting updates or merging changes to avoid clobbering concurrent updates.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50%
While the comment raises an interesting point about concurrent updates, the new code actually uses a transaction and updates both fields atomically in a single upsert. This is generally safer than updating fields separately. The old code was more prone to race conditions by handling states separately. The new approach seems like an improvement.
I might be overlooking specific business requirements where silences and nflog need to be updated independently. The comment author may have domain knowledge about maintenance routines I'm not aware of.
Even if there are maintenance routines, the atomic update within a transaction is a sound approach that prevents partial updates. If independent updates are needed, that should be a separate feature request with clear requirements.
The comment should be deleted as it suggests a potential problem that isn't actually present - the new code uses transactions and atomic updates which is good practice for preventing race conditions.

5. pkg/alertmanager/server/server.go:131

Draft comment:
The maintenance routines for silences and nflog both retrieve and update the same state concurrently. Consider centralizing or serializing state updates to prevent potential lost updates.
Reason this comment was not posted:
Marked as duplicate.

6. pkg/alertmanager/server/server.go:160

Draft comment:
Similar to the silences maintenance, the nflog maintenance routine performs a read-modify-write on the same state object. This duplication increases the risk of race conditions. Consider refactoring to update fields independently.
Reason this comment was not posted:
Marked as duplicate.

7. pkg/types/alertmanagertypes/state.go:54

Draft comment:
In the StoreableState.Set switch, no default case is provided. Consider adding a default branch to handle unknown state names, returning an error rather than silently doing nothing.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50%
The StateName type is controlled within this package and only has two possible values. The switch statement handles both cases. Adding a default case would handle a situation that should never occur in practice since StateName values can only be created within this package. The current approach of silently doing nothing for an invalid state is actually consistent with Go's zero-value behavior.
I could be wrong about the encapsulation - maybe there's a way for external code to create StateName values that I'm not seeing. Also, defensive programming might still be valuable even for "impossible" cases.
Looking at the code again, StateName's only field is unexported ('name string'), so external packages cannot create new instances. The two package variables are the only possible values.
The comment should be deleted. The switch statement is correct as-is since it handles all possible StateName values, and the type system prevents creation of invalid states.

8. pkg/types/alertmanagertypes/state.go:52

Draft comment:
Note: The implementation now uses base64.StdEncoding instead of RawStdEncoding. Ensure this encoding change is intentional and does not break backward compatibility with previously stored states.
Reason this comment was not posted:
Marked as duplicate.

Workflow ID: wflow_VdRdCPRcNhm3uH4K

You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

ellipsis-dev

👍 Looks good to me! Incremental review on 1c69201 in 1 minute and 33 seconds

More details

Looked at 168 lines of code in 3 files
Skipped 0 files when reviewing.
Skipped posting 9 drafted comments based on config settings.

1. pkg/alertmanager/alertmanager.go:16

Draft comment:
Interface now extends factory.Service. Ensure all required methods from factory.Service are implemented by service.
Reason this comment was not posted:
Confidence changes required: 0% <= threshold 50%
None

2. pkg/alertmanager/internalalertmanager/provider.go:41

Draft comment:
Start method runs a blocking ticker loop. Ensure this is intended; consider running it in a separate goroutine if non-blocking startup is desired.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50%
The blocking nature of Start() appears intentional based on the implementation. This is a common pattern for background services - Start() blocks and runs until Stop() is called via context cancellation. The previous implementation also blocked via the SyncFunc. The comment suggests running in a goroutine, but that would just push the responsibility of managing the goroutine to the caller.
I could be wrong about the intended usage pattern of this provider. Maybe callers expect Start() to return immediately.
The implementation follows a standard pattern for background services. If non-blocking behavior was needed, the previous implementation would have had issues too. The clean shutdown handling via context suggests this blocking behavior is intentional.
This comment should be deleted as it questions an intentional and common implementation pattern without strong evidence that it's problematic.

3. pkg/alertmanager/internalalertmanager/provider.go:27

Draft comment:
Provider settings are scoped correctly. Ensure the naming convention used here ('go.signoz.io/signoz/pkg/alertmanager/internalalertmanager') remains consistent across the project.
Reason this comment was not posted:
Confidence changes required: 0% <= threshold 50%
None

4. pkg/alertmanager/service.go:32

Draft comment:
Removal of SyncFunc usage in Service construction simplifies startup. Verify that the new pattern of triggering SyncServers externally meets your synchronization requirements.
Reason this comment was not posted:
Confidence changes required: 33% <= threshold 50%
None

5. pkg/alertmanager/alertmanager.go:16

Draft comment:
Embedded factory.Service was added to the Alertmanager interface. Ensure that the implementing provider correctly satisfies all methods required by factory.Service.
Reason this comment was not posted:
Confidence changes required: 33% <= threshold 50%
None

6. pkg/alertmanager/internalalertmanager/provider.go:41

Draft comment:
The Start method contains an infinite loop with a ticker. Make sure this is run in a separate goroutine so it does not block main execution.
Reason this comment was not posted:
Marked as duplicate.

7. pkg/alertmanager/internalalertmanager/provider.go:49

Draft comment:
SyncServers errors in the Start loop are currently only logged. Consider whether additional error handling (or metrics) is needed for repeated failures.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50%
The comment raises a valid concern about error handling robustness. However, it's speculative ("Consider whether...") and doesn't definitively state that additional error handling is required. The current error logging approach might be intentional and sufficient for the use case. Without more context about SyncServers' failure modes and system requirements, we can't be certain additional handling is needed.
The comment identifies a potential improvement in error handling, which could be important for system reliability. Not having metrics for sync failures could make it harder to monitor system health.
While the concern is valid, the comment is too speculative and doesn't provide clear direction. It asks the author to "consider" something rather than making a specific recommendation.
The comment should be removed because it's speculative and doesn't provide clear, actionable guidance. It violates the rule against asking authors to verify or consider things.

8. pkg/alertmanager/service.go:52

Draft comment:
In SyncServers, the mutex is manually unlocked after processing all orgIDs. Consider using 'defer' immediately after Lock to ensure the mutex is released even if an early return or panic occurs.
Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.

9. pkg/alertmanager/service.go:60

Draft comment:
Creation of new servers and calling SetConfig occurs while holding the mutex. If these operations are time-consuming, consider performing them outside the critical section to reduce lock contention.
Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.

Workflow ID: wflow_mIO4cCOFHWe94AHm

You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

github-actions bot added the enhancement New feature or request label Feb 17, 2025

ellipsis-dev bot reviewed Feb 17, 2025

View reviewed changes

pkg/alertmanager/internalalertmanager/provider.go Outdated Show resolved Hide resolved

grandwizard28 added 5 commits February 17, 2025 23:37

feat(alertmanager): add api

c59e913

feat(sqlstore): add logging hook

20b5e64

feat(service): add internal alertmanager

c7c3b33

feat(api): integrate with new authtypes package

f51cf94

ci(main): merge with main

00ddc81

grandwizard28 force-pushed the service/alertmanager branch from 7fd4e18 to 00ddc81 Compare February 17, 2025 18:09

ellipsis-dev bot reviewed Feb 17, 2025

View reviewed changes

feat(alertmanager): move to 3nf

831d81e

ellipsis-dev bot reviewed Feb 17, 2025

View reviewed changes

feat(alertmanager): complete internal provider

1c69201

ellipsis-dev bot reviewed Feb 17, 2025

View reviewed changes

grandwizard28 requested a review from nityanandagohain February 17, 2025 19:54

nityanandagohain approved these changes Feb 18, 2025

View reviewed changes

Merge branch 'main' into service/alertmanager

9b59c05

grandwizard28 enabled auto-merge (squash) February 18, 2025 07:29

grandwizard28 merged commit 918c894 into main Feb 18, 2025
16 of 17 checks passed

grandwizard28 deleted the service/alertmanager branch February 18, 2025 07:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(alertmanager): add service for alertmanager #7136

feat(alertmanager): add service for alertmanager #7136

grandwizard28 commented Feb 17, 2025 •

edited

Loading

ellipsis-dev bot left a comment

ellipsis-dev bot left a comment

ellipsis-dev bot left a comment

ellipsis-dev bot left a comment

feat(alertmanager): add service for alertmanager #7136

feat(alertmanager): add service for alertmanager #7136

Conversation

grandwizard28 commented Feb 17, 2025 • edited Loading

Summary

Related Issues / PR's

Screenshots

Affected Areas and Manually Tested Areas

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

grandwizard28 commented Feb 17, 2025 •

edited

Loading