Description
Currently there's one table, `configs`, keyed by (`id`, `orgID`), where `id` is a version number. The value is a JSON blob which has both the alertmanager config and the rules. Further, although there are separate endpoints, the implementations behind these endpoints are identical.
Instead, the config systems for alertmanager and ruler should be fully disjoint.
This means we should have:
- one table with rules
- one table with alertmanager configs
- endpoints for alertmanager (that have nothing to do with ruler) for
  - getting a config
  - setting a config
  - getting all configs
  - getting all configs since a version
- likewise endpoints for ruler configs
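
To make the four alertmanager operations concrete, here is a rough sketch of what a dedicated store interface could look like, with a toy in-memory implementation. All names here (`AlertmanagerConfigStore`, `VersionedConfig`, `memAlertmanagerStore`) are made up for illustration and are not existing Cortex types. It also assumes "all configs" means the latest config per org, and "since a version" means latest configs with an `id` greater than the one given. The ruler would get its own, separate interface along the same lines.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// ErrNoConfig is returned when an org has no config stored yet.
var ErrNoConfig = errors.New("no config for org")

// VersionedConfig is a hypothetical row shape: a raw config plus its version.
type VersionedConfig struct {
	ID     int    // version number, increases with every write
	OrgID  string // owning organisation
	Config string // raw alertmanager config; clients still have to parse it
}

// AlertmanagerConfigStore is a hypothetical interface covering the four
// operations listed above. It knows nothing about rules.
type AlertmanagerConfigStore interface {
	GetConfig(orgID string) (VersionedConfig, error)
	SetConfig(orgID, config string) (VersionedConfig, error)
	GetAllConfigs() ([]VersionedConfig, error)
	GetConfigsSince(id int) ([]VersionedConfig, error)
}

// memAlertmanagerStore is a toy in-memory implementation (not goroutine-safe).
type memAlertmanagerStore struct {
	lastID int
	latest map[string]VersionedConfig // latest config per org
}

func newMemAlertmanagerStore() *memAlertmanagerStore {
	return &memAlertmanagerStore{latest: map[string]VersionedConfig{}}
}

func (s *memAlertmanagerStore) GetConfig(orgID string) (VersionedConfig, error) {
	c, ok := s.latest[orgID]
	if !ok {
		return VersionedConfig{}, ErrNoConfig
	}
	return c, nil
}

func (s *memAlertmanagerStore) SetConfig(orgID, config string) (VersionedConfig, error) {
	s.lastID++
	c := VersionedConfig{ID: s.lastID, OrgID: orgID, Config: config}
	s.latest[orgID] = c
	return c, nil
}

func (s *memAlertmanagerStore) GetAllConfigs() ([]VersionedConfig, error) {
	return s.GetConfigsSince(0)
}

// GetConfigsSince returns the latest config of every org whose version is
// strictly greater than since, ordered by version.
func (s *memAlertmanagerStore) GetConfigsSince(since int) ([]VersionedConfig, error) {
	out := []VersionedConfig{}
	for _, c := range s.latest {
		if c.ID > since {
			out = append(out, c)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].ID < out[j].ID })
	return out, nil
}

func main() {
	var store AlertmanagerConfigStore = newMemAlertmanagerStore()
	store.SetConfig("org-a", "route: {receiver: a}")
	store.SetConfig("org-b", "route: {receiver: b}")
	store.SetConfig("org-a", "route: {receiver: a2}")

	// Only org-a changed after version 2, so a poller sees just that one.
	recent, _ := store.GetConfigsSince(2)
	for _, c := range recent {
		fmt.Println(c.OrgID, c.ID, c.Config)
	}
}
```

The "since a version" call is what lets a poller fetch only the orgs that changed since its last poll, instead of every config each time.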
We should also separate the client libraries, embedding them in their respective call-sites (move ruler config client to ruler, alertmanager config client to alertmanager).
This will need to be multiple PRs in order to handle backwards compatibility between UI <-> configs <-> DB.
At a guess:
1. New endpoints
   - create new endpoints with desired behaviour
   - update clients in cortex to use & rely on new behaviour
   - still write to existing conflated DB structure
   - old endpoints still work as intended
2. Update Weave Cloud UI to use new endpoints
3. Remove old endpoints
4. New tables
   - create new tables
   - write to new tables
   - read from new tables and fall back to old
5. Migrate old data
   - wait until data is only being written to new tables
   - migrate old data to new tables
6. Remove old tables
   - remove old tables
   - remove fallback code
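
The "read from new tables and fall back to old" step could look roughly like the sketch below, shown for the alertmanager side only. Everything in it is an assumption for illustration: the `lookup` function type stands in for a real DB query, `tableFromMap` fakes a table with a map, and the JSON field names in `legacyBlob` are a guess at the shape of the existing conflated blob.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// ErrNotFound signals that a table has no row for the org.
var ErrNotFound = errors.New("config not found")

// legacyBlob models the conflated JSON value in the old table.
// The field names are assumptions, not taken from the real schema.
type legacyBlob struct {
	AlertmanagerConfig string            `json:"alertmanager_config"`
	RulesFiles         map[string]string `json:"rules_files"`
}

// lookup fetches the raw stored value for an org from one table.
type lookup func(orgID string) (string, error)

// getAlertmanagerConfig reads the new table first and only falls back to
// the old conflated blob when the new table has no row for the org.
func getAlertmanagerConfig(orgID string, newTable, oldTable lookup) (string, error) {
	cfg, err := newTable(orgID)
	if err == nil {
		return cfg, nil
	}
	if !errors.Is(err, ErrNotFound) {
		// A real failure: don't paper over it with possibly stale old data.
		return "", err
	}
	raw, err := oldTable(orgID)
	if err != nil {
		return "", err
	}
	var blob legacyBlob
	if err := json.Unmarshal([]byte(raw), &blob); err != nil {
		return "", fmt.Errorf("decoding legacy config for %s: %w", orgID, err)
	}
	return blob.AlertmanagerConfig, nil
}

// tableFromMap fakes a table with an in-memory map, for demonstration only.
func tableFromMap(rows map[string]string) lookup {
	return func(orgID string) (string, error) {
		v, ok := rows[orgID]
		if !ok {
			return "", ErrNotFound
		}
		return v, nil
	}
}

func main() {
	newTable := tableFromMap(map[string]string{
		"org-new": "route: {receiver: new}",
	})
	oldTable := tableFromMap(map[string]string{
		"org-old": `{"alertmanager_config": "route: {receiver: old}", "rules_files": {}}`,
	})
	for _, org := range []string{"org-new", "org-old", "org-missing"} {
		cfg, err := getAlertmanagerConfig(org, newTable, oldTable)
		fmt.Printf("%s: %q err=%v\n", org, cfg, err)
	}
}
```

Once the old data has been migrated, removing the fallback is just deleting the second half of `getAlertmanagerConfig`, which keeps the final cleanup PR small.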
LMK if that needs refining (can we do it in fewer steps? would even these steps have backwards compat problems?).
I'd kind of like to kill off the configs service completely by moving the endpoints to alertmanager and ruler respectively, and having both of those services connect directly to the database. The most convenient place to make that decision is in PR 1 above, since we'll have to update UI endpoints. It will also require more PRs for juggling flags to our deployed versions.
Other stuff that we might want to consider:
- use gRPC rather than REST
  - not a massive advantage, as we would still have to re-parse configs at client side, because the configs will still be strings
  - since we'll want new endpoints, might as well have new endpoints with better protocol
  - probably don't want to use gRPC if we're moving config endpoints to ruler & alertmanager, because we'll need to talk to them from JS anyway
- could theoretically split DBs, or even move to different DB backend (e.g. DynamoDB)
  - not under this GH issue