Fails for cloudformation stacks with large number of alarms #207

Closed
tomhaigh opened this issue Oct 10, 2018 · 3 comments

@tomhaigh
Contributor

tomhaigh commented Oct 10, 2018

AWS imposes hard limits on the maximum CloudFormation template size and on the number of resources per stack.
Some of our alerting config groups now contain enough alarms that we are starting to hit these limits and see errors. Manually sharding the alarms across multiple config groups is possible, but it is a pain.

Proposed solution

  • Add a configuration setting at the alerting group level, such as NumberOfCloudFormationStacks, defaulting to 1.
  • Use a deterministic method to bucket the alarms across stacks, e.g. (alarm name checksum) % NumberOfCloudFormationStacks.
  • Deploy N stacks with the alarms spread across them as above, named e.g. aws-watchman-[alerting group name]-[stack number]. Perhaps the first stack could keep the existing name, to stay compatible with stacks already deployed, with only the subsequent stacks numbered?
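
To make the bucketing concrete, here is a minimal Python sketch of the idea. It is only an illustration, not Watchman's implementation: CRC32 stands in for "checksum", the function names are hypothetical, and the naming scheme (first stack keeps the base name, later stacks get a numeric suffix) is just one reading of the last bullet.

```python
import zlib


def stack_index(alarm_name: str, number_of_stacks: int) -> int:
    """Map an alarm name to a stack index in [0, number_of_stacks)."""
    if number_of_stacks < 1:
        raise ValueError("number_of_stacks must be at least 1")
    # CRC32 is stable across processes and machines, unlike Python's
    # built-in hash(), which is randomised per process for strings.
    return zlib.crc32(alarm_name.encode("utf-8")) % number_of_stacks


def stack_name(group_name: str, index: int) -> str:
    """Assumed naming: stack 0 keeps the existing name, others are numbered."""
    base = f"aws-watchman-{group_name}"
    return base if index == 0 else f"{base}-{index}"


def shard_alarms(group_name, alarm_names, number_of_stacks):
    """Return {stack name: [alarm names]} covering every stack, even empty ones."""
    stacks = {stack_name(group_name, i): [] for i in range(number_of_stacks)}
    for name in sorted(alarm_names):
        index = stack_index(name, number_of_stacks)
        stacks[stack_name(group_name, index)].append(name)
    return stacks
```

With the default of 1, every alarm lands in the single stack that keeps the existing name, so groups that don't set the option are unaffected.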

I've thought a bit about how we could automate this (i.e. work out the number of stacks automatically), but it's difficult because the number of alarms can go up and down. We would need some extra state to make sure we didn't leave orphaned CloudFormation stacks behind (maybe just listing the stacks would be enough). That seems complicated, and the solution above seems like a reasonable start.

@tomhaigh tomhaigh added the bug label Oct 10, 2018
@stephenthrelfall
Contributor

The main question I have is what happens if you switch between numbers of stacks. Would you tear down any previous stacks and recreate everything? Otherwise you could find some alarms moving between stacks and being duplicated or orphaned.

@tomhaigh
Contributor Author

If you increased the number of stacks, everything would just work. For example, if you go from 1 to 2, then about half the alarms would be put into the 2nd stack, and CloudFormation would handle deleting them from the 1st once they were gone from its template.
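
As a rough illustration of the 1 to 2 case, the Python sketch below works out which alarms change stack when the count changes. It assumes CRC32 as the checksum and zero-based stack indices; both the function names and those choices are hypothetical.

```python
import zlib


def bucket(alarm_name: str, number_of_stacks: int) -> int:
    """Deterministic stack index for an alarm (CRC32 is an assumed checksum)."""
    return zlib.crc32(alarm_name.encode("utf-8")) % number_of_stacks


def moved_alarms(alarm_names, old_count: int, new_count: int):
    """Return {alarm name: (old index, new index)} for alarms that change stack."""
    moves = {}
    for name in alarm_names:
        before = bucket(name, old_count)
        after = bucket(name, new_count)
        if before != after:
            moves[name] = (before, after)
    return moves
```

Each alarm maps to exactly one index for a given count, so no alarm ends up in two templates at once. Note that with plain modulo bucketing, a change such as 2 to 3 can move a large share of the alarms, not just the ones destined for the new stack.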

If you decreased the number of stacks, it would still work (everything would get squashed into fewer stacks), but you would need to manually delete the stacks above the new NumberOfCloudFormationStacks value.
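
For what it's worth, finding the leftover stacks could be a simple filter over a listing of stack names. The Python sketch below assumes a naming scheme where the first stack keeps the base name and stack i (for i ≥ 1) is named base-i; the function name and that scheme are hypothetical, and the list of existing names is taken as input rather than fetched from AWS.

```python
import re


def orphaned_stacks(existing_stack_names, base_name: str, number_of_stacks: int):
    """Return numbered stacks for base_name whose index is >= number_of_stacks.

    Results are ordered by stack index, not alphabetically.
    """
    pattern = re.compile(re.escape(base_name) + r"-(\d+)")
    found = []
    for name in existing_stack_names:
        match = pattern.fullmatch(name)
        if match and int(match.group(1)) >= number_of_stacks:
            found.append((int(match.group(1)), name))
    return [name for _, name in sorted(found)]
```

One caveat with this naming assumption: an alerting group whose own name ends in a number (e.g. a group called web-2) would be indistinguishable from the numbered stack of another group, so a real implementation would probably need a less ambiguous suffix or a tag on the stack.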

@stephenthrelfall
Contributor

Ah yes, that makes sense. In that case this change is probably more straightforward than I thought.
