-
Notifications
You must be signed in to change notification settings - Fork 43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MM-60020: Cloudwatch exporter #798
Conversation
This allows the metrics instance to access accounts on behalf of the AWS account that created it, with the permissions specified in the metrics_policy_document data block
A good resource on how to configure the delay, period and length: prometheus-community/yet-another-cloudwatch-exporter#865 (comment)
0b9d31e
to
ffdd6f5
Compare
@@ -198,17 +262,21 @@ EOF | |||
resource "aws_elasticache_cluster" "redis_server" { | |||
cluster_id = "${var.cluster_name}-redis" | |||
engine = "redis" | |||
node_type = "${var.redis_node_type}" | |||
node_type = var.redis_node_type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing my mistakes 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was actually my editor 😂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you emacs 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! Left a non-blocking comment for your consideration.
deployment/terraform/strings.go
Outdated
- type: AWS/EC2 | ||
regions: | ||
- us-east-1 | ||
period: {{.Period}} | ||
length: {{.Length}} | ||
delay: {{.Delay}} | ||
addCloudwatchTimestamp: true | ||
searchTags: | ||
- key: ClusterName | ||
value: {{.ClusterName}} | ||
metrics: | ||
- name: CPUUtilization | ||
statistics: [Average] | ||
- name: NetworkIn | ||
statistics: [Average, Sum] | ||
- name: NetworkOut | ||
statistics: [Average, Sum] | ||
- name: NetworkPacketsIn | ||
statistics: [Sum] | ||
- name: NetworkPacketsOut | ||
statistics: [Sum] | ||
- name: DiskReadBytes | ||
statistics: [Sum] | ||
- name: DiskWriteBytes | ||
statistics: [Sum] | ||
- name: DiskReadOps | ||
statistics: [Sum] | ||
- name: DiskWriteOps | ||
statistics: [Sum] | ||
- name: StatusCheckFailed | ||
statistics: [Sum] | ||
- name: StatusCheckFailed_Instance | ||
statistics: [Sum] | ||
- name: StatusCheckFailed_System | ||
statistics: [Sum] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need the EC2 metrics. 0/5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I added those when I was testing the whole thing and did not remove them even though I'm not using them in any of the panels. The only interesting ones may be the StatusCheckFailed
ones to monitor if an instance is down. Let me see if I can add a panel with that info and I can remove the other ones.
Set nilToZero to true as well to get metrics for all instances, since 0 is the success value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what I can add, looks good. :)
resource "aws_iam_instance_profile" "metrics_profile" { | ||
name = "metrics_profile" | ||
role = aws_iam_role.metrics_role.name | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@agarciamontoro Heads up this seems to be causing issues on my side, apparently conflicting with other deployments. Getting the following error when deploying:
info [2024-09-25 15:49:43.890 -06:00] │ Error: creating IAM Instance Profile (metrics_profile): operation error IAM: CreateInstanceProfile, https response error StatusCode: 409, RequestID: 5a5d50ea-fa2d-411b-b185-2009a4975393, EntityAlreadyExists: Instance Profile metrics_profile already exists. caller="terraform/engine.go:97"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course! 🤦 Thank you for catching this!
Summary
This PR adds a new service to the metrics instance: YACE, an exporter for retrieving Cloudwatch metrics into Prometheus. I needed to make a couple of adjustments to make this work:
ClusterName
tag to all resources, since YACE only discovers tagged resources. This will also help us analyze the cost of individual resources and whole clusters after we run large load-testsName
tag to RDS clusters and instancesdashboard_data.json
file into a template file, and renamed it todefault_dashboard_tmpl.json
A couple of important issues to note with Cloudwatch metrics:
GetMetricData
).Considering these two points, YACE is right now configured to get a single data point every 5 minutes, with a 5 minute delay. That is: YACE scrapes Cloudwatch at times T, T+5m, T+10m... and, respectively, get data from T-5m, T, T+5m... This is all configured with the
Period
,Length
andDelay
options, which are all set to 300 seconds. This issue comment is very helpful to understand how that works. We can tune this in the future, probably having more resolution, but I wanted to be conservative at the beginning.Snapshots
New section and panels in default dashboard
New section and panels in Elasticsearch dashboard
New panel at the very beginning for instance health checks
Ticket Link
https://mattermost.atlassian.net/browse/MM-60020