Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MM-60020: Cloudwatch exporter #798

Merged
merged 20 commits into from
Sep 23, 2024
Merged

Conversation

agarciamontoro
Copy link
Member

@agarciamontoro agarciamontoro commented Sep 12, 2024

Summary

This PR adds a new service to the metrics instance: YACE, an exporter for retrieving Cloudwatch metrics into Prometheus. I needed to make a couple of adjustments to make this work:

  • I added a new ClusterName tag to all resources, since YACE only discovers tagged resources. This will also help us analyze the cost of individual resources and whole clusters after we run large load-tests
  • I also added a missing Name tag to RDS clusters and instances
  • I added a couple of new panels in the default and the Elasticsearch dashboards (see screenshots below)
  • I converted the old dashboard_data.json file into a template file, and renamed it to default_dashboard_tmpl.json

A couple of important issues to note with Cloudwatch metrics:

  • These metrics are inherently delayed due to how AWS export them.
  • Scraping the Cloudwatch API is not free: $0.01 per 1,000 metrics requested (see the Metrics tab in this doc, the entry for GetMetricData).

Considering these two points, YACE is right now configured to get a single data point every 5 minutes, with a 5 minute delay. That is: YACE scrapes Cloudwatch at times T, T+5m, T+10m... and, respectively, get data from T-5m, T, T+5m... This is all configured with the Period, Length and Delay options, which are all set to 300 seconds. This issue comment is very helpful to understand how that works. We can tune this in the future, probably having more resolution, but I wanted to be conservative at the beginning.

Snapshots

New section and panels in default dashboard

image

New section and panels in Elasticsearch dashboard

image

New panel at the very beginning for instance health checks

image

Ticket Link

https://mattermost.atlassian.net/browse/MM-60020

@@ -198,17 +262,21 @@ EOF
resource "aws_elasticache_cluster" "redis_server" {
cluster_id = "${var.cluster_name}-redis"
engine = "redis"
node_type = "${var.redis_node_type}"
node_type = var.redis_node_type
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing my mistakes 🙏

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was actually my editor 😂

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you emacs 🙏

Copy link
Member

@agnivade agnivade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! Left a non-blocking comment for your consideration.

Comment on lines 411 to 445
- type: AWS/EC2
regions:
- us-east-1
period: {{.Period}}
length: {{.Length}}
delay: {{.Delay}}
addCloudwatchTimestamp: true
searchTags:
- key: ClusterName
value: {{.ClusterName}}
metrics:
- name: CPUUtilization
statistics: [Average]
- name: NetworkIn
statistics: [Average, Sum]
- name: NetworkOut
statistics: [Average, Sum]
- name: NetworkPacketsIn
statistics: [Sum]
- name: NetworkPacketsOut
statistics: [Sum]
- name: DiskReadBytes
statistics: [Sum]
- name: DiskWriteBytes
statistics: [Sum]
- name: DiskReadOps
statistics: [Sum]
- name: DiskWriteOps
statistics: [Sum]
- name: StatusCheckFailed
statistics: [Sum]
- name: StatusCheckFailed_Instance
statistics: [Sum]
- name: StatusCheckFailed_System
statistics: [Sum]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need the EC2 metrics. 0/5

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I added those when I was testing the whole thing and did not remove them even though I'm not using them in any of the panels. The only interesting ones may be the StatusCheckFailed ones to monitor if an instance is down. Let me see if I can add a panel with that info and I can remove the other ones.

@agarciamontoro
Copy link
Member Author

agarciamontoro commented Sep 16, 2024

Added a new panel plotting the health check of all instances:
image

Copy link
Member

@cpoile cpoile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what I can add, looks good. :)

@agarciamontoro agarciamontoro merged commit 3afe53d into master Sep 23, 2024
1 check passed
@agarciamontoro agarciamontoro deleted the MM-60020.cloudwatch.exporter branch September 23, 2024 13:44
Comment on lines +104 to +107
resource "aws_iam_instance_profile" "metrics_profile" {
name = "metrics_profile"
role = aws_iam_role.metrics_role.name
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@agarciamontoro Heads up this seems to be causing issues on my side, apparently conflicting with other deployments. Getting the following error when deploying:

info  [2024-09-25 15:49:43.890 -06:00] │ Error: creating IAM Instance Profile (metrics_profile): operation error IAM: CreateInstanceProfile, https response error StatusCode: 409, RequestID: 5a5d50ea-fa2d-411b-b185-2009a4975393, EntityAlreadyExists: Instance Profile metrics_profile already exists. caller="terraform/engine.go:97"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course! 🤦 Thank you for catching this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants