MM-60651: Add Loki to collect app, agents, and proxy logs #818
Conversation
- Install Loki in the metrics instance.
- Add a new rule to the metrics instance security group accepting traffic from anywhere to port :3100, where Loki will listen for log data.
- Add a new datasource to Grafana so that we can query Loki from the unified UI.
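For context, wiring Loki into Grafana typically takes a small provisioning file like the one below (a minimal sketch; the file path and datasource name are assumptions, not necessarily what this PR uses):

```yaml
# Hypothetical /etc/grafana/provisioning/datasources/loki.yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    # Loki runs on the metrics instance itself, so localhost:3100 works here
    url: http://localhost:3100
```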
- Install the OpenTelemetry Collector (the -contrib distribution, which is the one that includes the filelog receiver we use) in every app node.
- Add a configuration file templated with the files to be collected, the name and ID of the service, the metrics instance IP, and the specific operator.
- Add an operator for the app and agent logs, which parses the JSON lines, extracting the timestamp and severity from the timestamp and level fields respectively.
- Configure the OpenTelemetry Collector using the configuration file template and the corresponding operator, updating the signature of the setup{App,MM,Job}Server functions to pass the instance name.
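For illustration, the rendered collector configuration could look roughly like this (a sketch only; the log path, the exporter choice, and the timestamp layout are assumptions, and the real template also injects the service name/ID):

```yaml
receivers:
  filelog:
    include:
      - /opt/mattermost/logs/mattermost.log   # hypothetical app log path
    operators:
      # Parse each JSON line and promote the timestamp and level fields
      - type: json_parser
        timestamp:
          parse_from: attributes.timestamp
          layout: '%Y-%m-%dT%H:%M:%S.%L%z'    # layout is an assumption
        severity:
          parse_from: attributes.level

exporters:
  otlphttp:
    # Push to Loki on the metrics instance (the IP is templated in)
    endpoint: http://METRICS_INSTANCE_IP:3100/otlp

service:
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [otlphttp]
```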
- Apply the same steps as in the previous commit, but this time collecting the agents' and coordinators' logs.
- Same as the previous two commits.
- Update the signature of the setupProxyServer function to receive the whole instance object, which gives access to the instance name.
- Collect the nginx error log file, which, unlike the app and agent logs, is not JSON but plain text; we parse it with a new regex-based operator that extracts the timestamp and severity from the specified capturing groups.
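The nginx-side operator could look something like this (a sketch; the actual regex and timestamp layout in the template may differ):

```yaml
operators:
  # nginx error lines look like: 2024/10/07 12:34:56 [error] 1234#0: message
  - type: regex_parser
    regex: '^(?P<time>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[(?P<severity>\w+)\] (?P<message>.*)$'
    timestamp:
      parse_from: attributes.time
      layout: '%Y/%m/%d %H:%M:%S'
    severity:
      parse_from: attributes.severity
```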
Add a new Logs overview row containing five panels:

1. A list of all the raw logs, pretty-printed for easily spotting patterns.
2. A count of log lines by level for the coordinator.
3. A count of log lines by level for the agents.
4. A count of log lines by level for the app nodes.
5. A count of log lines by level for the proxy nodes.

The log-count panels contain a hack: we need a dummy query (we use `sum(vector(0))`) for Grafana to show the whole selected time range. Otherwise, those panels automatically zoom to the smallest time range containing at least one log line. This solution is copied from https://community.grafana.com/t/empty-values-not-shown/76173/26
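Concretely, each log-count panel pairs two queries along these lines (shown here as a YAML sketch of the panel's targets; the label name is an assumption):

```yaml
# Hypothetical target pair for one of the log-count panels
targets:
  # LogQL: count of log lines per level for one instance type
  - expr: sum by (level) (count_over_time({service_name="app"} [$__auto]))
  # PromQL dummy query that pins the panel to the full selected time range
  - expr: sum(vector(0))
```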
Great work, can't wait to try it :)
deployment/terraform/agent.go
"IncludeFiles": strings.Join([]string{ | ||
"/home/ubuntu/mattermost-load-test-ng/ltagent.log", | ||
"/home/ubuntu/mattermost-load-test-ng/ltcoordinator.log", | ||
}, ", "), |
Do these logs (agent and coordinator) get interleaved or is there an easy way to filter one or the other? In other words, I'd expect a separate service name for coordinator's logs.
Good point! You can filter them by log name (the new panels use that filtering). Do you think that's enough or would you prefer to have a different service name altogether?
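For example, assuming the file name ends up as a label (the exact label name depends on how the collector and exporter map the file-name attribute), the two can be separated with LogQL selectors like:

```yaml
# Hypothetical LogQL selectors, one per binary
agent_logs: '{log_file_name="ltagent.log"}'
coordinator_logs: '{log_file_name="ltcoordinator.log"}'
```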
As long as we can easily discriminate through filenames, I think we are okay. However, on a conceptual level, they are two separate services in my mind, as they are different binaries when not using the API.
What happens if a log file gets rolled over to `file.log.1`? Is that handled as well? Not trying to increase the scope here, but a service name is a simpler abstraction to keep in mind.
What the receiver does is `tail` the specified file, so I don't think rolled logs are a problem. I'll see how easy it is to add another service for that; it makes sense logically.
```diff
@@ -572,6 +572,16 @@ resource "aws_security_group_rule" "metrics-pyroscope" {
   security_group_id = aws_security_group.metrics[0].id
 }
+
+resource "aws_security_group_rule" "metrics-loki" {
+  count = var.app_instance_count > 0 ? 1 : 0
```
Conceptually, I'd expect Loki to be needed as long as there's a metrics instance.
That makes sense, let me check that instead
```sh
# Install Loki
mkdir -p /etc/apt/keyrings/ && \
sudo bash -c 'wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor > /etc/apt/keyrings/grafana.gpg' && \
sudo bash -c 'echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | tee /etc/apt/sources.list.d/grafana.list' && \
sudo apt-get -y update && \
sudo apt-get install -y loki && \
```
Wondering if it would be best to follow what we did for other Grafana packages and install from the `.deb`, so we can pin the version and avoid surprises.
Good point, will take a look
"selected": false, | ||
"text": "agent-0:4000", | ||
"value": "agent-0:4000" |
I'd probably keep these empty, as before.
Sorry for the delay here. Will review tomorrow.
deployment/config.go
```diff
@@ -121,6 +121,8 @@ type Config struct {
 	// CustomTags is an optional list of key-value pairs, which will be used as default
 	// tags for all resources deployed
 	CustomTags TerraformMap
+	// Type of the EC2 instance for metrics.
+	MetricsInstanceType string `default:"t3.xlarge" validate:"notempty"`
```
Can we declare this slightly up, along with `AppInstanceType` and `AgentInstanceType`?
config/deployer.sample.json
```diff
@@ -147,5 +147,6 @@
     "BlockProfileRate": 0
   },
   "CustomTags": {
-  }
+  },
+  "MetricsInstanceType": "t3.xlarge"
```
Let's declare this up with the other instanceType declarations.
```sh
sudo sed -i 's/User=.*/User=ubuntu/g' /lib/systemd/system/otelcol-contrib.service && \
sudo sed -i 's/Group=.*/Group=ubuntu/g' /lib/systemd/system/otelcol-contrib.service && \
```
This looks brittle. Would it be worth having our own service file and just copy-pasting it over to the instance? 0/5
I'd say that having our own service file copied brings other forms of brittleness to the table, so I'd keep this as is.
Also, refactor the code to make it testable. And test it.
Thanks!
Summary
This PR adds Loki to the metrics instance in order to centralize log analysis. The actual collection of the logs from the app, agent, and proxy nodes happens via the OpenTelemetry Collector, specifically the filelog receiver, which is configured to retrieve the corresponding logs on each instance.
This PR also adds a new section to our default dashboard with panels listing all log lines and summarizing their count by level for each instance type, as shown in the screenshot (the proxy errors were forced by setting `worker_connections` to 500 😈).
Please review by commit; I've tried to isolate the changes and describe them in each commit message :)
Ticket Link
https://mattermost.atlassian.net/browse/MM-60651