MM-60651: Add Loki to collect app, agents, and proxy logs #818
Conversation
- Install Loki in the metrics instance.
- Add a new rule to the metrics instance security group accepting traffic from anywhere to port :3100, where Loki will listen for log data.
- Add a new datasource to Grafana so that we can query Loki from the unified UI.
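For context, wiring Loki into Grafana typically takes a small provisioning file like the one below (a minimal sketch; the file path and datasource name are assumptions, not necessarily what this PR uses):

```yaml
# Hypothetical /etc/grafana/provisioning/datasources/loki.yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    # Loki runs on the metrics instance itself, so localhost:3100 works here
    url: http://localhost:3100
```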
- Install the OpenTelemetry Collector (the -contrib distribution, which is the one that includes the filelog receiver we use) in every app node.
- Add a configuration file templated with the files to be collected, the name and ID of the service, the metrics instance IP, and the specific operator.
- Add an operator for the app and agent logs, which parses the JSON lines, extracting the timestamp and severity from the timestamp and level fields respectively.
- Configure the OpenTelemetry Collector using the configuration file template and the corresponding operator, updating the signature of the setup{App,MM,Job}Server functions to pass the instance name.
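For illustration, the rendered collector configuration could look roughly like this (a sketch only; the log path, the exporter choice, and the timestamp layout are assumptions, and the real template also injects the service name/ID):

```yaml
receivers:
  filelog:
    include:
      - /opt/mattermost/logs/mattermost.log   # hypothetical app log path
    operators:
      # Parse each JSON line and promote the timestamp and level fields
      - type: json_parser
        timestamp:
          parse_from: attributes.timestamp
          layout: '%Y-%m-%dT%H:%M:%S.%L%z'    # layout is an assumption
        severity:
          parse_from: attributes.level

exporters:
  otlphttp:
    # Push to Loki on the metrics instance (the IP is templated in)
    endpoint: http://METRICS_INSTANCE_IP:3100/otlp

service:
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [otlphttp]
```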
- Apply the same steps as in the previous commit, but this time collecting the agents' and coordinators' logs.
- Same as the previous two commits.
- Update the signature of the setupProxyServer function to receive the whole instance object, which gives access to the instance name.
- Collect the nginx error log file, which, unlike the app and agent logs, is not JSON but plain text; we parse it with a new regex-based operator that extracts the timestamp and severity from the specified capturing groups.
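The nginx-side operator could look something like this (a sketch; the actual regex and timestamp layout in the template may differ):

```yaml
operators:
  # nginx error lines look like: 2024/10/07 12:34:56 [error] 1234#0: message
  - type: regex_parser
    regex: '^(?P<time>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[(?P<severity>\w+)\] (?P<message>.*)$'
    timestamp:
      parse_from: attributes.time
      layout: '%Y/%m/%d %H:%M:%S'
    severity:
      parse_from: attributes.severity
```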
Add a new Logs overview row containing five panels:

1. A list of all the raw logs, pretty-printed for easily spotting patterns.
2. A count of log lines by level for the coordinator.
3. A count of log lines by level for the agents.
4. A count of log lines by level for the app nodes.
5. A count of log lines by level for the proxy nodes.

The log-count panels contain a hack: we need a dummy query (we use `sum(vector(0))`) for Grafana to show the whole selected time range. Otherwise, those panels automatically zoom to the smallest time range containing at least one log line. This solution is copied from https://community.grafana.com/t/empty-values-not-shown/76173/26
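Concretely, each log-count panel pairs two queries along these lines (shown here as a YAML sketch of the panel's targets; the label name is an assumption):

```yaml
# Hypothetical target pair for one of the log-count panels
targets:
  # LogQL: count of log lines per level for one instance type
  - expr: sum by (level) (count_over_time({service_name="app"} [$__auto]))
  # PromQL dummy query that pins the panel to the full selected time range
  - expr: sum(vector(0))
```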
Great work, can't wait to try it :)
deployment/terraform/agent.go
"IncludeFiles": strings.Join([]string{ | ||
"/home/ubuntu/mattermost-load-test-ng/ltagent.log", | ||
"/home/ubuntu/mattermost-load-test-ng/ltcoordinator.log", | ||
}, ", "), |
Do these logs (agent and coordinator) get interleaved or is there an easy way to filter one or the other? In other words, I'd expect a separate service name for coordinator's logs.
Good point! You can filter them by log name (the new panels use that filtering). Do you think that's enough or would you prefer to have a different service name altogether?
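For example, assuming the file name ends up as a label (the exact label name depends on how the collector and exporter map the file-name attribute), the two can be separated with LogQL selectors like:

```yaml
# Hypothetical LogQL selectors, one per binary
agent_logs: '{log_file_name="ltagent.log"}'
coordinator_logs: '{log_file_name="ltcoordinator.log"}'
```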
As long as we can easily discriminate through filenames, I think we are okay. However, on a conceptual level, they are two separate services in my mind, as they are different binaries when not using the API.
What happens if a log file gets rolled over to `file.log.1`? Is that handled as well? Not trying to increase the scope here, but a service name is a simpler abstraction to keep in mind.
What the receiver does is `tail` the specified file, so I don't think rolled logs are a problem. I'll see how easy it is to add another service for that; it makes sense logically.
```diff
@@ -572,6 +572,16 @@ resource "aws_security_group_rule" "metrics-pyroscope" {
   security_group_id = aws_security_group.metrics[0].id
 }
+
+resource "aws_security_group_rule" "metrics-loki" {
+  count = var.app_instance_count > 0 ? 1 : 0
```
Conceptually, I'd expect Loki to be needed as long as there's a metrics instance.
That makes sense, let me check that instead
```sh
# Install Loki
mkdir -p /etc/apt/keyrings/ && \
sudo bash -c 'wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor > /etc/apt/keyrings/grafana.gpg' && \
sudo bash -c 'echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | tee /etc/apt/sources.list.d/grafana.list' && \
sudo apt-get -y update && \
sudo apt-get install -y loki && \
```
Wondering if it would be best to follow what we did for other Grafana packages and install from the `.deb`, so we can pin the version and avoid surprises.
Good point, will take a look
"selected": false, | ||
"text": "agent-0:4000", | ||
"value": "agent-0:4000" |
I'd probably keep these empty, as before.
Sorry for the delay here. Will review tomorrow.
deployment/config.go
```diff
@@ -121,6 +121,8 @@ type Config struct {
 	// CustomTags is an optional list of key-value pairs, which will be used as default
 	// tags for all resources deployed
 	CustomTags TerraformMap
+	// Type of the EC2 instance for metrics.
+	MetricsInstanceType string `default:"t3.xlarge" validate:"notempty"`
```
Can we declare this slightly up, along with `AppInstanceType` and `AgentInstanceType`?
config/deployer.sample.json
```diff
@@ -147,5 +147,6 @@
     "BlockProfileRate": 0
   },
   "CustomTags": {
-  }
+  },
+  "MetricsInstanceType": "t3.xlarge"
```
Let's declare this up with the other instanceType declarations.
```sh
sudo sed -i 's/User=.*/User=ubuntu/g' /lib/systemd/system/otelcol-contrib.service && \
sudo sed -i 's/Group=.*/Group=ubuntu/g' /lib/systemd/system/otelcol-contrib.service && \
```
This looks brittle. Would it be worth having our own service file and just copy-pasting it over to the instance? 0/5
I'd say that having our own service file copied brings other forms of brittleness to the table, so I'd keep this as is.
Also, refactor the code to make it testable. And test it.
Thanks!
Summary
This PR adds Loki to the metrics instance in order to centralize log analysis. The actual collection of the logs from the app, agent, and proxy nodes happens via the OpenTelemetry Collector, specifically the filelog receiver, which is configured to retrieve the corresponding logs on each instance.
This PR also adds a new section to our default dashboard with panels listing all log lines and summarizing their count by level for each instance type, as shown in the screenshot (the proxy errors were forced by setting `worker_connections` to 500 😈).
Please review by commit; I've tried to isolate the changes and describe them in each commit message :)
Ticket Link
https://mattermost.atlassian.net/browse/MM-60651