Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MM-60651: Add Loki to collect app, agents, and proxy logs #818

Merged
merged 17 commits into from
Oct 10, 2024

Conversation

agarciamontoro
Copy link
Member

Summary

This PR adds Loki to the metrics instance in order to centralize log analysis. The actual collection of the logs from the app, agent and proxy nodes happen using the OpenTelemetry collector, specifically the filelog receiver, that is configured to retrieve the corresponding logs in each instance.

This PR also adds a new section to our default dashboard with panels listing all log lines and summarizing their count in each instance type by level, as shown in the screenshot (the proxy errors were forced by setting worker_connections to 500 😈).

Please, review by commit, I've tried to isolate the changes and describe them in each commit message :)

image

Ticket Link

https://mattermost.atlassian.net/browse/MM-60651

- Install Loki in the metrics instance
- Add a new rule to the metrics instance security group accepting
  traffic from anywhere to port :3100, where Loki will listen for logs
  data.
- Add a new datasource to Grafana so that we can query Loki from the
  unified UI.
- Install the OpenTelemtry collector (the -contrib version, which is the
  one including the filelog receiver we use) in every app node.
- Add a configuration file templated with the files to be collected, the
  name and ID of the service, the metrics instance IP, and the specific
  operator.
- Add an operator for the app and agent logs, which parses the JSON
  lines, extracting the timestamp and severity from the timestamp and
  level fields respectively.
- Configure the OpenTelemetry collector using the configuration file
  template and corresponding operator, updating the signature of the
  setup{App,MM,Job}Server functions to pass the instance name.
- Apply the same steps as in the previous commit, but this time
  collecting the agents' and coordinators' logs.
- Same as the previous two commits.
- Update the signature of setupProxyServer function to receive the whole
  instance object, which gives access to the instance name.
- Collect the nginx error log file, which is no longer JSON as the app
  and agent logs, but a plain-text line that we parse using a new
  regex-based operator, which parses the timestamp and severity as well
  from the specified capturing groups.
Add a new Logs overview row containing five panels:
1. A list of all the raw logs, pretty-printed for easily spotting
   patterns.
2. A count of log lines by level for the coordinator.
3. A count of log lines by level for the agents.
4. A count of log lines by level for the app nodes.
5. A count of log lines by level for the proxy nodes.

The log-count panels contain a hack: we need a dummy query (we use
`sum(vector(0))`) for Grafana to show the whole time range selected.
Otherwise, those panels automatically zoom to the smallest time range
where there is at least one log line. This solution is copied from
https://community.grafana.com/t/empty-values-not-shown/76173/26
@agarciamontoro agarciamontoro added the 2: Dev Review Requires review by a core committer label Oct 4, 2024
Copy link
Contributor

@streamer45 streamer45 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, can't wait to try it :)

Comment on lines 143 to 146
"IncludeFiles": strings.Join([]string{
"/home/ubuntu/mattermost-load-test-ng/ltagent.log",
"/home/ubuntu/mattermost-load-test-ng/ltcoordinator.log",
}, ", "),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these logs (agent and coordinator) get interleaved or is there an easy way to filter one or the other? In other words, I'd expect a separate service name for coordinator's logs.

Copy link
Member Author

@agarciamontoro agarciamontoro Oct 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! You can filter them by log name (the new panels use that filtering). Do you think that's enough or would you prefer to have a different service name altogether?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as we can easily discriminate through filenames, I think we are okay. However, on a conceptual level, they are two separate services in my mind, as they are different binaries when not using the API.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if a log file gets rolled over to file.log.1? Is that handled as well? Not trying to increase the scope here, but a service name is a simpler abstraction to keep in mind.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What the receiver does is tailing the file specified, so I don't think rolled logs is a problem. I'll see how easy it is to add another service to that, it makes sense logically

@@ -572,6 +572,16 @@ resource "aws_security_group_rule" "metrics-pyroscope" {
security_group_id = aws_security_group.metrics[0].id
}

resource "aws_security_group_rule" "metrics-loki" {
count = var.app_instance_count > 0 ? 1 : 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conceptually, I'd expect Loki to be needed as long as there's a metrics instance.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, let me check that instead

Comment on lines 44 to 49
# Install Loki
mkdir -p /etc/apt/keyrings/ && \
sudo bash -c 'wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor > /etc/apt/keyrings/grafana.gpg' && \
sudo bash -c 'echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | tee /etc/apt/sources.list.d/grafana.list' && \
sudo apt-get -y update && \
sudo apt-get install -y loki && \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if it would be best to follow what we did for other Grafana packages and install from the .deb so we can pin the version and avoid surprises.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, will take a look

Comment on lines 3853 to 3855
"selected": false,
"text": "agent-0:4000",
"value": "agent-0:4000"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd probably keep these empty, as before.

@agnivade
Copy link
Member

agnivade commented Oct 7, 2024

Sorry for the delay here. Will review tomorrow.

@@ -121,6 +121,8 @@ type Config struct {
// CustomTags is an optional list of key-value pairs, which will be used as default
// tags for all resources deployed
CustomTags TerraformMap
// Type of the EC2 instance for metrics.
MetricsInstanceType string `default:"t3.xlarge" validate:"notempty"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we declare this slightly up along with AppInstanceType and AgentInstanceType?

@@ -147,5 +147,6 @@
"BlockProfileRate": 0
},
"CustomTags": {
}
},
"MetricsInstanceType": "t3.xlarge"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's declare this up with the other instanceType declarations.

Comment on lines 143 to 146
"IncludeFiles": strings.Join([]string{
"/home/ubuntu/mattermost-load-test-ng/ltagent.log",
"/home/ubuntu/mattermost-load-test-ng/ltcoordinator.log",
}, ", "),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if a log file gets rolled over to file.log.1? Is that handled as well? Not trying to increase the scope here, but a service name is a simpler abstraction to keep in mind.

Comment on lines +20 to +21
sudo sed -i 's/User=.*/User=ubuntu/g' /lib/systemd/system/otelcol-contrib.service && \
sudo sed -i 's/Group=.*/Group=ubuntu/g' /lib/systemd/system/otelcol-contrib.service && \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks brittle. Would it be worth having our own service file and just copy-pasting it over to the instance? 0/5

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say that having our own service file copied brings other forms of brittleness to the table, so I'd keep this as is.

Copy link
Member

@agnivade agnivade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@agnivade agnivade added 4: Reviews Complete All reviewers have approved the pull request and removed 2: Dev Review Requires review by a core committer labels Oct 10, 2024
@agarciamontoro agarciamontoro merged commit b3fceaf into master Oct 10, 2024
1 check passed
@agarciamontoro agarciamontoro deleted the MM-60651.loki branch October 10, 2024 08:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
4: Reviews Complete All reviewers have approved the pull request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants