MM-60651: Add Loki to collect app, agents, and proxy logs (#818)
* Install Loki

- Install Loki in the metrics instance
- Add a new rule to the metrics instance security group accepting
  traffic from anywhere on port 3100, where Loki will listen for log
  data.
- Add a new datasource to Grafana so that we can query Loki from the
  unified UI.

* Install and config OTel collector in app nodes

- Install the OpenTelemetry collector (the -contrib version, which is
  the one including the filelog receiver we use) in every app node.
- Add a configuration file templated with the files to be collected, the
  name and ID of the service, the metrics instance IP, and the specific
  operator.
- Add an operator for the app and agent logs, which parses the JSON
  lines, extracting the timestamp and severity from the timestamp and
  level fields respectively.
- Configure the OpenTelemetry collector using the configuration file
  template and corresponding operator, updating the signature of the
  setup{App,MM,Job}Server functions to pass the instance name.
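
A minimal sketch of what such a templated collector config could look like — the file path, timestamp layout, template placeholder, and exporter endpoint below are illustrative assumptions, not the repository's actual template:

```yaml
receivers:
  filelog:
    include:
      - /opt/mattermost/logs/mattermost.log   # assumed log path
    operators:
      - type: json_parser            # parse each JSON log line
        timestamp:
          parse_from: attributes.timestamp
          layout: '2006-01-02 15:04:05.000 Z07:00'   # assumed layout
        severity:
          parse_from: attributes.level

exporters:
  otlphttp:
    # assumed: Loki 3.x OTLP ingestion endpoint on the metrics instance
    endpoint: http://{{.MetricsIP}}:3100/otlp

service:
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [otlphttp]
```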

* Install and config OTel collector in agent nodes

- Apply the same steps as in the previous commit, but this time
  collecting the agents' and coordinators' logs.

* Install and config OTel collector in proxy nodes

- Same as the previous two commits.
- Update the signature of the setupProxyServer function to receive the
  whole instance object, which gives access to the instance name.
- Collect the nginx error log file, which, unlike the app and agent
  logs, is not JSON but plain text; we parse it with a new regex-based
  operator, which also extracts the timestamp and severity from the
  specified capturing groups.
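
A sketch of such a regex-based operator — the exact regex, capture-group names, and timestamp layout here are assumptions, not the repository's actual config:

```yaml
operators:
  - type: regex_parser
    # nginx error-log lines look roughly like:
    #   2024/10/10 12:34:56 [error] 1234#0: *5 upstream timed out ...
    regex: '^(?P<time>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[(?P<severity>\w+)\] (?P<message>.*)$'
    timestamp:
      parse_from: attributes.time
      layout: '2006/01/02 15:04:05'
    severity:
      parse_from: attributes.severity
```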

* Add logs panel to Grafana dashboard

Add a new Logs overview row containing five panels:
1. A list of all the raw logs, pretty-printed for easily spotting
   patterns.
2. A count of log lines by level for the coordinator.
3. A count of log lines by level for the agents.
4. A count of log lines by level for the app nodes.
5. A count of log lines by level for the proxy nodes.

The log-count panels contain a hack: we need a dummy query (we use
`sum(vector(0))`) for Grafana to show the whole time range selected.
Otherwise, those panels automatically zoom to the smallest time range
where there is at least one log line. This solution is copied from
https://community.grafana.com/t/empty-values-not-shown/76173/26
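
Concretely, each log-count panel pairs the real Loki query with the dummy Prometheus query; the label names and selector below are illustrative assumptions, not the dashboard's exact queries:

```
# Query A (Loki): log-line count per level for one service
sum by (level) (count_over_time({service_name="coordinator"} [$__interval]))

# Query B (Prometheus): dummy series that forces the full time range
sum(vector(0))
```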

* Parametrize the metrics instance type

* make assets

* Assign different services to agent/coordinator

Also, refactor the code to make it testable, and test it.

* Accept any value in data map in fillConfigTemplate
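
This change can be sketched with Go's `text/template`: widening the data map to `map[string]any` lets callers pass nested values, such as a slice of log files, into the config template. The function body and example values below are assumptions for illustration, not the repository's exact code:

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// fillConfigTemplate renders tmpl with the given data map.
// Taking map[string]any (instead of a map of plain strings) lets
// callers pass nested values, e.g. a slice of log files to collect.
func fillConfigTemplate(tmpl string, data map[string]any) (string, error) {
	t, err := template.New("config").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := fillConfigTemplate(
		"service: {{.Name}}\nfiles: {{range .Files}}{{.}} {{end}}",
		map[string]any{
			"Name":  "app-0",
			"Files": []string{"/opt/mattermost/logs/mattermost.log"},
		},
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```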

* Simplify otelcolConfigTmpl

* Test the new renderXYZOtelcolConfig functions

* Deploy Loki sec group if there is a metrics server

* Move MetricsInstanceType closer to other types

* Revert unwanted dashboard changes

* Use the new service_name in queries

* Install Loki through the deb file from GitHub

* make assets
agarciamontoro authored Oct 10, 2024
1 parent 782ea07 commit b3fceaf
Showing 20 changed files with 1,515 additions and 90 deletions.
1 change: 1 addition & 0 deletions config/deployer.sample.json
@@ -8,6 +8,7 @@
   "ClusterSubnetID": "",
   "AppInstanceCount": 1,
   "AppInstanceType": "c7i.xlarge",
+  "MetricsInstanceType": "t3.xlarge",
   "AgentInstanceCount": 2,
   "AgentInstanceType": "c7i.xlarge",
   "ElasticSearchSettings": {
4 changes: 4 additions & 0 deletions config/deployer.sample.toml
@@ -24,6 +24,9 @@ EnableAgentFullLogs = true
 AppInstanceCount = 1
 AppInstanceType = 'c5.xlarge'
 
+# Metrics instance configuration
+MetricsInstanceType = 't3.xlarge'
+
 # Cluster configuration
 ClusterName = 'loadtest'
 ClusterSubnetID = ''
@@ -142,3 +145,4 @@ Password = 'mostest80098bigpass_'
UserName = 'mmuser'

[CustomTags]

2 changes: 2 additions & 0 deletions deployment/config.go
@@ -38,6 +38,8 @@ type Config struct {
 	AppInstanceCount int `default:"1" validate:"range:[0,)"`
 	// Type of the EC2 instance for app.
 	AppInstanceType string `default:"c7i.xlarge" validate:"notempty"`
+	// Type of the EC2 instance for metrics.
+	MetricsInstanceType string `default:"t3.xlarge" validate:"notempty"`
 	// Number of agents, first agent and coordinator will share the same instance.
 	AgentInstanceCount int `default:"2" validate:"range:[0,)"`
 	// Type of the EC2 instance for agent.
15 changes: 15 additions & 0 deletions deployment/terraform/agent.go
@@ -139,11 +139,19 @@ func (t *Terraform) configureAndRunAgents(extAgent *ssh.ExtAgent) error {
 	buf := bytes.NewBufferString("")
 	tpl.Execute(buf, tplVars)
 
+	otelcolConfig, err := renderAgentOtelcolConfig(instance.Tags.Name, t.output.MetricsServer.PrivateIP)
+	if err != nil {
+		mlog.Error("unable to render otelcol config", mlog.Int("agent", agentNumber), mlog.Err(err))
+		foundErr.Store(true)
+		return
+	}
+
 	batch := []uploadInfo{
 		{srcData: strings.TrimPrefix(buf.String(), "\n"), dstPath: "/lib/systemd/system/ltapi.service", msg: "Uploading load-test api service file"},
 		{srcData: strings.TrimPrefix(clientSysctlConfig, "\n"), dstPath: "/etc/sysctl.conf"},
 		{srcData: strings.TrimPrefix(limitsConfig, "\n"), dstPath: "/etc/security/limits.conf"},
 		{srcData: strings.TrimPrefix(prometheusNodeExporterConfig, "\n"), dstPath: "/etc/default/prometheus-node-exporter"},
+		{srcData: strings.TrimSpace(otelcolConfig), dstPath: "/etc/otelcol-contrib/config.yaml"},
 	}
 
 	if t.config.UsersFilePath != "" {
@@ -168,6 +176,13 @@ func (t *Terraform) configureAndRunAgents(extAgent *ssh.ExtAgent) error {
 		return
 	}
 
+	cmd = "sudo systemctl restart otelcol-contrib"
+	if out, err := sshc.RunCommand(cmd); err != nil {
+		mlog.Error("error running ssh command", mlog.Int("agent", agentNumber), mlog.String("cmd", cmd), mlog.String("out", string(out)), mlog.Err(err))
+		foundErr.Store(true)
+		return
+	}
+
 	if out, err := sshc.RunCommand("sudo sysctl -p"); err != nil {
 		mlog.Error("error running sysctl", mlog.String("output", string(out)), mlog.Err(err), mlog.Int("agent", agentNumber))
 		foundErr.Store(true)
48 changes: 24 additions & 24 deletions deployment/terraform/assets/bindata.go

Large diffs are not rendered by default.

12 changes: 11 additions & 1 deletion deployment/terraform/assets/cluster.tf
@@ -157,7 +157,7 @@ resource "aws_instance" "metrics_server" {
   }
 
   ami               = var.aws_ami
-  instance_type     = "t3.xlarge"
+  instance_type     = var.metrics_instance_type
   count             = var.app_instance_count > 0 ? 1 : 0
   key_name          = aws_key_pair.key.id
   availability_zone = var.aws_az
@@ -572,6 +572,16 @@ resource "aws_security_group_rule" "metrics-pyroscope" {
   security_group_id = aws_security_group.metrics[0].id
 }
 
+resource "aws_security_group_rule" "metrics-loki" {
+  count             = length(aws_security_group.metrics) > 0 ? 1 : 0
+  type              = "ingress"
+  from_port         = 3100
+  to_port           = 3100
+  protocol          = "tcp"
+  cidr_blocks       = ["0.0.0.0/0"]
+  security_group_id = aws_security_group.metrics[0].id
+}
+
 resource "aws_security_group_rule" "metrics-egress" {
   count = var.app_instance_count > 0 ? 1 : 0
   type  = "egress"
9 changes: 8 additions & 1 deletion deployment/terraform/assets/datasource.yaml
@@ -13,4 +13,11 @@ datasources:
     version: 1
     editable: true
     jsonData:
-      timeInterval: "5s"
+      timeInterval: "5s"
+  - name: Loki
+    type: loki
+    access: proxy
+    url: http://localhost:3100
+    jsonData:
+      timeout: 60
+      maxLines: 1000
