MM-60651: Add Loki to collect app, agents, and proxy logs (#818)
* Install Loki

- Install Loki in the metrics instance.
- Add a new rule to the metrics instance security group accepting traffic from anywhere to port :3100, where Loki will listen for log data.
- Add a new datasource to Grafana so that we can query Loki from the unified UI.

* Install and config OTel collector in app nodes

- Install the OpenTelemetry collector (the -contrib version, which is the one including the filelog receiver we use) in every app node.
- Add a configuration file templated with the files to be collected, the name and ID of the service, the metrics instance IP, and the specific operator.
- Add an operator for the app and agent logs, which parses the JSON lines, extracting the timestamp and severity from the timestamp and level fields respectively.
- Configure the OpenTelemetry collector using the configuration file template and corresponding operator, updating the signature of the setup{App,MM,Job}Server functions to pass the instance name.

* Install and config OTel collector in agent nodes

- Apply the same steps as in the previous commit, but this time collecting the agents' and coordinators' logs.

* Install and config OTel collector in proxy nodes

- Same as the previous two commits.
- Update the signature of the setupProxyServer function to receive the whole instance object, which gives access to the instance name.
- Collect the nginx error log file, which, unlike the app and agent logs, is not JSON but plain-text lines. We parse them using a new regex-based operator, which also extracts the timestamp and severity from the specified capturing groups.

* Add logs panel to Grafana dashboard

Add a new Logs overview row containing five panels:

1. A list of all the raw logs, pretty-printed for easily spotting patterns.
2. A count of log lines by level for the coordinator.
3. A count of log lines by level for the agents.
4. A count of log lines by level for the app nodes.
5. A count of log lines by level for the proxy nodes.
The log-count panels contain a hack: we need a dummy query (we use `sum(vector(0))`) for Grafana to show the whole time range selected. Otherwise, those panels automatically zoom to the smallest time range where there is at least one log line. This solution is copied from https://community.grafana.com/t/empty-values-not-shown/76173/26

* Parametrize the metrics instance type

* make assets

* Assign different services to agent/coordinator

Also, refactor test to make it testable. And test it.

* Accept any value in data map in fillConfigTemplate

* Simplify otelcolConfigTmpl

* Test the new renderXYZOtelcolConfig functions

* Deploy Loki sec group if there is a metrics server

* Move MetricsInstanceType closer to other types

* Revert unwanted dashboard changes

* Use the new service_name in queries

* Install Loki through the deb file from Github

* make assets