The LGTM stack, by Grafana Labs, combines best-in-class open-source tools to provide comprehensive system visibility. It consists of:
- Loki: Log aggregation system https://grafana.com/oss/loki/
- Grafana: Interface & Dashboards https://grafana.com/oss/grafana/
- Tempo: Distributed tracing storage and management https://grafana.com/oss/tempo/
- Mimir: Long-term metrics storage for Prometheus https://grafana.com/oss/mimir/
With this stack we get a complete observability solution covering logs, metrics, and traces, with support for high availability and scalability. All data is available in a single place (Grafana), which makes it easier to analyze and correlate events, and because object storage is used as the backend, the solution is considerably more economical than alternatives that require dedicated databases or persistent disks.
The architecture of the LGTM stack in a Kubernetes environment follows a well-defined flow of data collection, processing, and visualization:
1. Applications send telemetry data to an agent, in this case the OpenTelemetry Collector.
2. The OpenTelemetry Collector acts as a central hub, routing each type of data to its specific backend:
   - Loki: for log processing
   - Mimir: for metrics storage
   - Tempo: for trace analysis
3. Data is stored in object storage, with a dedicated bucket for each tool.
4. Grafana is the interface where all data is queried, allowing for unified dashboards and alerts.
This architecture also includes three optional components:
- Prometheus: collects custom metrics from applications and the cluster and sends them to Mimir
- Kube-state-metrics: collects metrics (CPU/memory) for services and applications through the API server and exposes them to Prometheus
- Promtail: an agent that captures container logs and sends them to Loki
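In practice, applications discover the collector through the standard OpenTelemetry SDK environment variables. A minimal sketch (the otel-collector service name and the monitoring namespace match the manifests used later in this guide):
# Point an instrumented application at the collector (standard OTel SDK env vars)
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector.monitoring.svc.cluster.local:4318
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_SERVICE_NAME=my-app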
Local:
- 2-4 CPUs
- 8 GB RAM
Production setup:
- Requirements can vary a lot depending on data volume and traffic; it's recommended to start with a small setup and scale as needed. For small to mid-sized environments, the following minimums are recommended:
- 8 CPUs
- 24 GB RAM
- 100 GB disk space (SSD; not counting the storage backends)
- Helm v3+
- kubectl
- For GCP: gcloud CLI
Note: This guide uses the official lgtm-distributed Helm chart from Grafana Labs for deployment.
To simplify the installation process, you can use the Makefile commands:
# Clone repository
git clone git@github.com:daviaraujocc/lgtm-stack.git
cd lgtm-stack
make install-local # For local testing; for GCP cloud storage, use make install-gcp and set PROJECT_ID
This installs the LGTM stack with the default local configuration and its dependencies (Promtail, dashboards, Prometheus, MinIO). If you want to customize the installation, edit the helm/values-lgtm.local.yaml file.
# Clone repository
git clone git@github.com:daviaraujocc/lgtm-stack.git
cd lgtm-stack
# Add repositories & create namespace
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
kubectl create ns monitoring
# Install prometheus operator for metrics collection and CRDs
helm install prometheus-operator --version 66.3.1 -n monitoring \
prometheus-community/kube-prometheus-stack -f helm/values-prometheus.yaml
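Before continuing, it's worth confirming that the operator pods are up and its CRDs (needed for the PodMonitor used later) were registered:
# Operator pods should be Running
kubectl get pods -n monitoring
# CRDs such as podmonitors/servicemonitors should be listed
kubectl get crds | grep monitoring.coreos.com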
For local testing scenarios. Uses local storage via MinIO.
helm install lgtm --version 2.1.0 -n monitoring \
grafana/lgtm-distributed -f helm/values-lgtm.local.yaml
For production environments, using GCP resources for storage and monitoring.
- Set up GCP resources:
# Set your project ID
export PROJECT_ID=your-project-id
# Create buckets with random suffix
export BUCKET_SUFFIX=$(openssl rand -hex 4 | tr -d "\n")
for bucket in logs traces metrics metrics-admin; do
gsutil mb -p ${PROJECT_ID} -c standard -l us-east1 gs://lgtm-${bucket}-${BUCKET_SUFFIX}
done
# Update bucket names in config
sed -i -E "s/(bucket_name:\s*lgtm-[^[:space:]]+)/\1-${BUCKET_SUFFIX}/g" helm/values-lgtm.gcp.yaml
# Create and configure service account
gcloud iam service-accounts create lgtm-monitoring \
--display-name "LGTM Monitoring" \
--project ${PROJECT_ID}
# Set permissions
for bucket in logs traces metrics metrics-admin; do
gsutil iam ch serviceAccount:lgtm-monitoring@${PROJECT_ID}.iam.gserviceaccount.com:admin \
gs://lgtm-${bucket}-${BUCKET_SUFFIX}
done
# Create service account key and secret
gcloud iam service-accounts keys create key.json \
--iam-account lgtm-monitoring@${PROJECT_ID}.iam.gserviceaccount.com
kubectl create secret generic lgtm-sa --from-file=key.json -n monitoring
- Install LGTM stack:
You can change values in helm/values-lgtm.gcp.yaml to fit your environment, for example to enable an ingress for Grafana.
helm install lgtm --version 2.1.0 -n monitoring \
grafana/lgtm-distributed -f helm/values-lgtm.gcp.yaml
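With the buckets, service account key, and chart in place, a quick sanity check using the names created above:
# Buckets created earlier should be listed
gsutil ls -p ${PROJECT_ID} | grep lgtm-
# The service account key secret used by the chart
kubectl get secret lgtm-sa -n monitoring
# Helm release status
helm status lgtm -n monitoring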
# Install Promtail for collecting container logs
# Check if you are using Docker or CRI-O runtime
## Docker runtime
kubectl apply -f manifests/promtail.docker.yaml
## CRI-O runtime
## kubectl apply -f manifests/promtail.cri.yaml
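Promtail runs as a DaemonSet, so once the manifest is applied you should see one pod per node (this assumes the manifest deploys into the monitoring namespace):
# One promtail pod per node
kubectl get pods -n monitoring -o wide | grep promtail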
After installation you can check components by running:
# Check if all pods are running
kubectl get pods -n monitoring
# To check logs
# Loki
kubectl logs -l app.kubernetes.io/name=loki -n monitoring
# Tempo
kubectl logs -l app.kubernetes.io/name=tempo -n monitoring
# Mimir
kubectl logs -l app.kubernetes.io/name=mimir -n monitoring
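To wait for the whole stack to become ready in one shot instead of checking pods individually, you can use kubectl wait:
# Block until every pod in the namespace reports Ready (or the timeout expires)
kubectl wait --for=condition=Ready pods --all -n monitoring --timeout=300s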
Follow the steps below to test each component:
# Access dashboard
kubectl port-forward svc/lgtm-grafana 3000:80 -n monitoring
# Get password credentials
kubectl get secret --namespace monitoring lgtm-grafana -o jsonpath="{.data.admin-password}" | base64 --decode
- Default username: admin
- Access URL: http://localhost:3000
- Check default Grafana dashboards and Explore tab
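With the port-forward active, you can also confirm Grafana is healthy from the command line before opening the UI:
# Grafana health endpoint; should report "database": "ok"
curl -s http://localhost:3000/api/health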
After installation, verify each component is working correctly:
Test log ingestion and querying:
# Forward Loki port
kubectl port-forward svc/lgtm-loki-distributor 3100:3100 -n monitoring
# Send test log with timestamp and labels
curl -XPOST http://localhost:3100/loki/api/v1/push -H "Content-Type: application/json" -d '{
"streams": [{
"stream": { "app": "test", "level": "info" },
"values": [[ "'$(date +%s)000000000'", "Test log message" ]]
}]
}'
To verify:
- Open Grafana (http://localhost:3000)
- Go to Explore > Select Loki datasource
- Query using labels:
{app="test", level="info"}
- You should see your test message in the results
If you installed Promtail, you can also browse container logs in the Explore tab.
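If the test message doesn't show up, you can check against the same distributor port-forward that the push was accepted (the metric below is Loki's standard distributor counter):
# Distributor readiness
curl -s http://localhost:3100/ready
# Counter should increase after each successful push
curl -s http://localhost:3100/metrics | grep loki_distributor_lines_received_total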
Since Tempo is compatible with OTLP (the OpenTelemetry Protocol), we will use jaeger-tracegen, Jaeger's trace generator, a tool that generates example traces and sends them over OTLP.
# Forward Tempo port
kubectl port-forward svc/lgtm-tempo-distributor 4318:4318 -n monitoring
# Generate sample traces with service name 'test'
docker run --add-host=host.docker.internal:host-gateway --env=OTEL_EXPORTER_OTLP_ENDPOINT=http://host.docker.internal:4318 jaegertracing/jaeger-tracegen -service test -traces 10
To verify:
- Go to Explore > Select Tempo datasource
- Search by Service Name: 'test'
- You should see 10 traces with different spans
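If you prefer not to use Docker, a single span can also be pushed straight to the OTLP HTTP endpoint with curl. This is a minimal sketch (field values are illustrative and it assumes the 4318 port-forward above is still active); the trace will appear under the service name curl-test:
TRACE_ID=$(openssl rand -hex 16)
SPAN_ID=$(openssl rand -hex 8)
NOW=$(date +%s)
curl -X POST http://localhost:4318/v1/traces -H "Content-Type: application/json" -d '{
  "resourceSpans": [{
    "resource": { "attributes": [{ "key": "service.name", "value": { "stringValue": "curl-test" } }] },
    "scopeSpans": [{
      "spans": [{
        "traceId": "'$TRACE_ID'",
        "spanId": "'$SPAN_ID'",
        "name": "manual-test-span",
        "kind": 1,
        "startTimeUnixNano": "'$NOW'000000000",
        "endTimeUnixNano": "'$((NOW + 1))'000000000"
      }]
    }]
  }]
}'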
Since we have a Prometheus instance running inside the cluster sending basic metrics (CPU/Memory) to Mimir, you can already check the metrics in Grafana:
- Access Grafana
- Go to Explore > Select Mimir datasource
- Try these example queries:
- CPU usage: rate(container_cpu_usage_seconds_total[5m])
- Container memory usage: container_memory_usage_bytes
You can also push custom metrics to Mimir using Prometheus Pushgateway, via the endpoint http://lgtm-mimir-nginx.monitoring:80/api/v1/push.
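You can also query Mimir's Prometheus-compatible API directly, bypassing Grafana. A sketch assuming the chart's default gateway service name (run the port-forward in a separate terminal); depending on the tenant configuration you may need the X-Scope-OrgID header:
# Forward the Mimir gateway
kubectl port-forward svc/lgtm-mimir-nginx 8080:80 -n monitoring
# Instant query against the Prometheus-compatible endpoint
curl -sG http://localhost:8080/prometheus/api/v1/query \
  --data-urlencode 'query=rate(container_cpu_usage_seconds_total[5m])' \
  -H 'X-Scope-OrgID: anonymous'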
OpenTelemetry is a set of APIs, libraries, agents, and instrumentation to provide observability for cloud-native software. It consists of three main components:
- OpenTelemetry SDK: Libraries for instrumenting applications to collect telemetry data (traces, metrics, logs).
- OpenTelemetry Collector: A vendor-agnostic agent that collects, processes, and exports telemetry data to backends.
- OpenTelemetry Protocol (OTLP): A standard for telemetry data exchange between applications and backends.
In this setup, we will use the OpenTelemetry Collector to route telemetry data to the appropriate backends (Loki, Tempo, Mimir).
To install the OpenTelemetry Collector:
# Install OpenTelemetry Collector
kubectl apply -f manifests/otel-collector.yaml
Check if the collector is up and running:
kubectl get pods -l app=otel-collector -n monitoring
kubectl logs -l app=otel-collector -n monitoring
We'll use a pre-instrumented Flask application (source code at flask-app/) that generates traces, metrics, and logs using OpenTelemetry.
The application exposes a /random endpoint that returns random numbers and generates telemetry data. The default endpoint for sending telemetry data is http://otel-collector:4318.
- Deploy the sample application:
# Deploy sample app
kubectl apply -f manifests/app/flask-app.yaml
- Verify application deployment:
kubectl get pods -l app=flask-app -n monitoring
kubectl get svc flask-app-service -n monitoring
- Apply PodMonitor for metrics scraping:
kubectl apply -f manifests/app/podmonitor.yaml
- Generate traffic to the application:
# Port-forward the application to reach it locally
kubectl port-forward svc/flask-app 8000:8000 -n monitoring
# Send requests to generate telemetry data
for i in {1..50}; do
curl http://localhost:8000/random
sleep 0.5
done
- Check the generated telemetry data in Grafana:
Traces (Tempo):
- Go to Explore > Select Tempo datasource
- Search for Service Name: flask-app
- You should see traces with GET /random operations
Metrics (Mimir):
- Go to Explore > Select Mimir datasource
- Try this query:
# Total requests count
rate(request_count_total[5m])
Logs (Loki):
- Go to Explore > Select Loki datasource
- Query using labels: {job="flask-app"}
You should see structured logs from the application.
If you want to add new labels to logs in Loki through the OpenTelemetry Collector, perform the following configuration:
- Edit the otel-collector-config ConfigMap
- Locate the processors.attributes/loki section
- Add your custom labels to the loki.attribute.labels list:
processors:
  attributes/loki:
    actions:
      - action: insert
        key: loki.format
        value: raw
      - action: insert
        key: loki.attribute.labels
        value: facility, level, source, host, app, namespace, pod, container, job, your_label
After modifying the ConfigMap, restart the collector pod to apply the changes:
kubectl rollout restart daemonset/otel-collector -n monitoring
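A quick way to confirm the change was picked up (names match the ConfigMap and DaemonSet used above):
# The new label should appear in the rendered config
kubectl get configmap otel-collector-config -n monitoring -o yaml | grep loki.attribute.labels
# Wait for the restarted pods and check for config errors
kubectl rollout status daemonset/otel-collector -n monitoring
kubectl logs -l app=otel-collector -n monitoring --tail=20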
# Using Makefile
make uninstall
# or manual
# Remove LGTM stack
helm uninstall lgtm -n monitoring
# Remove prometheus operator
helm uninstall prometheus-operator -n monitoring
# Remove namespace
kubectl delete ns monitoring
# Remove promtail & otel-collector
kubectl delete -f manifests/promtail.docker.yaml # or manifests/promtail.cri.yaml, matching what you applied
kubectl delete -f manifests/otel-collector.yaml
# For GCP setup, cleanup:
for bucket in logs traces metrics metrics-admin; do
gsutil rm -r gs://lgtm-${bucket}-${BUCKET_SUFFIX}
done
gcloud iam service-accounts delete lgtm-monitoring@${PROJECT_ID}.iam.gserviceaccount.com