Description
opened on Jul 5, 2024
Describe the bug
I have log lines generated by the fabio proxy.
The following query runs fine, finishes in a few milliseconds, and returns 2139 lines:
# RUNS OK
{job=~"fabio.*"}
|= `" "`
| pattern `<remote> [<time>] <service> <response_time> <status> <body_size> <scheme>://<dest> "<method> <path>`
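For reference, a log line of the shape this pattern expects would look roughly like the one below. This is a made-up example, not copied from our logs; it is only meant to illustrate which fields the pattern captures (remote, time, service, response_time, status, body_size, scheme, dest, method, path):
10.1.2.3 [2024-07-05T10:00:00+02:00] jenkins-svc 0.012 200 1534 http://jenkins-01.internal:8080 "GET /job/build/123 HTTP/1.1" "curl/8.0"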
However, the following query makes Loki allocate over 60GB (!) of memory and fail with an out-of-memory (OOM) error:
# RESULTS IN OUT OF MEMORY
{job=~"fabio.*"}
|= `" "`
| pattern `<remote> [<time>] <service> <response_time> <status> <body_size> <scheme>://<dest> "<method> <path>`
|= `jenkins`
Loki was restarted several times today just because of this query while I was debugging the issue. Nothing else is happening on our Loki instance. The initial memory usage after startup is around 500MB, and it rose sharply every time right after the query was sent.
Adding |= `jenkins` before the pattern stage works completely fine and finishes in milliseconds:
# RUNS OK
{job=~"fabio.*"}
|= `" "`
|= `jenkins`
| pattern `<remote> [<time>] <service> <response_time> <status> <body_size> <scheme>://<dest> "<method> <path>`
Removing the initial |= `" "` filter also works:
# RUNS OK
{job=~"fabio.*"}
| pattern `<remote> [<time>] <service> <response_time> <status> <body_size> <scheme>://<dest> "<method> <path>`
|= `jenkins`
Changing the order also works:
# RUNS OK
{job=~"fabio.*"}
|= `jenkins`
|= `" "`
| pattern `<remote> [<time>] <service> <response_time> <status> <body_size> <scheme>://<dest> "<method> <path>`
Shortening the pattern still results in OOM:
# also causes OOM
{job=~"fabio.*"}
|= `" "`
| pattern `<remote>`
|= `jenkins`
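To summarize the experiments: every variant that OOMs has a line filter placed after the | pattern stage, no matter how short the pattern is, while every variant with all line filters before the parser stays fast. The general shape that works for us (pattern elided here) is:
# RUNS OK: all line filters before the parser stage
{job=~"fabio.*"}
|= `" "`
|= `jenkins`
| pattern `...`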
To Reproduce
- Run the query in our environment.
- Observe Loki OOM (see the reproduction sketch below).
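A minimal sketch of how the failing query can be sent from the command line, assuming Loki is reachable on the HTTP port from the config below; the address, time range, and limit are illustrative, not our exact values:
# hypothetical reproduction call using Loki's query_range HTTP API; the query is the OOM variant from above
curl -G -s 'http://127.0.0.1:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={job=~"fabio.*"} |= `" "` | pattern `<remote> [<time>] <service> <response_time> <status> <body_size> <scheme>://<dest> "<method> <path>` |= `jenkins`' \
  --data-urlencode 'since=1h' \
  --data-urlencode 'limit=5000'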
Loki config:
# loki.yml
# https://community.grafana.com/t/how-to-improve-loki-performance/42800/2
# https://github.com/grafana/loki/issues/4204#issuecomment-1260868356
# https://grafana.com/docs/loki/latest/configuration/query-frontend/
---
auth_enabled: false
server:
  http_listen_port: ${NOMAD_PORT_http:-3100}
  grpc_listen_port: ${NOMAD_PORT_grpc:-9095}
  log_level: warn
  # https://github.com/grafana/loki/issues/2271
  grpc_server_max_recv_msg_size: 8388608
  grpc_server_max_send_msg_size: 8388608
common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  # https://community.grafana.com/t/loki-error-on-port-9095-error-contacting-scheduler/67263/2
  instance_interface_names:
    - "lo"
  ring:
    instance_addr: 127.0.0.1
    instance_interface_names:
      - "lo"
    kvstore:
      # store: inmemory
      store: consul
      prefix: loki-collectors/
      consul:
        host: ...:8500
        #host: 127.0.0.1:8500
# https://grafana.com/docs/loki/latest/operations/storage/retention/
# https://github.com/grafana/loki/issues/6300#issuecomment-1431887039
compactor:
  working_directory: /loki/retention
  delete_request_store: filesystem
  retention_enabled: true
schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h
    - from: 2023-08-22
      store: tsdb
      object_store: filesystem
      schema: v12
      index:
        prefix: index_
        period: 24h
    # https://grafana.com/docs/loki/latest/storage/
    # https://grafana.com/docs/loki/latest/operations/storage/tsdb/
    # https://grafana.com/docs/loki/latest/setup/upgrade/#structured-metadata-open-telemetry-schemas-and-indexes
    - from: 2024-05-07
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h
storage_config:
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
    cache_location: /loki/tsdb-cache
query_range:
  parallelise_shardable_queries: false
  #
  #
  # make queries more cache-able by aligning them with their step intervals
  align_queries_with_step: true
  max_retries: 5
  # parallelize queries in 5min intervals
  cache_results: true
  results_cache:
    cache:
      # We're going to use the in-process "FIFO" cache
      #enable_fifocache: true
      #fifocache:
      #  max_size_bytes: 2GB
      #  validity: 24h
      # https://github.com/grafana/loki/issues/7677
      embedded_cache:
        enabled: true
        max_size_mb: 2048
        ttl: 24h
querier:
  max_concurrent: 2048
frontend:
  max_outstanding_per_tenant: 32768
  #log_queries_longer_than: 15s
  #address: 0.0.0.0
  #compress_responses: true
limits_config:
  # The time to keep the logs.
  retention_period: 60d
  # allow backfilling
  reject_old_samples_max_age: 60d
  #
  #split_queries_by_interval: 5m
  #
  # Fix maximum active stream limit exceeded. https://github.com/grafana/loki/issues/3335
  max_global_streams_per_user: 100000
  #
  #max_cache_freshness_per_query: 10m
  #max_query_parallelism: 120
  ## for big logs tune
  # We typically recommend setting per_stream_rate_limit no higher than 5MB,
  # and per_stream_rate_limit_burst no higher than 20MB.
  per_stream_rate_limit: 5M
  per_stream_rate_limit_burst: 20M
  # https://github.com/grafana/loki/issues/4204#issuecomment-1260868356
  #cardinality_limit: 200000
  ingestion_burst_size_mb: 1000
  ingestion_rate_mb: 10000
  #max_entries_limit_per_query: 1000000
  #max_label_value_length: 20480
  #max_label_name_length: 10240
  #max_label_names_per_series: 300
analytics:
  reporting_enabled: false
Loki Nomad job config:
job "loki" {
  ...
  group "loki" {
    count = 1
    restart {
      interval = "15s"
      attempts = 1
      mode = "delay"
    }
    update {
      auto_revert = true
    }
    network {
      port "http" {
        static = 3100
      }
      port "grpc" {
        static = 9095
      }
    }
    task "loki" {
      driver = "docker"
      template {
        destination = "local/loki.yml"
        data = file("loki.yml")
      }
      user = "3044335:10269"
      config {
        image = "grafana/loki:3.0.0"
        ports = [ "http", "grpc" ]
        network_mode = "host"
        args = [
          "-config.file=/local/loki.yml",
          "-config.expand-env=true",
          "-print-config-stderr",
        ]
        mount {
          type = "bind"
          source = ".../loki"
          target = "/loki"
          readonly = false
        }
      }
      kill_timeout = "1m"
      resources {
        memory = 60000
      }
      service {
        name = "loki"
        tags = [
          "urlprefix-...",
          "metrics",
          "metricspath-/metrics",
        ]
        port = "http"
        check {
          type = "http"
          path = "/ready"
          interval = "2s"
          timeout = "2s"
          initial_status = "passing"
        }
      }
    }
  }
}
Expected behavior
I do not understand why adding an |= filter over 2139 lines causes Loki to allocate so much memory.
Environment:
- Infrastructure: fedora29 host, docker loki:3.0.0, promtail:3.0.0
- Deployment tool: nomad
Screenshots, Promtail config, or terminal output
Loki job memory usage as monitored by Nomad + Prometheus:
Grafana view of the query. Adding |= `jenkins` would result in OOM:
Nomad job view of the loki job:
Thanks!