
bug: OOM when filtering lines after pattern #13427

Open

Description

Describe the bug

I have log lines generated by the fabio proxy.

The following query runs fine, finishes in a few milliseconds, and returns 2139 lines:

# RUNS OK
{job=~"fabio.*"}
|= `" "`
| pattern `<remote> [<time>] <service> <response_time> <status> <body_size> <scheme>://<dest> "<method> <path>`

However, the following query makes Loki allocate over 60 GB (!) of memory and fail with an out-of-memory (OOM) error:

# RESULTS IN OUT OF MEMORY
{job=~"fabio.*"}
|= `" "`
| pattern `<remote> [<time>] <service> <response_time> <status> <body_size> <scheme>://<dest> "<method> <path>`
|= `jenkins`

Today Loki was restarted several times because of this query alone while I was debugging the issue. Nothing else is happening on our Loki instance; the initial memory usage after startup is around 500 MB, and memory rose sharply every time right after the query was sent.

Adding |= `jenkins` before the pattern stage runs completely fine and finishes in milliseconds:

# RUNS OK
{job=~"fabio.*"}
|= `" "`
|= `jenkins`
| pattern `<remote> [<time>] <service> <response_time> <status> <body_size> <scheme>://<dest> "<method> <path>`

Removing the initial |= `" "` filter also works:

# RUNS OK
{job=~"fabio.*"}
| pattern `<remote> [<time>] <service> <response_time> <status> <body_size> <scheme>://<dest> "<method> <path>`
|= `jenkins`

Changing the order also works:

# RUNS OK
{job=~"fabio.*"}
|= `jenkins`
|= `" "`
| pattern `<remote> [<time>] <service> <response_time> <status> <body_size> <scheme>://<dest> "<method> <path>`

Shortening the pattern still results in OOM:

# ALSO CAUSES OOM
{job=~"fabio.*"}
|= `" "`
| pattern `<remote>`
|= `jenkins`
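
For comparison, here is a hypothetical variant that uses the regexp parser instead of pattern (untested sketch; only meant to narrow down whether the pattern stage itself is the trigger):

# SKETCH - does the same OOM occur with the regexp parser?
{job=~"fabio.*"}
|= `" "`
| regexp `(?P<remote>\S+)`
|= `jenkins`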

To Reproduce

  1. Run the failing query in our environment (for example, directly against the query API, as sketched below).
  2. Observe Loki OOM.
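
A hypothetical way to trigger it outside Grafana is to send the same query straight to the query_range API (host and time window below are placeholders, adjust as needed):

# SKETCH - adjust the host and the start/end window to cover the fabio logs
curl -G -s 'http://localhost:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={job=~"fabio.*"} |= `" "` | pattern `<remote> [<time>] <service> <response_time> <status> <body_size> <scheme>://<dest> "<method> <path>` |= `jenkins`' \
  --data-urlencode 'start=2024-06-01T00:00:00Z' \
  --data-urlencode 'end=2024-06-02T00:00:00Z' \
  --data-urlencode 'limit=100'
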
Loki config:
# loki.yml
# https://community.grafana.com/t/how-to-improve-loki-performance/42800/2
# https://github.com/grafana/loki/issues/4204#issuecomment-1260868356
# https://grafana.com/docs/loki/latest/configuration/query-frontend/
---
auth_enabled: false

server:
  http_listen_port: ${NOMAD_PORT_http:-3100}
  grpc_listen_port: ${NOMAD_PORT_grpc:-9095}
  log_level: warn
  # https://github.com/grafana/loki/issues/2271
  grpc_server_max_recv_msg_size: 8388608
  grpc_server_max_send_msg_size: 8388608

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  # https://community.grafana.com/t/loki-error-on-port-9095-error-contacting-scheduler/67263/2
  instance_interface_names:
    - "lo"
  ring:
    instance_addr: 127.0.0.1
    instance_interface_names:
      - "lo"
    kvstore:
      # store: inmemory
      store: consul
      prefix: loki-collectors/
      consul:
        host: ...:8500
        #host: 127.0.0.1:8500

# https://grafana.com/docs/loki/latest/operations/storage/retention/
# https://github.com/grafana/loki/issues/6300#issuecomment-1431887039
compactor:
  working_directory: /loki/retention
  delete_request_store: filesystem
  retention_enabled: true

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h
    - from: 2023-08-22
      store: tsdb
      object_store: filesystem
      schema: v12
      index:
        prefix: index_
        period: 24h
    # https://grafana.com/docs/loki/latest/storage/
    # https://grafana.com/docs/loki/latest/operations/storage/tsdb/
    # https://grafana.com/docs/loki/latest/setup/upgrade/#structured-metadata-open-telemetry-schemas-and-indexes
    - from: 2024-05-07
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
    cache_location: /loki/tsdb-cache

query_range:
  parallelise_shardable_queries: false
  #
  #
  # make queries more cache-able by aligning them with their step intervals
  align_queries_with_step: true
  max_retries: 5
  # parallelize queries in 5min intervals
  cache_results: true
  results_cache:
    cache:
      # We're going to use the in-process "FIFO" cache
      #enable_fifocache: true
      #fifocache:
        #max_size_bytes: 2GB
        #validity: 24h
      # https://github.com/grafana/loki/issues/7677
      embedded_cache:
        enabled: true
        max_size_mb: 2048
        ttl: 24h

querier:
  max_concurrent: 2048

frontend:
  max_outstanding_per_tenant: 32768
  #log_queries_longer_than: 15s
  #address: 0.0.0.0
  #compress_responses: true

limits_config:
  # The time to keep the logs.
  retention_period: 60d
  # allow backfilling
  reject_old_samples_max_age: 60d
  #
  #split_queries_by_interval: 5m
  #
  # Fix maximum active stream limit exceeded. https://github.com/grafana/loki/issues/3335
  max_global_streams_per_user: 100000
  #
  #max_cache_freshness_per_query: 10m
  #max_query_parallelism: 120
  ## for big logs tune
  # We typically recommend setting per_stream_rate_limit no higher than 5MB,
  # and per_stream_rate_limit_burst no higher than 20MB.
  per_stream_rate_limit: 5M
  per_stream_rate_limit_burst: 20M
  # https://github.com/grafana/loki/issues/4204#issuecomment-1260868356
  #cardinality_limit: 200000
  ingestion_burst_size_mb: 1000
  ingestion_rate_mb: 10000
  #max_entries_limit_per_query: 1000000
  #max_label_value_length: 20480
  #max_label_name_length: 10240
  #max_label_names_per_series: 300

analytics:
  reporting_enabled: false
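
As a stopgap while this is investigated, the query-time limits that are currently commented out above could be enabled to cap the blast radius of a single query (sketch only; the values below are illustrative, not tuned):

# SKETCH - possible guardrails in limits_config (illustrative values)
limits_config:
  split_queries_by_interval: 1h
  max_query_parallelism: 32
  max_entries_limit_per_query: 100000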

Loki Nomad job config:

job "loki" {
  ...
  group "loki" {
    count = 1
    restart {
      interval = "15s"
      attempts = 1
      mode = "delay"
    }
    update {
      auto_revert  = true
    }
    network {
      port "http" {
        static = 3100
      }
      port "grpc" {
        static = 9095
      }
    }
    task "loki" {
      driver = "docker"
      template {
        destination = "local/loki.yml"
        data        = file("loki.yml")
      }
      user = "3044335:10269"
      config {
        image = "grafana/loki:3.0.0"
        ports = [ "http", "grpc" ]
        network_mode = "host"
        args = [
          "-config.file=/local/loki.yml",
          "-config.expand-env=true",
          "-print-config-stderr",
        ]
        mount {
          type     = "bind"
          source   = ".../loki"
          target   = "/loki"
          readonly = false
        }
      }
      kill_timeout = "1m"
      resources {
        memory     = 60000
      }
      service {
        name = "loki"
        tags = [
          "urlprefix-...",
          "metrics",
          "metricspath-/metrics",
        ]
        port = "http"
        check {
          type           = "http"
          path           = "/ready"
          interval       = "2s"
          timeout        = "2s"
          initial_status = "passing"
        }
      }
    }
  }
}
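
On the Nomad side, a possible mitigation would be to give the Go runtime a soft memory target via GOMEMLIMIT so the garbage collector backs off well before the 60000 MB hard limit (sketch, assuming the grafana/loki:3.0.0 image respects the standard Go 1.19+ GOMEMLIMIT variable; the value is illustrative):

# SKETCH - inside the "loki" task stanza
env {
  GOMEMLIMIT = "48GiB"
}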

Expected behavior

I do not understand why adding a |= line filter over 2139 result lines causes Loki to allocate so much memory.

Environment:

  • Infrastructure: Fedora 29 host, Docker, loki:3.0.0, promtail:3.0.0
  • Deployment tool: Nomad

Screenshots, Promtail config, or terminal output

Loki job memory usage as monitored by Nomad + prometheus:

[screenshot]

Grafana view of the query; adding |= `jenkins` would result in OOM.

[screenshot]

Nomad view of the Loki job:

[screenshot]

Thanks!
