Description
opened on Jul 5, 2024
Describe the bug
I have log lines generated by the fabio proxy.
The following query runs fine, finishes in a few milliseconds, and returns 2139 lines:
# RUNS OK
{job=~"fabio.*"}
|= `" "`
| pattern `<remote> [<time>] <service> <response_time> <status> <body_size> <scheme>://<dest> "<method> <path>`
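For reference, a log line of the shape this pattern expects would look roughly like the one below. This is a made-up example, not copied from our logs; it is only meant to illustrate which fields the pattern captures (remote, time, service, response_time, status, body_size, scheme, dest, method, path):
10.1.2.3 [2024-07-05T10:00:00+02:00] jenkins-svc 0.012 200 1534 http://jenkins-01.internal:8080 "GET /job/build/123 HTTP/1.1" "curl/8.0"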
However, the following query makes Loki allocate over 60GB (!) of memory and fail with an out-of-memory (OOM) error:
# RESULTS IN OUT OF MEMORY
{job=~"fabio.*"}
|= `" "`
| pattern `<remote> [<time>] <service> <response_time> <status> <body_size> <scheme>://<dest> "<method> <path>`
|= `jenkins`
Loki was restarted several times today just because of this query while I was debugging the issue. Nothing else is happening on our Loki instance. The initial memory usage after startup is around 500MB, and it rose sharply every time right after the query was sent.
Adding |= `jenkins` before the pattern stage works completely fine and finishes in milliseconds:
# RUNS OK
{job=~"fabio.*"}
|= `" "`
|= `jenkins`
| pattern `<remote> [<time>] <service> <response_time> <status> <body_size> <scheme>://<dest> "<method> <path>`
Removing the initial |= `" "` filter also works:
# RUNS OK
{job=~"fabio.*"}
| pattern `<remote> [<time>] <service> <response_time> <status> <body_size> <scheme>://<dest> "<method> <path>`
|= `jenkins`
Changing the order also works:
# RUNS OK
{job=~"fabio.*"}
|= `jenkins`
|= `" "`
| pattern `<remote> [<time>] <service> <response_time> <status> <body_size> <scheme>://<dest> "<method> <path>`
Shortening the pattern still results in OOM:
# also causes OOM
{job=~"fabio.*"}
|= `" "`
| pattern `<remote>`
|= `jenkins`
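To summarize the experiments: every variant that OOMs has a line filter placed after the | pattern stage, no matter how short the pattern is, while every variant with all line filters before the parser stays fast. The general shape that works for us (pattern elided here) is:
# RUNS OK: all line filters before the parser stage
{job=~"fabio.*"}
|= `" "`
|= `jenkins`
| pattern `...`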
To Reproduce
- Run the query in our environment.
- Observe Loki OOM (see the reproduction sketch below).
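A minimal sketch of how the failing query can be sent from the command line, assuming Loki is reachable on the HTTP port from the config below; the address, time range, and limit are illustrative, not our exact values:
# hypothetical reproduction call using Loki's query_range HTTP API; the query is the OOM variant from above
curl -G -s 'http://127.0.0.1:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={job=~"fabio.*"} |= `" "` | pattern `<remote> [<time>] <service> <response_time> <status> <body_size> <scheme>://<dest> "<method> <path>` |= `jenkins`' \
  --data-urlencode 'since=1h' \
  --data-urlencode 'limit=5000'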
Loki config:
# loki.yml
# https://community.grafana.com/t/how-to-improve-loki-performance/42800/2
# https://github.com/grafana/loki/issues/4204#issuecomment-1260868356
# https://grafana.com/docs/loki/latest/configuration/query-frontend/
---
auth_enabled: false
server:
  http_listen_port: ${NOMAD_PORT_http:-3100}
  grpc_listen_port: ${NOMAD_PORT_grpc:-9095}
  log_level: warn
  # https://github.com/grafana/loki/issues/2271
  grpc_server_max_recv_msg_size: 8388608
  grpc_server_max_send_msg_size: 8388608
common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  # https://community.grafana.com/t/loki-error-on-port-9095-error-contacting-scheduler/67263/2
  instance_interface_names:
    - "lo"
  ring:
    instance_addr: 127.0.0.1
    instance_interface_names:
      - "lo"
    kvstore:
      # store: inmemory
      store: consul
      prefix: loki-collectors/
      consul:
        host: ...:8500
        #host: 127.0.0.1:8500
# https://grafana.com/docs/loki/latest/operations/storage/retention/
# https://github.com/grafana/loki/issues/6300#issuecomment-1431887039
compactor:
  working_directory: /loki/retention
  delete_request_store: filesystem
  retention_enabled: true
schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h
    - from: 2023-08-22
      store: tsdb
      object_store: filesystem
      schema: v12
      index:
        prefix: index_
        period: 24h
    # https://grafana.com/docs/loki/latest/storage/
    # https://grafana.com/docs/loki/latest/operations/storage/tsdb/
    # https://grafana.com/docs/loki/latest/setup/upgrade/#structured-metadata-open-telemetry-schemas-and-indexes
    - from: 2024-05-07
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h
storage_config:
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
    cache_location: /loki/tsdb-cache
query_range:
  parallelise_shardable_queries: false
  #
  #
  # make queries more cache-able by aligning them with their step intervals
  align_queries_with_step: true
  max_retries: 5
  # parallelize queries in 5min intervals
  cache_results: true
  results_cache:
    cache:
      # We're going to use the in-process "FIFO" cache
      #enable_fifocache: true
      #fifocache:
      #  max_size_bytes: 2GB
      #  validity: 24h
      # https://github.com/grafana/loki/issues/7677
      embedded_cache:
        enabled: true
        max_size_mb: 2048
        ttl: 24h
querier:
  max_concurrent: 2048
frontend:
  max_outstanding_per_tenant: 32768
  #log_queries_longer_than: 15s
  #address: 0.0.0.0
  #compress_responses: true
limits_config:
  # The time to keep the logs.
  retention_period: 60d
  # allow backfilling
  reject_old_samples_max_age: 60d
  #
  #split_queries_by_interval: 5m
  #
  # Fix maximum active stream limit exceeded. https://github.com/grafana/loki/issues/3335
  max_global_streams_per_user: 100000
  #
  #max_cache_freshness_per_query: 10m
  #max_query_parallelism: 120
  ## for big logs tune
  # We typically recommend setting per_stream_rate_limit no higher than 5MB,
  # and per_stream_rate_limit_burst no higher than 20MB.
  per_stream_rate_limit: 5M
  per_stream_rate_limit_burst: 20M
  # https://github.com/grafana/loki/issues/4204#issuecomment-1260868356
  #cardinality_limit: 200000
  ingestion_burst_size_mb: 1000
  ingestion_rate_mb: 10000
  #max_entries_limit_per_query: 1000000
  #max_label_value_length: 20480
  #max_label_name_length: 10240
  #max_label_names_per_series: 300
analytics:
  reporting_enabled: false
Loki Nomad job config:
job "loki" {
  ...
  group "loki" {
    count = 1
    restart {
      interval = "15s"
      attempts = 1
      mode = "delay"
    }
    update {
      auto_revert = true
    }
    network {
      port "http" {
        static = 3100
      }
      port "grpc" {
        static = 9095
      }
    }
    task "loki" {
      driver = "docker"
      template {
        destination = "local/loki.yml"
        data = file("loki.yml")
      }
      user = "3044335:10269"
      config {
        image = "grafana/loki:3.0.0"
        ports = [ "http", "grpc" ]
        network_mode = "host"
        args = [
          "-config.file=/local/loki.yml",
          "-config.expand-env=true",
          "-print-config-stderr",
        ]
        mount {
          type = "bind"
          source = ".../loki"
          target = "/loki"
          readonly = false
        }
      }
      kill_timeout = "1m"
      resources {
        memory = 60000
      }
      service {
        name = "loki"
        tags = [
          "urlprefix-...",
          "metrics",
          "metricspath-/metrics",
        ]
        port = "http"
        check {
          type = "http"
          path = "/ready"
          interval = "2s"
          timeout = "2s"
          initial_status = "passing"
        }
      }
    }
  }
}
Expected behavior
I do not understand why adding an |= filter over 2139 lines causes Loki to allocate so much memory.
Environment:
- Infrastructure: fedora29 host, docker loki:3.0.0, promtail:3.0.0
- Deployment tool: nomad
Screenshots, Promtail config, or terminal output
Loki job memory usage as monitored by Nomad + Prometheus:
Grafana view of the query. Adding |= `jenkins` would result in OOM:
Nomad job view of the loki job:
Thanks!