
"counter cannot decrease in value" panic when bloom filtering is applied #13300

Open

Description

Describe the bug

In the index-gateway, bloomquerier.FilterChunkRefs appears to panic because more "postFilter" chunks are returned than "preFilter" chunks. The actual panic is in the Prometheus counter.Add call, which panics if the value passed to it is less than 0.

With debug logging enabled, I can see that preFilterChunks is sometimes smaller than postFilterChunks. Glancing at the code, the panic occurs when filteredChunks (which appears to be preFilterChunks minus postFilterChunks) comes out negative and is then added to the Prometheus counter. Here are some examples of FilterChunkRefs calls that log negative filteredChunks values:

ts=2024-06-24T21:26:03.76527129Z caller=spanlogger.go:109 component=index-gateway method=bloomquerier.FilterChunkRefs user=fake level=debug tenant=fake from=2024-06-22T21:45:00Z through=2024-06-22T22:00:00Z responses=2 preFilterChunks=72 postFilterChunks=78 skippedChunks=0 filteredChunks=-6 preFilterSeries=40 postFilterSeries=40 skippedSeries=0 filteredSeries=0
ts=2024-06-24T21:26:03.795554211Z caller=spanlogger.go:109 component=index-gateway method=bloomquerier.FilterChunkRefs user=fake level=debug tenant=fake from=2024-06-22T21:30:00Z through=2024-06-22T21:45:00Z responses=2 preFilterChunks=89 postFilterChunks=90 skippedChunks=64 filteredChunks=-1 preFilterSeries=32 postFilterSeries=32 skippedSeries=10 filteredSeries=0
ts=2024-06-24T21:26:04.300281074Z caller=spanlogger.go:109 component=index-gateway method=bloomquerier.FilterChunkRefs user=fake level=debug tenant=fake from=2024-06-22T21:29:00Z through=2024-06-22T21:44:00Z responses=2 preFilterChunks=80 postFilterChunks=81 skippedChunks=2 filteredChunks=-1 preFilterSeries=49 postFilterSeries=49 skippedSeries=2 filteredSeries=0

This causes the query to fail but doesn't occur consistently.
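For illustration, here is a minimal sketch (not Loki's actual code; the metric name is made up) of why a negative delta triggers this panic: client_golang's Counter.Add panics with "counter cannot decrease in value" whenever it is given a negative value, so computing filteredChunks as preFilterChunks minus postFilterChunks and feeding it straight into the counter blows up whenever postFilterChunks is larger.

    package main

    import "github.com/prometheus/client_golang/prometheus"

    func main() {
    	// Hypothetical counter; the real metric lives inside bloomgateway.(*BloomQuerier).
    	filtered := prometheus.NewCounter(prometheus.CounterOpts{
    		Name: "example_bloom_filtered_chunks_total",
    		Help: "Illustration only, not the actual Loki metric.",
    	})

    	// Values taken from the second debug log line above.
    	preFilterChunks, postFilterChunks := 89, 90
    	filteredChunks := preFilterChunks - postFilterChunks // -1

    	// Counter.Add panics with "counter cannot decrease in value" for any negative value.
    	filtered.Add(float64(filteredChunks))
    }

Presumably either the post-filter result should never contain more chunks than the pre-filter input, or the metric update needs to guard against a negative delta (e.g. only calling Add when filteredChunks > 0).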

To Reproduce

We're running the latest pre-release build for 3.1.0 (k208-ede6941), and we were also able to reproduce this issue in the previous pre-release, k207.

Here's a query we're running that triggers this. It only occurs when the queried time range is covered by bloom filters, so the most recent data doesn't seem to trigger the issue, but if I run a query from now-48h to now-47h I can reproduce it.

{cluster=~".+dev.+", cluster!="omitted"} |= "7n5GcjXJg2iMof2Xrw"

Expected behavior
I would expect this query to run reliably, leveraging the bloom filters to filter chunks that aren't needed in the search.

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: Helm

Screenshots, Promtail config, or terminal output


panic: counter cannot decrease in value
goroutine 26869 [running]:
github.com/grafana/loki/v3/pkg/util/server.onPanic({0x20887c0, 0x40138800a0})
	/src/loki/pkg/util/server/recovery.go:57 +0x48
github.com/grafana/loki/v3/pkg/util/server.init.WithRecoveryHandler.func4.1({0x40108d88a8?, 0x13b77f0?}, {0x20887c0?, 0x40138800a0?})
	/src/loki/vendor/github.com/grpc-ecosystem/go-grpc-middleware/recovery/options.go:33 +0x34
github.com/grpc-ecosystem/go-grpc-middleware/recovery.recoverFrom({0x2f40c98?, 0x400722c510?}, {0x20887c0?, 0x40138800a0?}, 0x40108d8918?)
	/src/loki/vendor/github.com/grpc-ecosystem/go-grpc-middleware/recovery/interceptors.go:61 +0x38
github.com/grpc-ecosystem/go-grpc-middleware/recovery.UnaryServerInterceptor.func1.1()
	/src/loki/vendor/github.com/grpc-ecosystem/go-grpc-middleware/recovery/interceptors.go:29 +0x78
panic({0x20887c0?, 0x40138800a0?})
	/usr/local/go/src/runtime/panic.go:770 +0x124
github.com/prometheus/client_golang/prometheus.(*counter).Add(0x4011f5f630?, 0x4005272f00?)
	/src/loki/vendor/github.com/prometheus/client_golang/prometheus/counter.go:128 +0xfc
github.com/grafana/loki/v3/pkg/bloomgateway.(*BloomQuerier).FilterChunkRefs(0x4000c1d4a0, {0x2f40c98, 0x400722c510}, {0x2578f2b, 0x4}, 0x19041da3d60, 0x19041e7f900, {0x4007326008, 0x33b, 0x3ff}, ...)
	/src/loki/pkg/bloomgateway/querier.go:216 +0x14ac
github.com/grafana/loki/v3/pkg/indexgateway.(*Gateway).GetChunkRef(0x4000b6c848, {0x2f40c98, 0x400722c510}, 0x400b284fa0)
	/src/loki/pkg/indexgateway/gateway.go:254 +0x270
github.com/grafana/loki/v3/pkg/logproto._IndexGateway_GetChunkRef_Handler.func1({0x2f40c98?, 0x400722c510?}, {0x247a9c0?, 0x400b284fa0?})
	/src/loki/pkg/logproto/indexgateway.pb.go:754 +0xd0
github.com/grpc-ecosystem/go-grpc-middleware/recovery.UnaryServerInterceptor.func1({0x2f40c98?, 0x400722c510?}, {0x247a9c0?, 0x400b284fa0?}, 0x40108d9048?, 0x1da78c0?)
	/src/loki/vendor/github.com/grpc-ecosystem/go-grpc-middleware/recovery/interceptors.go:33 +0x98
google.golang.org/grpc.getChainUnaryHandler.func1({0x2f40c98, 0x400722c510}, {0x247a9c0, 0x400b284fa0})
	/src/loki/vendor/google.golang.org/grpc/server.go:1203 +0xa0
github.com/grafana/loki/v3/pkg/util/fakeauth.init.func2({0x2f40c98, 0x400722c4e0}, {0x247a9c0, 0x400b284fa0}, 0x400a302340?, 0x4007220e00)
	/src/loki/pkg/util/fakeauth/fake_auth.go:60 +0x80
google.golang.org/grpc.getChainUnaryHandler.func1({0x2f40c98, 0x400722c4e0}, {0x247a9c0, 0x400b284fa0})
	/src/loki/vendor/google.golang.org/grpc/server.go:1203 +0xa0
github.com/grafana/dskit/middleware.UnaryServerInstrumentInterceptor.func1({0x2f40c98, 0x400722c4e0}, {0x247a9c0, 0x400b284fa0}, 0x400a302340, 0x4007220dc0)
	/src/loki/vendor/github.com/grafana/dskit/middleware/grpc_instrumentation.go:46 +0x8c
google.golang.org/grpc.getChainUnaryHandler.func1({0x2f40c98, 0x400722c4e0}, {0x247a9c0, 0x400b284fa0})
	/src/loki/vendor/google.golang.org/grpc/server.go:1203 +0xa0
github.com/grafana/dskit/server.newServer.HTTPGRPCTracingInterceptor.func3({0x2f40c98?, 0x400722c4e0?}, {0x247a9c0?, 0x400b284fa0?}, 0x400a302340?, 0x400f42bba8?)
	/src/loki/vendor/github.com/grafana/dskit/middleware/http_tracing.go:74 +0x724
google.golang.org/grpc.getChainUnaryHandler.func1({0x2f40c98, 0x400722c4e0}, {0x247a9c0, 0x400b284fa0})
	/src/loki/vendor/google.golang.org/grpc/server.go:1203 +0xa0
github.com/opentracing-contrib/go-grpc.OpenTracingServerInterceptor.func1({0x2f40c98, 0x400722c330}, {0x247a9c0, 0x400b284fa0}, 0x400a302340, 0x4007220d40)
	/src/loki/vendor/github.com/opentracing-contrib/go-grpc/server.go:57 +0x2e8
google.golang.org/grpc.getChainUnaryHandler.func1({0x2f40c98, 0x400722c330}, {0x247a9c0, 0x400b284fa0})
	/src/loki/vendor/google.golang.org/grpc/server.go:1203 +0xa0
github.com/grafana/dskit/middleware.GRPCServerLog.UnaryServerInterceptor({{0x2f1ce00?, 0x4000dfbea0?}, 0x1?, 0x8f?}, {0x2f40c98, 0x400722c330}, {0x247a9c0, 0x400b284fa0}, 0x400a302340, 0x4007220d00)
	/src/loki/vendor/github.com/grafana/dskit/middleware/grpc_logging.go:54 +0x80
google.golang.org/grpc.NewServer.chainUnaryServerInterceptors.chainUnaryInterceptors.func1({0x2f40c98, 0x400722c330}, {0x247a9c0, 0x400b284fa0}, 0x400a302340, 0x245c100?)
	/src/loki/vendor/google.golang.org/grpc/server.go:1194 +0x88
github.com/grafana/loki/v3/pkg/logproto._IndexGateway_GetChunkRef_Handler({0x245dec0, 0x4000b6c848}, {0x2f40c98, 0x400722c330}, 0x4012abed80, 0x40009351c0)
	/src/loki/pkg/logproto/indexgateway.pb.go:756 +0x148
google.golang.org/grpc.(*Server).processUnaryRPC(0x4000ab4e00, {0x2f40c98, 0x400722c240}, {0x2f5cb00, 0x400019e000}, 0x4007b5be60, 0x4003807620, 0x4663bc0, 0x0)
	/src/loki/vendor/google.golang.org/grpc/server.go:1386 +0xb58
google.golang.org/grpc.(*Server).handleStream(0x4000ab4e00, {0x2f5cb00, 0x400019e000}, 0x4007b5be60)
	/src/loki/vendor/google.golang.org/grpc/server.go:1797 +0xb10
google.golang.org/grpc.(*Server).serveStreams.func2.1()
	/src/loki/vendor/google.golang.org/grpc/server.go:1027 +0x8c
created by google.golang.org/grpc.(*Server).serveStreams.func2 in goroutine 1039
	/src/loki/vendor/google.golang.org/grpc/server.go:1038 +0x13c

Here is our Loki config for reference:

    auth_enabled: false    
    bloom_compactor:
      ring:
        num_tokens: 50
      enabled: true
      worker_parallelism: 20
      max_compaction_parallelism: 2
      min_table_offset: 1
      max_table_offset: 7
      retention:
        enabled: true
        max_lookback_days: 7
    bloom_gateway:
      enabled: true
      worker_concurrency: 16
      block_query_concurrency: 32
      client:
        results_cache:
          compression: snappy
          cache:
            default_validity: 1h
            memcached:
              expiration: 1h
            memcached_client:
              addresses: dnssrvnoa+_memcached-client._tcp.loki-prototype-chunks-cache.observability.svc
        cache_results: true
        addresses: dnssrvnoa+_grpc._tcp.loki-prototype-bloom-gateway-headless.observability.svc.cluster.local
    chunk_store_config:
      chunk_cache_config:
        background:
          writeback_buffer: 500000
          writeback_goroutines: 1
          writeback_size_limit: 500MB
        default_validity: 1h
        memcached:
          batch_size: 4
          parallelism: 5
        memcached_client:
          addresses: dnssrvnoa+_memcached-client._tcp.loki-prototype-chunks-cache.observability.svc
          consistent_hash: true
          max_idle_conns: 72
          timeout: 2000ms
    internal_server:
      http_server_read_timeout: 10m0s
      http_server_write_timeout: 10m0s
      http_server_idle_timeout: 10m0s
    common:
      compactor_address: 'http://loki-prototype-compactor:3100'
      path_prefix: /var/loki
      replication_factor: 3
      storage:
        s3:
          bucketnames: omitted
          insecure: false
          region: us-east-1
          s3forcepathstyle: false
    compactor:
      delete_request_cancel_period: 1h
      delete_request_store: s3
      retention_delete_delay: 2h
      retention_delete_worker_count: 150
      retention_enabled: true
    distributor:
      otlp_config:
        default_resource_attributes_as_index_labels:
        - service
        - namespace
        - cluster
        - deployment
        - container
    frontend:
      scheduler_address: loki-prototype-query-scheduler.observability.svc.cluster.local:9095
      tail_proxy_url: http://loki-prototype-querier.observability.svc.cluster.local:3100
    frontend_worker:
      scheduler_address: loki-prototype-query-scheduler.observability.svc.cluster.local:9095
    index_gateway:
      mode: simple
    ingester:
      chunk_encoding: snappy
      # chunk_target_size: 3145728    # Default is 1572864
      # max_chunk_age: 30m
      wal:
        enabled: false
    limits_config:
      allow_structured_metadata: true
      bloom_compactor_enable_compaction: true
      bloom_gateway_enable_filtering: true
      bloom_gateway_shard_size: 3
      bloom_ngram_length: 4
      ingestion_rate_mb: 2500
      max_cache_freshness_per_query: 10m
      max_concurrent_tail_requests: 200
      max_global_streams_per_user: 100000
      query_timeout: 300s
      reject_old_samples: true
      reject_old_samples_max_age: 168h
      retention_period: 7d
      split_queries_by_interval: 15m
      volume_enabled: true
      shard_streams:
        enabled: true
        logging_enabled: true
        desired_rate: 3072KB
    memberlist:
      join_members:
      - loki-memberlist
    pattern_ingester:
      enabled: false
    querier:
      max_concurrent: 32
    query_range:
      align_queries_with_step: true
      cache_results: true
      results_cache:
        cache:
          background:
            writeback_buffer: 500000
            writeback_goroutines: 1
            writeback_size_limit: 500MB
          default_validity: 12h
          memcached_client:
            addresses: dnssrvnoa+_memcached-client._tcp.loki-prototype-results-cache.observability.svc
            consistent_hash: true
            timeout: 500ms
            update_interval: 1m
    ruler:
      storage:
        s3:
          bucketnames: omitted
          insecure: false
          region: us-east-1
          s3forcepathstyle: false
        type: s3
    runtime_config:
      file: /etc/loki/runtime-config/runtime-config.yaml
    schema_config:
      configs:
      - from: "2024-04-01"
        index:
          period: 24h
          prefix: loki_index_
        object_store: s3
        schema: v13
        store: tsdb
    server:
      grpc_listen_port: 9095
      http_listen_port: 3100
      http_server_idle_timeout: 10m0s
      http_server_read_timeout: 600s
      http_server_write_timeout: 600s
    storage_config:
      max_parallel_get_chunk: 300
      boltdb_shipper:
        index_gateway_client:
          server_address: dns+loki-prototype-index-gateway-headless.observability.svc.cluster.local:9095
      hedging:
        at: 250ms
        max_per_second: 20
        up_to: 3
      tsdb_shipper:
        index_gateway_client:
          server_address: dns+loki-prototype-index-gateway-headless.observability.svc.cluster.local:9095
      bloom_shipper:
        blocks_cache:
          soft_limit: 768GiB
          hard_limit: 896GiB
    tracing:
      enabled: false
      profiling_enabled: false
