Description
Opened on Jun 24, 2024
Describe the bug
In the index-gateway, bloomquerier.FilterChunkRefs appears to panic because more "postFilter" chunks are returned than "preFilter" chunks. The actual panic is in the Prometheus counter.Add call, which panics if the value passed to it is less than 0.
With debug logging enabled, I can see that preFilterChunks is sometimes smaller than postFilterChunks. Glancing at the code, the panic occurs when filteredChunks (preFilterChunks minus postFilterChunks) is computed, comes out negative, and is added to the Prometheus counter. Here are some examples of FilterChunkRefs calls that appear to return filteredChunks values below 0:
ts=2024-06-24T21:26:03.76527129Z caller=spanlogger.go:109 component=index-gateway method=bloomquerier.FilterChunkRefs user=fake level=debug tenant=fake from=2024-06-22T21:45:00Z through=2024-06-22T22:00:00Z responses=2 preFilterChunks=72 postFilterChunks=78 skippedChunks=0 filteredChunks=-6 preFilterSeries=40 postFilterSeries=40 skippedSeries=0 filteredSeries=0
ts=2024-06-24T21:26:03.795554211Z caller=spanlogger.go:109 component=index-gateway method=bloomquerier.FilterChunkRefs user=fake level=debug tenant=fake from=2024-06-22T21:30:00Z through=2024-06-22T21:45:00Z responses=2 preFilterChunks=89 postFilterChunks=90 skippedChunks=64 filteredChunks=-1 preFilterSeries=32 postFilterSeries=32 skippedSeries=10 filteredSeries=0
ts=2024-06-24T21:26:04.300281074Z caller=spanlogger.go:109 component=index-gateway method=bloomquerier.FilterChunkRefs user=fake level=debug tenant=fake from=2024-06-22T21:29:00Z through=2024-06-22T21:44:00Z responses=2 preFilterChunks=80 postFilterChunks=81 skippedChunks=2 filteredChunks=-1 preFilterSeries=49 postFilterSeries=49 skippedSeries=2 filteredSeries=0
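For context on why this blows up: client_golang counters refuse negative increments, so any code path that computes filteredChunks as preFilterChunks - postFilterChunks and passes the result straight to counter.Add will panic as soon as postFilterChunks is the larger of the two. Here is a minimal sketch of that failure mode using the values from the first log line above; the metric name is made up for illustration and is not the one Loki registers:

package main

import "github.com/prometheus/client_golang/prometheus"

func main() {
	// Values taken from the first log line above.
	preFilterChunks, postFilterChunks := 72, 78

	// Hypothetical counter standing in for the bloom-querier metric.
	filteredChunks := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "example_bloom_filtered_chunks_total",
	})

	// Counter.Add panics with "counter cannot decrease in value" when given
	// a negative delta, which is exactly what a negative filteredChunks is.
	filteredChunks.Add(float64(preFilterChunks - postFilterChunks))
}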
This causes the query to fail but doesn't occur consistently.
To Reproduce
We're running the latest pre-release build for 3.1.0, k208-ede6941, and I was also able to reproduce this issue on the previous release, k207.
Here's a query we're running that triggers this. It only occurs when we're searching time periods that are covered by bloom filters, so the most recent data doesn't seem to trigger the issue; but if I run a query from now-48h to now-47h I can reproduce it.
{cluster=~".+dev.+", cluster!="omitted"} |= "7n5GcjXJg2iMof2Xrw"
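For completeness, this is roughly how I run the reproduction from the command line with logcli; the address is a placeholder for our environment's endpoint, and the timestamps are just one window that falls inside the bloom-covered range:

# Query a one-hour window roughly 48h in the past, which is covered by bloom filters.
# The --addr value is an example; substitute your own gateway/frontend endpoint.
logcli query \
  --addr="http://loki-prototype-gateway.observability.svc:3100" \
  --from="2024-06-22T21:00:00Z" \
  --to="2024-06-22T22:00:00Z" \
  '{cluster=~".+dev.+", cluster!="omitted"} |= "7n5GcjXJg2iMof2Xrw"'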
Expected behavior
I would expect this query to run reliably, leveraging the bloom filters to filter chunks that aren't needed in the search.
Environment:
- Infrastructure: Kubernetes
- Deployment tool: Helm
Screenshots, Promtail config, or terminal output
panic: counter cannot decrease in value
goroutine 26869 [running]:
github.com/grafana/loki/v3/pkg/util/server.onPanic({0x20887c0, 0x40138800a0})
/src/loki/pkg/util/server/recovery.go:57 +0x48
github.com/grafana/loki/v3/pkg/util/server.init.WithRecoveryHandler.func4.1({0x40108d88a8?, 0x13b77f0?}, {0x20887c0?, 0x40138800a0?})
/src/loki/vendor/github.com/grpc-ecosystem/go-grpc-middleware/recovery/options.go:33 +0x34
github.com/grpc-ecosystem/go-grpc-middleware/recovery.recoverFrom({0x2f40c98?, 0x400722c510?}, {0x20887c0?, 0x40138800a0?}, 0x40108d8918?)
/src/loki/vendor/github.com/grpc-ecosystem/go-grpc-middleware/recovery/interceptors.go:61 +0x38
github.com/grpc-ecosystem/go-grpc-middleware/recovery.UnaryServerInterceptor.func1.1()
/src/loki/vendor/github.com/grpc-ecosystem/go-grpc-middleware/recovery/interceptors.go:29 +0x78
panic({0x20887c0?, 0x40138800a0?})
/usr/local/go/src/runtime/panic.go:770 +0x124
github.com/prometheus/client_golang/prometheus.(*counter).Add(0x4011f5f630?, 0x4005272f00?)
/src/loki/vendor/github.com/prometheus/client_golang/prometheus/counter.go:128 +0xfc
github.com/grafana/loki/v3/pkg/bloomgateway.(*BloomQuerier).FilterChunkRefs(0x4000c1d4a0, {0x2f40c98, 0x400722c510}, {0x2578f2b, 0x4}, 0x19041da3d60, 0x19041e7f900, {0x4007326008, 0x33b, 0x3ff}, ...)
/src/loki/pkg/bloomgateway/querier.go:216 +0x14ac
github.com/grafana/loki/v3/pkg/indexgateway.(*Gateway).GetChunkRef(0x4000b6c848, {0x2f40c98, 0x400722c510}, 0x400b284fa0)
/src/loki/pkg/indexgateway/gateway.go:254 +0x270
github.com/grafana/loki/v3/pkg/logproto._IndexGateway_GetChunkRef_Handler.func1({0x2f40c98?, 0x400722c510?}, {0x247a9c0?, 0x400b284fa0?})
/src/loki/pkg/logproto/indexgateway.pb.go:754 +0xd0
github.com/grpc-ecosystem/go-grpc-middleware/recovery.UnaryServerInterceptor.func1({0x2f40c98?, 0x400722c510?}, {0x247a9c0?, 0x400b284fa0?}, 0x40108d9048?, 0x1da78c0?)
/src/loki/vendor/github.com/grpc-ecosystem/go-grpc-middleware/recovery/interceptors.go:33 +0x98
google.golang.org/grpc.getChainUnaryHandler.func1({0x2f40c98, 0x400722c510}, {0x247a9c0, 0x400b284fa0})
/src/loki/vendor/google.golang.org/grpc/server.go:1203 +0xa0
github.com/grafana/loki/v3/pkg/util/fakeauth.init.func2({0x2f40c98, 0x400722c4e0}, {0x247a9c0, 0x400b284fa0}, 0x400a302340?, 0x4007220e00)
/src/loki/pkg/util/fakeauth/fake_auth.go:60 +0x80
google.golang.org/grpc.getChainUnaryHandler.func1({0x2f40c98, 0x400722c4e0}, {0x247a9c0, 0x400b284fa0})
/src/loki/vendor/google.golang.org/grpc/server.go:1203 +0xa0
github.com/grafana/dskit/middleware.UnaryServerInstrumentInterceptor.func1({0x2f40c98, 0x400722c4e0}, {0x247a9c0, 0x400b284fa0}, 0x400a302340, 0x4007220dc0)
/src/loki/vendor/github.com/grafana/dskit/middleware/grpc_instrumentation.go:46 +0x8c
google.golang.org/grpc.getChainUnaryHandler.func1({0x2f40c98, 0x400722c4e0}, {0x247a9c0, 0x400b284fa0})
/src/loki/vendor/google.golang.org/grpc/server.go:1203 +0xa0
github.com/grafana/dskit/server.newServer.HTTPGRPCTracingInterceptor.func3({0x2f40c98?, 0x400722c4e0?}, {0x247a9c0?, 0x400b284fa0?}, 0x400a302340?, 0x400f42bba8?)
/src/loki/vendor/github.com/grafana/dskit/middleware/http_tracing.go:74 +0x724
google.golang.org/grpc.getChainUnaryHandler.func1({0x2f40c98, 0x400722c4e0}, {0x247a9c0, 0x400b284fa0})
/src/loki/vendor/google.golang.org/grpc/server.go:1203 +0xa0
github.com/opentracing-contrib/go-grpc.OpenTracingServerInterceptor.func1({0x2f40c98, 0x400722c330}, {0x247a9c0, 0x400b284fa0}, 0x400a302340, 0x4007220d40)
/src/loki/vendor/github.com/opentracing-contrib/go-grpc/server.go:57 +0x2e8
google.golang.org/grpc.getChainUnaryHandler.func1({0x2f40c98, 0x400722c330}, {0x247a9c0, 0x400b284fa0})
/src/loki/vendor/google.golang.org/grpc/server.go:1203 +0xa0
github.com/grafana/dskit/middleware.GRPCServerLog.UnaryServerInterceptor({{0x2f1ce00?, 0x4000dfbea0?}, 0x1?, 0x8f?}, {0x2f40c98, 0x400722c330}, {0x247a9c0, 0x400b284fa0}, 0x400a302340, 0x4007220d00)
/src/loki/vendor/github.com/grafana/dskit/middleware/grpc_logging.go:54 +0x80
google.golang.org/grpc.NewServer.chainUnaryServerInterceptors.chainUnaryInterceptors.func1({0x2f40c98, 0x400722c330}, {0x247a9c0, 0x400b284fa0}, 0x400a302340, 0x245c100?)
/src/loki/vendor/google.golang.org/grpc/server.go:1194 +0x88
github.com/grafana/loki/v3/pkg/logproto._IndexGateway_GetChunkRef_Handler({0x245dec0, 0x4000b6c848}, {0x2f40c98, 0x400722c330}, 0x4012abed80, 0x40009351c0)
/src/loki/pkg/logproto/indexgateway.pb.go:756 +0x148
google.golang.org/grpc.(*Server).processUnaryRPC(0x4000ab4e00, {0x2f40c98, 0x400722c240}, {0x2f5cb00, 0x400019e000}, 0x4007b5be60, 0x4003807620, 0x4663bc0, 0x0)
/src/loki/vendor/google.golang.org/grpc/server.go:1386 +0xb58
google.golang.org/grpc.(*Server).handleStream(0x4000ab4e00, {0x2f5cb00, 0x400019e000}, 0x4007b5be60)
/src/loki/vendor/google.golang.org/grpc/server.go:1797 +0xb10
google.golang.org/grpc.(*Server).serveStreams.func2.1()
/src/loki/vendor/google.golang.org/grpc/server.go:1027 +0x8c
created by google.golang.org/grpc.(*Server).serveStreams.func2 in goroutine 1039
/src/loki/vendor/google.golang.org/grpc/server.go:1038 +0x13c
Here is our Loki config for reference:
auth_enabled: false
bloom_compactor:
  ring:
    num_tokens: 50
  enabled: true
  worker_parallelism: 20
  max_compaction_parallelism: 2
  min_table_offset: 1
  max_table_offset: 7
  retention:
    enabled: true
    max_lookback_days: 7
bloom_gateway:
  enabled: true
  worker_concurrency: 16
  block_query_concurrency: 32
  client:
    results_cache:
      compression: snappy
      cache:
        default_validity: 1h
        memcached:
          expiration: 1h
        memcached_client:
          addresses: dnssrvnoa+_memcached-client._tcp.loki-prototype-chunks-cache.observability.svc
    cache_results: true
    addresses: dnssrvnoa+_grpc._tcp.loki-prototype-bloom-gateway-headless.observability.svc.cluster.local
chunk_store_config:
  chunk_cache_config:
    background:
      writeback_buffer: 500000
      writeback_goroutines: 1
      writeback_size_limit: 500MB
    default_validity: 1h
    memcached:
      batch_size: 4
      parallelism: 5
    memcached_client:
      addresses: dnssrvnoa+_memcached-client._tcp.loki-prototype-chunks-cache.observability.svc
      consistent_hash: true
      max_idle_conns: 72
      timeout: 2000ms
internal_server:
  http_server_read_timeout: 10m0s
  http_server_write_timeout: 10m0s
  http_server_idle_timeout: 10m0s
common:
  compactor_address: 'http://loki-prototype-compactor:3100'
  path_prefix: /var/loki
  replication_factor: 3
  storage:
    s3:
      bucketnames: omitted
      insecure: false
      region: us-east-1
      s3forcepathstyle: false
compactor:
  delete_request_cancel_period: 1h
  delete_request_store: s3
  retention_delete_delay: 2h
  retention_delete_worker_count: 150
  retention_enabled: true
distributor:
  otlp_config:
    default_resource_attributes_as_index_labels:
      - service
      - namespace
      - cluster
      - deployment
      - container
frontend:
  scheduler_address: loki-prototype-query-scheduler.observability.svc.cluster.local:9095
  tail_proxy_url: http://loki-prototype-querier.observability.svc.cluster.local:3100
frontend_worker:
  scheduler_address: loki-prototype-query-scheduler.observability.svc.cluster.local:9095
index_gateway:
  mode: simple
ingester:
  chunk_encoding: snappy
  # chunk_target_size: 3145728 # Default is 1572864
  # max_chunk_age: 30m
  wal:
    enabled: false
limits_config:
  allow_structured_metadata: true
  bloom_compactor_enable_compaction: true
  bloom_gateway_enable_filtering: true
  bloom_gateway_shard_size: 3
  bloom_ngram_length: 4
  ingestion_rate_mb: 2500
  max_cache_freshness_per_query: 10m
  max_concurrent_tail_requests: 200
  max_global_streams_per_user: 100000
  query_timeout: 300s
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  retention_period: 7d
  split_queries_by_interval: 15m
  volume_enabled: true
  shard_streams:
    enabled: true
    logging_enabled: true
    desired_rate: 3072KB
memberlist:
  join_members:
    - loki-memberlist
pattern_ingester:
  enabled: false
querier:
  max_concurrent: 32
query_range:
  align_queries_with_step: true
  cache_results: true
  results_cache:
    cache:
      background:
        writeback_buffer: 500000
        writeback_goroutines: 1
        writeback_size_limit: 500MB
      default_validity: 12h
      memcached_client:
        addresses: dnssrvnoa+_memcached-client._tcp.loki-prototype-results-cache.observability.svc
        consistent_hash: true
        timeout: 500ms
        update_interval: 1m
ruler:
  storage:
    s3:
      bucketnames: omitted
      insecure: false
      region: us-east-1
      s3forcepathstyle: false
    type: s3
runtime_config:
  file: /etc/loki/runtime-config/runtime-config.yaml
schema_config:
  configs:
    - from: "2024-04-01"
      index:
        period: 24h
        prefix: loki_index_
      object_store: s3
      schema: v13
      store: tsdb
server:
  grpc_listen_port: 9095
  http_listen_port: 3100
  http_server_idle_timeout: 10m0s
  http_server_read_timeout: 600s
  http_server_write_timeout: 600s
storage_config:
  max_parallel_get_chunk: 300
  boltdb_shipper:
    index_gateway_client:
      server_address: dns+loki-prototype-index-gateway-headless.observability.svc.cluster.local:9095
  hedging:
    at: 250ms
    max_per_second: 20
    up_to: 3
  tsdb_shipper:
    index_gateway_client:
      server_address: dns+loki-prototype-index-gateway-headless.observability.svc.cluster.local:9095
  bloom_shipper:
    blocks_cache:
      soft_limit: 768GiB
      hard_limit: 896GiB
tracing:
  enabled: false
  profiling_enabled: false