Description
When deploying Grafana Loki in Simple Scalable mode with multiple read pods in a Kubernetes cluster, you sometimes end up with a Loki read pod that cannot execute any queries. The problem shows up in Grafana as a 504 Gateway Timeout, similar to this issue, and is also linked to the 499 nginx issue found here.
Expected behavior
Grafana Loki should deploy without problems and should not end up with a "tainted" read pod for no reason.
Environment:
- Deployed using the official Loki Helm chart, version 6.10.0
- Deploying Grafana Loki version 3.2.1
- Deployed on an internal cloud, using Cilium version 1.16.4
- Storage is Azure Blob Storage
How to replicate:
- Deploy using helm upgrade or helm install (a command sketch is shown after this list). Here is my final Loki config file:
auth_enabled: false
common:
  compactor_address: 'http://loki-backend:3100'
  path_prefix: /var/loki
  replication_factor: 1
  storage:
    azure:
      account_key: CREDENTIALS
      account_name: CREDENTIALS
      container_name: loki
      request_timeout: 30s
      use_federated_token: false
      use_managed_identity: false
compactor:
  compaction_interval: 10m
  delete_request_cancel_period: 24h
  delete_request_store: azure
  retention_delete_delay: 2h
  retention_delete_worker_count: 150
  retention_enabled: true
  working_directory: /tmp
frontend:
  scheduler_address: ""
  tail_proxy_url: ""
frontend_worker:
  scheduler_address: ""
index_gateway:
  mode: simple
ingester:
  chunk_idle_period: 30m
  chunk_target_size: 1572864
  flush_check_period: 15s
  wal:
    replay_memory_ceiling: 1024MB
limits_config:
  allow_structured_metadata: true
  ingestion_burst_size_mb: 30
  ingestion_rate_mb: 30
  ingestion_rate_strategy: local
  max_cache_freshness_per_query: 10m
  max_chunks_per_query: 100
  max_concurrent_tail_requests: 100
  max_entries_limit_per_query: 1000
  max_global_streams_per_user: 50000
  max_label_names_per_series: 17
  max_line_size_truncate: false
  max_query_parallelism: 128
  max_query_series: 50
  max_streams_matchers_per_query: 100
  per_stream_rate_limit: 5Mb
  per_stream_rate_limit_burst: 20Mb
  query_timeout: 300s
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  retention_period: 14d
  shard_streams:
    desired_rate: 3000000
    enabled: true
  split_queries_by_interval: 1h
  tsdb_max_bytes_per_shard: 2GB
  tsdb_max_query_parallelism: 2048
  volume_enabled: true
memberlist:
  join_members:
  - loki-memberlist
pattern_ingester:
  enabled: false
query_range:
  align_queries_with_step: true
ruler:
  alertmanager_url: SECRET
  enable_alertmanager_v2: true
  enable_api: true
  storage:
    azure:
      account_key: CREDENTIAL
      account_name: CREDENTIAL
      container_name: loki
      request_timeout: 30s
      use_federated_token: false
      use_managed_identity: false
    type: azure
runtime_config:
  file: /etc/loki/runtime-config/runtime-config.yaml
schema_config:
  configs:
  - from: "2024-07-29"
    index:
      period: 24h
      prefix: index_
    object_store: azure
    schema: v13
    store: tsdb
server:
  grpc_listen_port: 9095
  grpc_server_max_recv_msg_size: 60000000
  grpc_server_max_send_msg_size: 60000000
  http_listen_port: 3100
  http_server_idle_timeout: 600s
  http_server_read_timeout: 600s
  http_server_write_timeout: 600s
  log_level: debug
storage_config:
  hedging:
    at: 250ms
    max_per_second: 20
    up_to: 3
  tsdb_shipper:
    index_gateway_client:
      log_gateway_requests: true
      server_address: dns+loki-backend-headless.6723a512e7641cd9c37269ed.svc.cluster.local:9095
tracing:
  enabled: true
- Deploy with the following: 3 read pods, 1 backend pod, 1 write pod and 1 gateway pod. Here are my Helm values:
backend:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: projectId
              operator: In
              values:
              - 27edfc6c-eb78-4790-a6c8-ed82a0478f7c
          topologyKey: kubernetes.io/hostname
        weight: 100
    podAntiAffinity: null
  extraEnv:
  - name: GOMEMLIMIT
    valueFrom:
      resourceFieldRef:
        resource: limits.memory
  - name: GOMAXPROCS
    valueFrom:
      resourceFieldRef:
        resource: limits.cpu
  nodeSelector:
    dedicated: logs-instance
  persistence:
    enableStatefulSetAutoDeletePVC: true
    size: 16Gi
    storageClass: csi-cinder-sc-delete
  podAnnotations:
    port: "3100"
    type: loki-backend
  podLabels:
    app: logs
    team: mops
  replicas: 1
  resources:
    limits:
      memory: 1280Mi
    requests:
      cpu: 1
      memory: 1280Mi
  tolerations:
  - effect: NoSchedule
    key: dedicated
    operator: Equal
    value: logs-instance
chunksCache:
  enabled: false
distributor:
  receivers:
    otlp:
      grpc:
        max_recv_msg_size_mib: 60000000
enterprise:
  enabled: false
frontend:
  max_outstanding_per_tenant: 1000
  scheduler_worker_concurrency: 15
fullnameOverride: loki
gateway:
  enabled: true
  ingress:
    enabled: false
  nodeSelector:
    dedicated: logs-instance
  tolerations:
  - effect: NoSchedule
    key: dedicated
    operator: Equal
    value: logs-instance
index:
  in-memory-sorted-index:
    retention_period: 24h
  period: 168h
  prefix: index_
ingress:
  enabled: false
loki:
  auth_enabled: false
  commonConfig:
    replication_factor: 1
  compactor:
    compaction_interval: 10m
    delete_request_cancel_period: 24h
    delete_request_store: azure
    retention_delete_delay: 2h
    retention_delete_worker_count: 150
    retention_enabled: true
    working_directory: /tmp
  configStorageType: Secret
  image:
    pullPolicy: IfNotPresent
    tag: 3.2.1
  ingester:
    chunk_idle_period: 30m
    chunk_target_size: 1572864
    flush_check_period: 15s
    wal:
      replay_memory_ceiling: 1024MB
  limits_config:
    allow_structured_metadata: true
    ingestion_burst_size_mb: 30
    ingestion_rate_mb: 30
    ingestion_rate_strategy: local
    max_cache_freshness_per_query: 10m
    max_chunks_per_query: 100
    max_concurrent_tail_requests: 100
    max_entries_limit_per_query: 1000
    max_global_streams_per_user: 50000
    max_label_names_per_series: 17
    max_line_size_truncate: false
    max_query_parallelism: 128
    max_query_series: 50
    max_streams_matchers_per_query: 100
    per_stream_rate_limit: 5Mb
    per_stream_rate_limit_burst: 20Mb
    reject_old_samples: true
    reject_old_samples_max_age: 168h
    retention_period: 14d
    shard_streams:
      desired_rate: 3000000
      enabled: true
    split_queries_by_interval: 1h
    tsdb_max_bytes_per_shard: 2GB
    tsdb_max_query_parallelism: 2048
  rulerConfig:
    alertmanager_url: SECRET
    enable_alertmanager_v2: true
    enable_api: true
  schemaConfig:
    configs:
    - from: "2024-07-29"
      index:
        period: 24h
        prefix: index_
      object_store: azure
      schema: v13
      store: tsdb
  server:
    grpc_listen_port: 9095
    grpc_server_max_recv_msg_size: 60000000
    grpc_server_max_send_msg_size: 60000000
    http_listen_port: 3100
    http_server_idle_timeout: 600s
    log_level: debug
  storage:
    azure:
      accountKey: CREDENTIALS
      accountName: CREDENTIALS
      requestTimeout: 30s
      useManagedIdentity: false
    bucketNames:
      admin: loki
      chunks: loki
      ruler: loki
    type: azure
  storage_config:
    boltdb_shipper: null
    tsdb_shipper:
      index_gateway_client:
        grpc_client_config:
          connect_timeout: 1s
        log_gateway_requests: true
  tracing:
    enabled: true
lokiCanary:
  enabled: false
minio:
  enabled: false
monitoring:
  dashboards:
    enabled: false
  lokiCanary:
    enabled: false
  rules:
    alerting: false
    enabled: false
  selfMonitoring:
    enabled: false
    grafanaAgent:
      installOperator: false
  serviceMonitor:
    enabled: false
    metricsInstance:
      enabled: false
nameOverride: loki
querier:
  extra_query_delay: 500ms
  frontend_worker:
    grpc_client_config:
      max_send_msg_size: 60000000
  max_concurrent: 6
query_range:
  max_concurrent: 6
  parallelise_shardable_queries: true
  results_cache:
    cache_results: true
    cache_validity: 5m
  split_queries_by_interval: 1h
query_scheduler:
  max_outstanding_requests_per_tenant: 1000
rbac:
  namespaced: true
read:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: projectId
              operator: In
              values:
              - 27edfc6c-eb78-4790-a6c8-ed82a0478f7c
          topologyKey: kubernetes.io/hostname
        weight: 100
    podAntiAffinity: null
  extraEnv:
  - name: GOMEMLIMIT
    valueFrom:
      resourceFieldRef:
        resource: limits.memory
  - name: GOMAXPROCS
    valueFrom:
      resourceFieldRef:
        resource: limits.cpu
  nodeSelector:
    dedicated: logs-instance
  persistence:
    enableStatefulSetAutoDeletePVC: true
    size: 16Gi
    storageClass: csi-cinder-sc-delete
  podAnnotations:
    port: "3100"
    type: loki-read
  podLabels:
    name: dev-test-multiple-change
    scrape: "true"
    projectId: 27edfc6c-eb78-4790-a6c8-ed82a0478f7c
    app: logs
    team: mops
  replicas: 3
  resources:
    limits:
      cpu: 3
      memory: 2560Mi
    requests:
      cpu: 100m
      memory: 1280Mi
  tolerations:
  - effect: NoSchedule
    key: dedicated
    operator: Equal
    value: logs-instance
resultsCache:
  enabled: false
sidecar:
  rules:
    enabled: false
test:
  enabled: false
write:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: projectId
              operator: In
              values:
              - 27edfc6c-eb78-4790-a6c8-ed82a0478f7c
          topologyKey: kubernetes.io/hostname
        weight: 100
    podAntiAffinity: null
  autoscaling:
    behavior:
      scaleDown:
        policies:
        - periodSeconds: 1800
          type: Pods
          value: 1
        stabilizationWindowSeconds: 3600
      scaleUp:
        policies:
        - periodSeconds: 900
          type: Pods
          value: 1
    enabled: true
    maxReplicas: 25
    minReplicas: 1
    targetCPUUtilizationPercentage: 60
    targetMemoryUtilizationPercentage: 80
  extraEnv:
  - name: GOMEMLIMIT
    valueFrom:
      resourceFieldRef:
        resource: limits.memory
  - name: GOMAXPROCS
    valueFrom:
      resourceFieldRef:
        resource: limits.cpu
  nodeSelector:
    dedicated: logs-instance
  persistence:
    enableStatefulSetAutoDeletePVC: true
    size: 16Gi
    storageClass: csi-cinder-sc-delete
  podAnnotations:
    port: "3100"
    type: loki-write
  podLabels:
    scrape: "true"
    projectId: 27edfc6c-eb78-4790-a6c8-ed82a0478f7c
    app: logs
    team: mops
  replicas: 1
  resources:
    limits:
      memory: 2474Mi
    requests:
      cpu: 625m
      memory: 2474Mi
  tolerations:
  - effect: NoSchedule
    key: dedicated
    operator: Equal
    value: logs-instance
- If you have no query problems, redeploy using helm upgrade. Repeat the process multiple times (it can take 1 attempt or it can take 20).
- At some point, you will get a "bad deployment" and one of the read pods will no longer be able to execute queries.
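For completeness, the deploy/redeploy step itself is nothing special. Roughly (a sketch: the release name, namespace and values file name are placeholders; the chart version matches the environment above):

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# the first install and every subsequent redeploy use the same command
helm upgrade --install loki grafana/loki \
  --version 6.10.0 \
  --namespace <namespace> \
  -f values.yaml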
Screenshots, Promtail config, or terminal output
So at first, I thought my problem was linked to the previously mentioned issue, and that it might be related to what some users are facing, since this problem does end up with a 504 Gateway Timeout when it happens. It also produces this error in the nginx pods that back our ingress:
10.171.106.139 - - [28/Nov/2024:20:59:22 +0000] "GET /loki/api/v1/labels?start=1732826612721000000&end=1732827512721000000 HTTP/1.1" 499 0 "-" "Grafana/11.1.3" 3985 49.972 [6723a512e7641cd9c37269ed-loki-read-3100] [] 172.16.6.5:3100 0 49.971 - 8ee94e8c2e23c4261ecbd3e8c037d4bb
In this case, the bad pod is 172.16.6.5 among my 3 read pods. Seeing this error, I tried:
- Upgrading to the latest version of Cilium, now 1.16.4. It did not help.
- Looking at some threads around GitHub, I tried to set the following on my Cilium setup (applied roughly as sketched after this list); it did not help:
  socketLB:
    enabled: true
    terminatePodConnections: true
- I removed the old index config (boltdb-shipper) attached to my Grafana Loki and kept only the tsdb_shipper. It did not help.
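For reference, the socketLB settings mentioned above were applied to Cilium roughly like this (a sketch: the cilium release name, repo alias and kube-system namespace are assumptions about my setup):

helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --reuse-values \
  --set socketLB.enabled=true \
  --set socketLB.terminatePodConnections=true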
I then tried to execute the query directly against the pod. Using port forwarding, I sent the following request to the "tainted" read pod:
http://localhost:50697/loki/api/v1/query_range?direction=backward&end=1732889832897000000&limit=1000&query=%7Bubi_project_id%3D%2227edfc6c-eb78-4790-a6c8-ed82a0478f7c%22%7D+%7C%3D+%60%60&start=1732886232897000000&step=2000ms
This resulted in a request that just returned... nothing. I waited 30 minutes in Postman and only got this log line confirming that the pod had received the query:
level=info ts=2024-11-29T15:58:49.425174467Z caller=roundtrip.go:364 org_id=fake traceID=761b6d90af6ae5c8 msg="executing query" type=range query="{ubi_project_id=\"27edfc6c-eb78-4790-a6c8-ed82a0478f7c\"} |= ``" start=2024-11-29T13:17:12.897Z end=2024-11-29T14:17:12.897Z start_delta=2h41m36.528171657s end_delta=1h41m36.528171836s length=1h0m0s step=2000 query_hash=2028412130
When trying the same request against a different pod from the same Helm release, I get the correct HTTP response, and I can see the Grafana Loki logs telling me the query was a success.
If you query through Grafana, at some point you will hit the bad pod and end up with either an EOF error or a 504 Gateway Timeout. I already posted the nginx error log; see above.
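To check each read pod individually, the procedure is roughly the following (a sketch: the namespace, pod name and time range are placeholders, and the label selector assumes the chart's default app.kubernetes.io/component=read label):

# Find the read pods and their IPs (to match against the nginx log, e.g. 172.16.6.5)
kubectl -n <namespace> get pods -l app.kubernetes.io/component=read -o wide

# Port-forward to one read pod at a time
kubectl -n <namespace> port-forward pod/<loki-read-pod> 3100:3100

# In another shell, send the same query_range request directly to that pod.
# A healthy pod answers quickly; the "tainted" pod only logs "executing query" and never responds.
NOW=$(date +%s)
START=$((NOW - 3600))
curl -sG "http://localhost:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={ubi_project_id="27edfc6c-eb78-4790-a6c8-ed82a0478f7c"} |= ``' \
  --data-urlencode "start=${START}000000000" \
  --data-urlencode "end=${NOW}000000000" \
  --data-urlencode "limit=1000" \
  --data-urlencode "step=2000ms" \
  --data-urlencode "direction=backward"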
How to fix
This is a temporary fix, but the only known workaround is to simply restart the read pods. That's it. If you redo a helm upgrade, there is a chance the same scenario repeats itself. This should not happen, and at this point I am almost certain it is a Loki problem and not a networking problem. That said, if a Grafana Loki dev could help me find a way to check what my read pod is missing or why it is in a bad state, I'll take any suggestion.
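Concretely, the workaround boils down to something like this (a sketch: the namespace is a placeholder, and the selector and resource name assume the chart's default labels with fullnameOverride: loki):

# Delete the read pods so they get recreated; queries work again until a later helm upgrade re-triggers the bad state
kubectl -n <namespace> delete pod -l app.kubernetes.io/component=read

# or roll the whole read target (deployment or statefulset, depending on how the chart deploys it)
kubectl -n <namespace> rollout restart deployment/loki-read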