Description
Running compaction in an MM-less setup with 200 task slots, there is an intermittent exception raised that results in failing task(s).
Compaction tasks fail with a read timeout against the Overlord, usually more noticeable during peak traffic time.
Affected Version
v30.0.0
Description
At peak traffic time we ingest messages at a 6-7M per minute rate with 200+ index_kafka tasks.
Segment granularity: 1H
Compaction task slots: 200
Middle managers count: 200+
Overlord client conns: druid.global.http.numConnections=500
Coordinator client conns: druid.global.http.numConnections=200
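For clarity, a sketch of how those client-connection settings are laid out, assuming they are set in each service's runtime.properties (values as listed above):

# Overlord runtime.properties (assumed location)
druid.global.http.numConnections=500
# Coordinator runtime.properties (assumed location)
druid.global.http.numConnections=200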
Intermittent error message with which a compaction task ends up failing:
2025-03-27T16:45:21,444 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - TaskMonitor is initialized with estimatedNumSucceededTasks[245]
2025-03-27T16:45:21,444 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Starting taskMonitor
2025-03-27T16:45:21,445 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexPhaseRunner - Submitting initial tasks
2025-03-27T16:45:21,445 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexPhaseRunner - Submitting a new task for spec[coordinator-issued_compact_vlf_chhmnlfh_2025-03-27T15:59:59.432Z_0_0]
2025-03-27T16:47:21,454 INFO [ServiceClientFactory-1] org.apache.druid.rpc.ServiceClientImpl - Service [overlord] request [POST http://100.64.141.171:8088/druid/indexer/v1/task] encountered exception on attempt #1; retrying in 100 ms (org.jboss.netty.handler.timeout.ReadTimeoutException: [POST http://100.64.141.171:8088/druid/indexer/v1/task] Read timed out)
2025-03-27T16:48:22,172 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexPhaseRunner - Cleaning up resources
2025-03-27T16:48:22,172 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Stopped taskMonitor
Some debugging
- Although I am not sure about the cycle of this query, I wonder if it somehow affects Overlord performance. It is a SQL query against the metadata store which kicks in periodically (see the sketch after this list):
SELECT `payload` FROM `druid_segments` WHERE `used` = TRUE
avg latency: 51029.51
rows: 1939077.47
- Since the issue is a timeout from the compaction task to the Overlord, I think I found which timeout is in place. With RequestBuilder there is a 2m timeout which is not configurable.
Supposing the Overlord proxy is handling client requests, and if true, there is no possibility to tune its configuration either.
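If that periodic query is the used-segment poll of the segments metadata manager (an assumption on my part about which query this is), its cadence is tunable; a minimal sketch of the relevant Coordinator setting, with an illustrative value:

# runtime.properties (Coordinator) -- illustrative value; documented default is PT1M
druid.manager.segments.pollDuration=PT5M

A longer interval would make that ~2M-row scan run less often, at the cost of segment changes being picked up later; whether this actually relieves Overlord load here is speculation.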
Further questions and thoughts
- Should we fork the repo in order to be able to tweak the RequestBuilder/ProxyServlet configuration?
- Could you suggest some config options that we may take into account regarding Overlord responsiveness? (It has enough resources though: 12-16 CPU / 64G mem.)
- Following the HTTP client config options, it seems we cannot really touch the communication between clients (say, compaction tasks) and the Overlord proxy. Though we'll try to increase druid.global.http.clientConnectTimeout, which defaults to 500ms; see the sketch below this list.
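A minimal sketch of the global HTTP client keys we plan to experiment with; the raised connect timeout is an illustrative guess, and per the observation above none of these keys appear to govern the hard-coded 2m RequestBuilder timeout:

# runtime.properties -- illustrative values, not a verified fix
# connect timeout, assuming a milliseconds value (default 500 ms)
druid.global.http.clientConnectTimeout=2000
# documented default read timeout; already far above the ~2m mark seen in the log
druid.global.http.readTimeout=PT15M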