
Compaction tasks fail with read timeout against overlord intermittently #17849

Open
@zargor


Running compaction in an MM-less setup with 200 task slots, an exception is raised intermittently and results in failing task(s).

In short, compaction tasks fail with a read timeout against the overlord, most noticeably during peak traffic.

Affected Version

v30.0.0

Description

During peak traffic we ingest 6-7M messages per minute across 200+ index_kafka tasks.
Segment granularity: 1H
Compaction task slots: 200
Middle Manager count: 200+
Overlord client conns: druid.global.http.numConnections=500
Coordinator client conns: druid.global.http.numConnections=200

Intermittent error message with which a compaction task ends up failing:

2025-03-27T16:45:21,444 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - TaskMonitor is initialized with estimatedNumSucceededTasks[245]
2025-03-27T16:45:21,444 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Starting taskMonitor
2025-03-27T16:45:21,445 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexPhaseRunner - Submitting initial tasks
2025-03-27T16:45:21,445 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexPhaseRunner - Submitting a new task for spec[coordinator-issued_compact_vlf_chhmnlfh_2025-03-27T15:59:59.432Z_0_0]
2025-03-27T16:47:21,454 INFO [ServiceClientFactory-1] org.apache.druid.rpc.ServiceClientImpl - Service [overlord] request [POST http://100.64.141.171:8088/druid/indexer/v1/task] encountered exception on attempt #1; retrying in 100 ms (org.jboss.netty.handler.timeout.ReadTimeoutException: [POST http://100.64.141.171:8088/druid/indexer/v1/task] Read timed out)
2025-03-27T16:48:22,172 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexPhaseRunner - Cleaning up resources
2025-03-27T16:48:22,172 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Stopped taskMonitor

Some debugging

  1. I'm not sure about the cycle of this query, but I wonder whether it somehow affects overlord performance. It is SQL that kicks in periodically against the metadata store (a couple of hypothetical checks follow after this list):
SELECT `payload` FROM `druid_segments` WHERE `used` = TRUE 

avg latency: 51029.51
rows: 1939077.47

  2. Since the issue is a timeout between the compaction task and the overlord, I think I found which timeout is in place.
    With RequestBuilder there is a 2-minute timeout which is not configurable (a rough illustration follows after this list).
    I suppose the overlord proxy handles client requests, and if so, there is no way to tune its configuration either.
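
To gauge whether the periodic segment poll in item 1 is actually what slows the overlord down, here is a hypothetical check run directly against the metadata store (assuming MySQL; only the table and column names come from the query above, everything else is an assumption):

-- Hypothetical: run directly on the metadata store, not through Druid.
EXPLAIN SELECT payload FROM druid_segments WHERE used = TRUE;
-- Rows each poll has to pull (reported above as ~1.9M):
SELECT COUNT(*) FROM druid_segments WHERE used = TRUE;
-- Rough payload volume transferred per poll, in MB:
SELECT SUM(LENGTH(payload)) / 1024 / 1024 AS payload_mb FROM druid_segments WHERE used = TRUE;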
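As a rough illustration of the timeout in item 2, the sketch below uses plain java.net.http (not Druid's internal RequestBuilder/ServiceClient) to show the shape of the task-submission call from the log with an explicit 2-minute request timeout. The overlord URL comes from the log above; the payload, class name, and exact timeout values are assumptions for illustration only.

// Illustrative sketch only; Druid tasks actually go through org.apache.druid.rpc.ServiceClientImpl.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class SubmitTaskSketch {
  public static void main(String[] args) throws Exception {
    HttpClient client = HttpClient.newBuilder()
        .connectTimeout(Duration.ofMillis(500))   // analogous to druid.global.http.clientConnectTimeout
        .build();

    String taskSpecJson = "{...}";                // placeholder for the compaction sub-task spec

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://100.64.141.171:8088/druid/indexer/v1/task"))  // overlord URL from the log
        .timeout(Duration.ofMinutes(2))           // the 2m limit; in Druid this appears to be hard-coded
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(taskSpecJson))
        .build();

    // If the overlord does not answer within 2 minutes, this throws HttpTimeoutException,
    // which is roughly what the ReadTimeoutException in the task log corresponds to.
    HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode() + " " + response.body());
  }
}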

Further questions and thoughts

  1. Should we fork the repo in order to be able to tweak RequestBuilder/ProxyServlet configuration?

  2. Could you suggest some config options we should consider for overlord responsiveness? (It does have sufficient resources: 12-16 CPU / 64 GB memory.)

  3. Going through the HTTP client config options, it seems we cannot really tune the communication between clients (e.g. compaction tasks) and the overlord proxy. We will nevertheless try increasing druid.global.http.clientConnectTimeout, which defaults to 500ms (see the sketch below).
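
A minimal sketch of the runtime.properties change we plan to try; the value is only an example, and as the name suggests this setting appears to cover connection establishment rather than the read phase that actually times out in the log above:

# Hypothetical runtime.properties excerpt on the client side; PT2S is an example value.
druid.global.http.clientConnectTimeout=PT2S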
