
Compaction tasks fail with read timeout against overlord intermittently #17849

Open
@zargor


Running compaction in an MM-less setup with 200 task slots, an exception is raised intermittently and results in failing task(s).

In short, compaction tasks fail with a read timeout against the overlord, most noticeably during peak traffic.

Affected Version

v30.0.0

Description

During peak traffic we ingest 6-7M messages per minute across 200+ index_kafka tasks.
Segment granularity: 1H
Compaction task slots: 200
Middle Manager count: 200+
Overlord client conns: druid.global.http.numConnections=500
Coordinator client conns: druid.global.http.numConnections=200

Intermittent error message with which a compaction task ends up failing:

2025-03-27T16:45:21,444 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - TaskMonitor is initialized with estimatedNumSucceededTasks[245]
2025-03-27T16:45:21,444 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Starting taskMonitor
2025-03-27T16:45:21,445 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexPhaseRunner - Submitting initial tasks
2025-03-27T16:45:21,445 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexPhaseRunner - Submitting a new task for spec[coordinator-issued_compact_vlf_chhmnlfh_2025-03-27T15:59:59.432Z_0_0]
2025-03-27T16:47:21,454 INFO [ServiceClientFactory-1] org.apache.druid.rpc.ServiceClientImpl - Service [overlord] request [POST http://100.64.141.171:8088/druid/indexer/v1/task] encountered exception on attempt #1; retrying in 100 ms (org.jboss.netty.handler.timeout.ReadTimeoutException: [POST http://100.64.141.171:8088/druid/indexer/v1/task] Read timed out)
2025-03-27T16:48:22,172 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexPhaseRunner - Cleaning up resources
2025-03-27T16:48:22,172 INFO [task-runner-0-priority-0] org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Stopped taskMonitor

Some debugging

  1. I'm not sure about the cycle of this query, but I wonder whether it somehow affects overlord performance. It is SQL that kicks in periodically against the metadata store (a couple of hypothetical checks follow after this list):
SELECT `payload` FROM `druid_segments` WHERE `used` = TRUE 

avg latency: 51029.51
rows: 1939077.47

  2. Since the issue is a timeout between the compaction task and the overlord, I think I found which timeout is in place.
    With RequestBuilder there is a 2-minute timeout which is not configurable (a rough illustration follows after this list).
    I suppose the overlord proxy handles client requests, and if so, there is no way to tune its configuration either.
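
To gauge whether the periodic segment poll in item 1 is actually what slows the overlord down, here is a hypothetical check run directly against the metadata store (assuming MySQL; only the table and column names come from the query above, everything else is an assumption):

-- Hypothetical: run directly on the metadata store, not through Druid.
EXPLAIN SELECT payload FROM druid_segments WHERE used = TRUE;
-- Rows each poll has to pull (reported above as ~1.9M):
SELECT COUNT(*) FROM druid_segments WHERE used = TRUE;
-- Rough payload volume transferred per poll, in MB:
SELECT SUM(LENGTH(payload)) / 1024 / 1024 AS payload_mb FROM druid_segments WHERE used = TRUE;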
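As a rough illustration of the timeout in item 2, the sketch below uses plain java.net.http (not Druid's internal RequestBuilder/ServiceClient) to show the shape of the task-submission call from the log with an explicit 2-minute request timeout. The overlord URL comes from the log above; the payload, class name, and exact timeout values are assumptions for illustration only.

// Illustrative sketch only; Druid tasks actually go through org.apache.druid.rpc.ServiceClientImpl.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class SubmitTaskSketch {
  public static void main(String[] args) throws Exception {
    HttpClient client = HttpClient.newBuilder()
        .connectTimeout(Duration.ofMillis(500))   // analogous to druid.global.http.clientConnectTimeout
        .build();

    String taskSpecJson = "{...}";                // placeholder for the compaction sub-task spec

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://100.64.141.171:8088/druid/indexer/v1/task"))  // overlord URL from the log
        .timeout(Duration.ofMinutes(2))           // the 2m limit; in Druid this appears to be hard-coded
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(taskSpecJson))
        .build();

    // If the overlord does not answer within 2 minutes, this throws HttpTimeoutException,
    // which is roughly what the ReadTimeoutException in the task log corresponds to.
    HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode() + " " + response.body());
  }
}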

Further questions and thoughts

  1. Should we fork the repo in order to be able to tweak RequestBuilder/ProxyServlet configuration?

  2. Could you suggest some config options we should consider for overlord responsiveness? (It does have sufficient resources: 12-16 CPU / 64 GB memory.)

  3. Going through the HTTP client config options, it seems we cannot really tune the communication between clients (e.g. compaction tasks) and the overlord proxy. We will nevertheless try increasing druid.global.http.clientConnectTimeout, which defaults to 500ms (see the sketch below).
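
A minimal sketch of the runtime.properties change we plan to try; the value is only an example, and as the name suggests this setting appears to cover connection establishment rather than the read phase that actually times out in the log above:

# Hypothetical runtime.properties excerpt on the client side; PT2S is an example value.
druid.global.http.clientConnectTimeout=PT2S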
