
Do not block task actions on Overlord if segment metadata cache is syncing #17824

Merged · 15 commits merged into apache:master · Mar 27, 2025

Conversation

@kfaraz (Contributor) commented Mar 20, 2025

Description

#17653 introduces a cache for segment metadata on the Overlord.

If the cache is enabled, the leader Overlord must be fully synced with the metadata store
before it can start serving task actions or perform any other metadata operation that uses
the cache.

On clusters with a large number of used segments, the sync can take a long time
(several minutes on a cluster with 2M segments), which can cause task actions to remain blocked
until the leader is fully synced.

This, in turn, leads to ingestion failures and high lag in case of streaming ingestion.

Changes

  • Support 3 modes of reading from the cache: always, never, and only if synced (a minimal decision sketch follows this list)
  • Allow druid.manager.segments.useCache to take the following values:
    • always: Reads are always done from the cache. Overlord service start-up is blocked until at least one
      sync with the metadata store has finished. Transactions are blocked until at least one sync has finished
      after becoming leader.
    • never: Cache is disabled completely
    • ifSynced: Reads are done from the cache only if it has already synced with the metadata store.
      Service start-up and transactions do not wait for the sync to finish. Writes may still go to the cache to reduce sync times.
  • Emit metrics to track read-write, read-only, and write-only transactions
  • Enable IndexerSQLMetadataStorageCoordinatorTest to run in all 3 modes
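
To make the three modes concrete, here is a minimal sketch of how a read could be routed between the cache and the metadata store. The UsageMode name comes from the PR's Javadoc, but CacheReadRouter and shouldReadFromCache are hypothetical helpers for illustration only, not the merged implementation.

```java
// Illustrative only: the UsageMode name appears in the PR's Javadoc, but this
// routing helper is hypothetical, not the actual Druid implementation.
public class CacheReadRouter
{
  public enum UsageMode
  {
    ALWAYS, NEVER, IF_SYNCED
  }

  /**
   * Decides whether a metadata read should be served from the segment metadata
   * cache or go directly to the metadata store.
   *
   * @param mode          configured value of druid.manager.segments.useCache
   * @param cacheIsSynced true if at least one sync with the metadata store has
   *                      finished since this Overlord became leader
   */
  public static boolean shouldReadFromCache(UsageMode mode, boolean cacheIsSynced)
  {
    switch (mode) {
      case ALWAYS:
        // Start-up and transactions already waited for the first sync,
        // so reads can always hit the cache.
        return true;
      case IF_SYNCED:
        // Serve from the cache only once it has synced; otherwise fall back
        // to reading the metadata store directly.
        return cacheIsSynced;
      case NEVER:
      default:
        return false;
    }
  }
}
```

For example, shouldReadFromCache(UsageMode.IF_SYNCED, false) would send the read to the metadata store, while the same call after the first sync returns true and the read is served from the cache.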

Release note

Allow config druid.manager.segments.useCache to take the following values:

  • always: Reads are always done from the cache. Overlord service start-up is blocked until at least one sync
    with the metadata store has finished. Transactions are blocked until at least one sync has finished after
    becoming leader.
  • never: Cache is disabled completely
  • ifSynced: Reads are done from the cache only if it has already synced with the metadata store.
    Service start-up and transactions do not wait for the sync to finish. Writes may still go to the cache to reduce sync times.

Emit new metrics:

  • segment/metadataCache/transactions/readWrite
  • segment/metadataCache/transactions/readOnly
  • segment/metadataCache/transactions/writeOnly

These metrics are currently experimental and thus not documented.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@kfaraz marked this pull request as ready for review on March 21, 2025 06:45
@kfaraz requested a review from gianm on March 21, 2025 06:51
@kfaraz changed the title from "Do not use segment metadata cache until leader has synced" to "Do not block task actions on Overlord if segment metadata cache is syncing" on Mar 21, 2025
@gianm (Contributor) left a comment


I'm deeply nervous about the concept of addressing performance concerns by bypassing the cache when it's not ready. There are two main reasons:

  • It makes debugging more complicated. When debugging an issue in production with metadata operations, there would now be the need to first determine which code path the metadata operation went down: did it use the cache, or did it read directly?
  • The cache itself is meant to address performance issues with directly reading from the metadata store. It seems bad that in some cases, we actually have performance issues with the cache that require us to flip back to reading the metadata store. That will bring back the original performance issues with the metadata store. It feels like pushing the problem around rather than truly solving it.

IMO there is a better way to address first-sync performance issues anyway. Currently, Overlords are eligible for leadership even if they haven't completed their first cache sync yet. Let's change that so Overlords aren't eligible for leadership until their first cache sync finishes. That could be accomplished by having the first sync block a @LifecycleStart method, so it happens as part of the start sequence. This way, during a rolling update, we won't have situations where the leader Overlord is unable to perform metadata operations due to syncing its cache.
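
As a rough illustration of this suggestion, the sketch below blocks a @LifecycleStart method on a latch that is released after the first sync. All class, method, and thread names are hypothetical, the @LifecycleStart import path is assumed from Druid's common lifecycle package, and this is not the code that was merged.

```java
import java.util.concurrent.CountDownLatch;

import org.apache.druid.java.util.common.lifecycle.LifecycleStart;

// Illustrative sketch only: class, method, and thread names are hypothetical.
public class SegmentMetadataCacheLifecycle
{
  private final CountDownLatch firstSyncDone = new CountDownLatch(1);

  @LifecycleStart
  public void start()
  {
    // Kick off the periodic sync with the metadata store (hypothetical helper).
    startSyncThread();
    try {
      // Block the start sequence (and hence leadership eligibility) until the
      // first sync has finished.
      firstSyncDone.await();
    }
    catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException("Interrupted while waiting for first cache sync", e);
    }
  }

  private void startSyncThread()
  {
    final Thread syncThread = new Thread(() -> {
      // ... perform the first full sync with the metadata store here ...
      firstSyncDone.countDown();
      // ... then continue with periodic syncs ...
    }, "segment-metadata-cache-sync");
    syncThread.setDaemon(true);
    syncThread.start();
  }
}
```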

Finally, as to the behavior in this patch, I'm ok with it existing as long as it's an option. That means the cache behavior should be a tri-state runtime option, perhaps with values like this for druid.manager.segments.useCache:

  • true or always: always use cache
  • false or never: never use cache
  • whenReady: use cache when ready, otherwise directly use metadata store

This would give operators the ability to try out always, and if they run into problems with first sync times, switch to whenReady without having to fall all the way back to never.

One day, I'd like for always to be the only option.
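
One possible shape for such a tri-state option, loosely following the @JsonCreator-on-enum pattern this comment points to for OutputChannelMode. The enum and its aliases below are a hedged sketch of the reviewer's suggestion, not the merged code; the values that were eventually merged are always, never, and ifSynced.

```java
import com.fasterxml.jackson.annotation.JsonCreator;

// Hedged sketch of a tri-state value for druid.manager.segments.useCache.
// The boolean aliases and "whenReady" are only the reviewer's suggestion here.
public enum SegmentMetadataCacheUsageMode
{
  ALWAYS,
  NEVER,
  IF_SYNCED;

  @JsonCreator
  public static SegmentMetadataCacheUsageMode fromString(String value)
  {
    switch (value) {
      case "true":
      case "always":
        return ALWAYS;
      case "false":
      case "never":
        return NEVER;
      case "whenReady":
      case "ifSynced":
        return IF_SYNCED;
      default:
        throw new IllegalArgumentException("Unknown cache usage mode: " + value);
    }
  }
}
```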

@kfaraz (Contributor, Author) commented Mar 25, 2025

I see your point, @gianm !
Thanks for the suggestions, I will update this PR accordingly.

@kgyrtkirk added this to the 33.0.0 milestone on Mar 26, 2025
@@ -1175,8 +1175,8 @@ The following properties pertain to segment metadata caching on the Overlord tha

|Property|Description|Default|
|--------|-----------|-------|
|`druid.manager.segments.useCache`|If `true`, segment metadata is cached on the Overlord to speed up metadata operations.|false|
|`druid.manager.segments.pollDuration`|Duration (in ISO 8601 format) between successive syncs of the cache with the metadata store. This property is used only when `druid.manager.segments.useCache` is `true`.|`PT1M` (1 minute)|
|`druid.manager.segments.useCache`|Denotes the usage mode of the segment metadata cache. Possible modes are: (a) `NEVER`: Cache is disable (b) `ALWAYS`: Reads are always done from the cache. Service start-up may be blocked until cache has synced with the metadata store at least once. Transactions may block until cache has synced with the metadata store at least once after becoming leader. (c) `IF_SYNCED`: Reads are done from the cache only if it has already synced with the metadata store. This mode does not block service start-up or transactions.|`NEVER`|
Review comment (Contributor):

Suggested edits:

  • "Cache is disabled" (spelling)
  • "Service start-up will be blocked" (I believe this first sync will always block service startup, so "will" is more accurate)
  • "Transactions will block" (same reason)
  • IMO, most runtime property values in Druid are camelCase, so I'd go with never, always, and ifSynced. For an example of a way to do that with an enum, check out OutputChannelMode.

Reply (Contributor Author):

thanks for the suggestions, fixed.

* Writes can happen even when the sync is in progress. This is because the
* {@link #syncWithMetadataStore()} is able to reconcile differences based on
* the update time of a transaction.
* Only used segments (excluding num_rows and schema_fingerpring) and
Review comment (Contributor):

schema_fingerprint (spelling)

* Non-leader Overlords also keep polling the metadata store to keep the cache
* up-to-date in case leadership changes.
* <p>
* Cache usage modes: {@link UsageMode}:
Review comment (Contributor):

Is this line incomplete? It seems to be missing something.

Reply (Contributor Author):

fixed

* <p>
* Cache usage modes: {@link UsageMode}:
* <p>
* The map {@link #datasourceToSegmentCache} contains the cache for each datasource.
Review comment (Contributor):

This seems undesirable, since it means that in a situation where there are a lot of datasources being cycled through existing and not-existing (perhaps a cluster being used for testing purposes), the Overlord will eventually run out of memory. Could it be solved by making datasourceToSegmentCache a regular map (not a concurrent map) and protecting fetches + cleanups with that lock? With that approach, the outer lock should be held very briefly, just long enough to do the accounting and transfer to the inner lock.
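
A rough sketch of the plain-map-plus-outer-lock idea described above, with hypothetical class names; the reply that follows points out why handing out a cache instance before it is locked still complicates cleanup.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of "regular map guarded by an outer lock".
// DatasourceSegmentCache is a stand-in for the per-datasource cache class.
public class SegmentMetadataCacheRegistry
{
  private final Object lock = new Object();
  private final Map<String, DatasourceSegmentCache> datasourceToSegmentCache = new HashMap<>();

  DatasourceSegmentCache getCacheForDatasource(String dataSource)
  {
    // The outer lock is held only long enough to fetch-or-create the entry;
    // all real work happens under the per-datasource cache's own lock.
    synchronized (lock) {
      return datasourceToSegmentCache.computeIfAbsent(dataSource, ds -> new DatasourceSegmentCache());
    }
  }

  void removeEmptyCaches()
  {
    // Cleanup goes through the same outer lock, so an entry cannot be
    // fetched and removed from the map at the same time.
    synchronized (lock) {
      datasourceToSegmentCache.values().removeIf(DatasourceSegmentCache::isEmptyAndUnused);
    }
  }

  static class DatasourceSegmentCache
  {
    boolean isEmptyAndUnused()
    {
      return true; // placeholder for the real emptiness/usage check
    }
  }
}
```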

@kfaraz (Contributor, Author) commented Mar 27, 2025:

Yeah, I had kind of left it for a later PR. But for the sake of completeness of this feature, we should probably address it sooner rather than later.

I don't think ConcurrentMap is really the problem. In fact, it helps keep the design fairly clean and highly concurrent. The problem is that we have given out a datasource cache instance, but it is yet to be locked by a transaction for read or write. So the cleanup thread doesn't know if a transaction is just about to start working on it, and cannot safely get rid of it.

I had a slightly different approach in mind to solve this problem. It would work with something like permissions.

When a transaction makes a call to getCacheForDatasource(), it gets the cache instance for that datasource. At this point, the permission count of this cache will be incremented.
Then, at some point soon after, this transaction acquires an actual read or write lock on this cache instance and does its thing.
Finally, the transaction releases the read or write lock, which also decrements the permission count.

The cache for a datasource can be completely deleted only when its permission count is 0.

This ensures that if a transaction has gotten an instance of a datasource cache, it cannot be cleaned up even if the cache is not currently locked.

Please let me know what you think. I also feel that maybe this fix should go in a separate PR.
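
A minimal sketch of the permission/reference-count idea described above, assuming hypothetical class and method names: the count is incremented when a transaction obtains the cache, decremented when it finishes, and cleanup removes a cache only when its count is zero.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of the permission/reference-count idea; names are
// hypothetical, not the eventual Druid implementation.
public class RefCountedCacheRegistry
{
  static class DatasourceSegmentCache
  {
    private final AtomicInteger permits = new AtomicInteger(0);

    void acquire()
    {
      permits.incrementAndGet();
    }

    void release()
    {
      permits.decrementAndGet();
    }

    boolean isUnused()
    {
      return permits.get() == 0;
    }
  }

  private final ConcurrentMap<String, DatasourceSegmentCache> datasourceToSegmentCache =
      new ConcurrentHashMap<>();

  DatasourceSegmentCache getCacheForDatasource(String dataSource)
  {
    // Acquire the permit inside compute() so a concurrent cleanup cannot
    // remove the entry between handing it out and the transaction locking it.
    return datasourceToSegmentCache.compute(dataSource, (ds, cache) -> {
      final DatasourceSegmentCache result = (cache == null) ? new DatasourceSegmentCache() : cache;
      result.acquire();
      return result;
    });
  }

  void releaseCache(String dataSource)
  {
    // Called when the transaction releases its read or write lock.
    final DatasourceSegmentCache cache = datasourceToSegmentCache.get(dataSource);
    if (cache != null) {
      cache.release();
    }
  }

  void cleanUpCacheIfUnused(String dataSource)
  {
    // compute() makes the unused check and the removal atomic for this key,
    // so it cannot race with getCacheForDatasource().
    datasourceToSegmentCache.compute(dataSource, (ds, cache) -> {
      if (cache != null && cache.isUnused()) {
        return null; // removing the mapping deletes the cache
      }
      return cache;
    });
  }
}
```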

Review comment (Contributor):

That sort of reference counting approach would work too. Yes, it's OK to do this in a future PR, especially since this area of the code is still evolving and isn't switched on by default yet.

@gianm merged commit b187f3d into apache:master on Mar 27, 2025
75 checks passed