
Do not block task actions on Overlord if segment metadata cache is syncing #17824

Merged · 15 commits merged into apache:master · Mar 27, 2025

Conversation

@kfaraz (Contributor) commented Mar 20, 2025

Description

#17653 introduces a cache for segment metadata on the Overlord.

If the cache is enabled, the leader Overlord must be fully synced with the metadata store
before it can start serving task actions or perform any other metadata operation that uses
the cache.

On clusters with a large number of used segments, the sync can take a long time
(several minutes on a cluster with 2M segments), which can cause task actions to remain blocked
until the leader is fully synced.

This, in turn, leads to ingestion failures and high lag in case of streaming ingestion.

Changes

  • Support 3 modes of reading from the cache: always, never, and only if synced (a minimal decision sketch follows this list)
  • Allow druid.manager.segments.useCache to take the following values:
    • always: Reads are always done from the cache. Overlord service start-up is blocked until at least one
      sync with the metadata store has finished. Transactions are blocked until at least one sync has finished
      after becoming leader.
    • never: Cache is disabled completely
    • ifSynced: Reads are done from the cache only if it has already synced with the metadata store.
      Service start-up and transactions do not wait for the sync to finish. Writes may still go to the cache to reduce sync times.
  • Emit metrics to track read-write, read-only, and write-only transactions
  • Enable IndexerSQLMetadataStorageCoordinatorTest to run in all 3 modes
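
To make the three modes concrete, here is a minimal sketch of how a read could be routed between the cache and the metadata store. The UsageMode name comes from the PR's Javadoc, but CacheReadRouter and shouldReadFromCache are hypothetical helpers for illustration only, not the merged implementation.

```java
// Illustrative only: the UsageMode name appears in the PR's Javadoc, but this
// routing helper is hypothetical, not the actual Druid implementation.
public class CacheReadRouter
{
  public enum UsageMode
  {
    ALWAYS, NEVER, IF_SYNCED
  }

  /**
   * Decides whether a metadata read should be served from the segment metadata
   * cache or go directly to the metadata store.
   *
   * @param mode          configured value of druid.manager.segments.useCache
   * @param cacheIsSynced true if at least one sync with the metadata store has
   *                      finished since this Overlord became leader
   */
  public static boolean shouldReadFromCache(UsageMode mode, boolean cacheIsSynced)
  {
    switch (mode) {
      case ALWAYS:
        // Start-up and transactions already waited for the first sync,
        // so reads can always hit the cache.
        return true;
      case IF_SYNCED:
        // Serve from the cache only once it has synced; otherwise fall back
        // to reading the metadata store directly.
        return cacheIsSynced;
      case NEVER:
      default:
        return false;
    }
  }
}
```

For example, shouldReadFromCache(UsageMode.IF_SYNCED, false) would send the read to the metadata store, while the same call after the first sync returns true and the read is served from the cache.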

Release note

Allow config druid.manager.segments.useCache to take the following values:

  • always: Reads are always done from the cache. Overlord service start-up is blocked until at least one sync
    with the metadata store has finished. Transactions are blocked until at least one sync has finished after
    becoming leader.
  • never: Cache is disabled completely
  • ifSynced: Reads are done from the cache only if it has already synced with the metadata store.
    Service start-up and transactions do not wait for the sync to finish. Writes may still go to the cache to reduce sync times.

Emit new metrics:

  • segment/metadataCache/transactions/readWrite
  • segment/metadataCache/transactions/readOnly
  • segment/metadataCache/transactions/writeOnly

These metrics are currently experimental and thus not documented.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@kfaraz marked this pull request as ready for review on March 21, 2025 06:45
@kfaraz requested a review from gianm on March 21, 2025 06:51
@kfaraz changed the title from "Do not use segment metadata cache until leader has synced" to "Do not block task actions on Overlord if segment metadata cache is syncing" on Mar 21, 2025
@gianm (Contributor) left a comment


I'm deeply nervous about the concept of addressing performance concerns by bypassing the cache when it's not ready. There are two main reasons:

  • It makes debugging more complicated. When debugging an issue in production with metadata operations, there would now be the need to first determine which code path the metadata operation went down: did it use the cache, or did it read directly?
  • The cache itself is meant to address performance issues with directly reading from the metadata store. It seems bad that in some cases, we actually have performance issues with the cache that require us to flip back to reading the metadata store. That will bring back the original performance issues with the metadata store. It feels like pushing the problem around rather than truly solving it.

IMO there is a better way to address first-sync performance issues anyway. Currently, Overlords are eligible for leadership even if they haven't completed their first cache sync yet. Let's change that so Overlords aren't eligible for leadership until their first cache sync finishes. That could be accomplished by having the first sync block a @LifecycleStart method, so it happens as part of the start sequence. This way, during a rolling update, we won't have situations where the leader Overlord is unable to perform metadata operations due to syncing its cache.
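
As a rough illustration of this suggestion, the sketch below blocks a @LifecycleStart method on a latch that is released after the first sync. All class, method, and thread names are hypothetical, the @LifecycleStart import path is assumed from Druid's common lifecycle package, and this is not the code that was merged.

```java
import java.util.concurrent.CountDownLatch;

import org.apache.druid.java.util.common.lifecycle.LifecycleStart;

// Illustrative sketch only: class, method, and thread names are hypothetical.
public class SegmentMetadataCacheLifecycle
{
  private final CountDownLatch firstSyncDone = new CountDownLatch(1);

  @LifecycleStart
  public void start()
  {
    // Kick off the periodic sync with the metadata store (hypothetical helper).
    startSyncThread();
    try {
      // Block the start sequence (and hence leadership eligibility) until the
      // first sync has finished.
      firstSyncDone.await();
    }
    catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException("Interrupted while waiting for first cache sync", e);
    }
  }

  private void startSyncThread()
  {
    final Thread syncThread = new Thread(() -> {
      // ... perform the first full sync with the metadata store here ...
      firstSyncDone.countDown();
      // ... then continue with periodic syncs ...
    }, "segment-metadata-cache-sync");
    syncThread.setDaemon(true);
    syncThread.start();
  }
}
```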

Finally, as to the behavior in this patch, I'm ok with it existing as long as it's an option. That means the cache behavior should be a tri-state runtime option, perhaps with values like this for druid.manager.segments.useCache:

  • true or always: always use cache
  • false or never: never use cache
  • whenReady: use cache when ready, otherwise directly use metadata store

This would give operators the ability to try out always, and if they run into problems with first sync times, switch to whenReady without having to fall all the way back to never.

One day, I'd like for always to be the only option.
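
One possible shape for such a tri-state option, loosely following the @JsonCreator-on-enum pattern this comment points to for OutputChannelMode. The enum and its aliases below are a hedged sketch of the reviewer's suggestion, not the merged code; the values that were eventually merged are always, never, and ifSynced.

```java
import com.fasterxml.jackson.annotation.JsonCreator;

// Hedged sketch of a tri-state value for druid.manager.segments.useCache.
// The boolean aliases and "whenReady" are only the reviewer's suggestion here.
public enum SegmentMetadataCacheUsageMode
{
  ALWAYS,
  NEVER,
  IF_SYNCED;

  @JsonCreator
  public static SegmentMetadataCacheUsageMode fromString(String value)
  {
    switch (value) {
      case "true":
      case "always":
        return ALWAYS;
      case "false":
      case "never":
        return NEVER;
      case "whenReady":
      case "ifSynced":
        return IF_SYNCED;
      default:
        throw new IllegalArgumentException("Unknown cache usage mode: " + value);
    }
  }
}
```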

@kfaraz (Contributor, Author) commented Mar 25, 2025

I see your point, @gianm !
Thanks for the suggestions, I will update this PR accordingly.

@kgyrtkirk added this to the 33.0.0 milestone on Mar 26, 2025
@@ -1175,8 +1175,8 @@ The following properties pertain to segment metadata caching on the Overlord tha

|Property|Description|Default|
|--------|-----------|-------|
|`druid.manager.segments.useCache`|If `true`, segment metadata is cached on the Overlord to speed up metadata operations.|false|
|`druid.manager.segments.pollDuration`|Duration (in ISO 8601 format) between successive syncs of the cache with the metadata store. This property is used only when `druid.manager.segments.useCache` is `true`.|`PT1M` (1 minute)|
|`druid.manager.segments.useCache`|Denotes the usage mode of the segment metadata cache. Possible modes are: (a) `NEVER`: Cache is disable (b) `ALWAYS`: Reads are always done from the cache. Service start-up may be blocked until cache has synced with the metadata store at least once. Transactions may block until cache has synced with the metadata store at least once after becoming leader. (c) `IF_SYNCED`: Reads are done from the cache only if it has already synced with the metadata store. This mode does not block service start-up or transactions.|`NEVER`|
Review comment (Contributor):

Suggested edits:

  • "Cache is disabled" (spelling)
  • "Service start-up will be blocked" (I believe this first sync will always block service startup, so "will" is more accurate)
  • "Transactions will block" (same reason)
  • IMO, most runtime property values in Druid are camelCase, so I'd go with never, always, and ifSynced. For an example of a way to do that with an enum, check out OutputChannelMode.

Reply (Contributor Author):

thanks for the suggestions, fixed.

* Writes can happen even when the sync is in progress. This is because the
* {@link #syncWithMetadataStore()} is able to reconcile differences based on
* the update time of a transaction.
* Only used segments (excluding num_rows and schema_fingerpring) and
Review comment (Contributor):

schema_fingerprint (spelling)

* Non-leader Overlords also keep polling the metadata store to keep the cache
* up-to-date in case leadership changes.
* <p>
* Cache usage modes: {@link UsageMode}:
Review comment (Contributor):

Is this line incomplete? It seems to be missing something.

Reply (Contributor Author):

fixed

* <p>
* Cache usage modes: {@link UsageMode}:
* <p>
* The map {@link #datasourceToSegmentCache} contains the cache for each datasource.
Review comment (Contributor):

This seems undesirable, since it means that in a situation where there are a lot of datasources being cycled through existing and not-existing (perhaps a cluster being used for testing purposes), the Overlord will eventually run out of memory. Could it be solved by making datasourceToSegmentCache a regular map (not a concurrent map) and protecting fetches + cleanups with that lock? With that approach, the outer lock should be held very briefly, just long enough to do the accounting and transfer to the inner lock.
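
A rough sketch of the plain-map-plus-outer-lock idea described above, with hypothetical class names; the reply that follows points out why handing out a cache instance before it is locked still complicates cleanup.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of "regular map guarded by an outer lock".
// DatasourceSegmentCache is a stand-in for the per-datasource cache class.
public class SegmentMetadataCacheRegistry
{
  private final Object lock = new Object();
  private final Map<String, DatasourceSegmentCache> datasourceToSegmentCache = new HashMap<>();

  DatasourceSegmentCache getCacheForDatasource(String dataSource)
  {
    // The outer lock is held only long enough to fetch-or-create the entry;
    // all real work happens under the per-datasource cache's own lock.
    synchronized (lock) {
      return datasourceToSegmentCache.computeIfAbsent(dataSource, ds -> new DatasourceSegmentCache());
    }
  }

  void removeEmptyCaches()
  {
    // Cleanup goes through the same outer lock, so an entry cannot be
    // fetched and removed from the map at the same time.
    synchronized (lock) {
      datasourceToSegmentCache.values().removeIf(DatasourceSegmentCache::isEmptyAndUnused);
    }
  }

  static class DatasourceSegmentCache
  {
    boolean isEmptyAndUnused()
    {
      return true; // placeholder for the real emptiness/usage check
    }
  }
}
```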

@kfaraz (Contributor, Author) commented Mar 27, 2025:

Yeah, I had kind of left it for a later PR. But for the sake of completeness of this feature, we should probably address it sooner rather than later.

I don't think ConcurrentMap is really the problem. In fact, it helps keep the design fairly clean and highly concurrent. The problem is that we have given out a datasource cache instance, but it is yet to be locked by a transaction for read or write. So the cleanup thread doesn't know if a transaction is just about to start working on it, and cannot safely get rid of it.

I had a slightly different approach in mind to solve this problem. It would work with something like permissions.

When a transaction makes a call to getCacheForDatasource(), it gets the cache instance for that datasource. At this point, the permission count of this cache will be incremented.
Then, at some point soon after, this transaction acquires an actual read or write lock on this cache instance and does its thing.
Finally, the transaction releases the read or write lock, which also decrements the permission count.

The cache for a datasource can be completely deleted only when its permission count is 0.

This ensures that if a transaction has gotten an instance of a datasource cache, it cannot be cleaned up even if the cache is not currently locked.

Please let me know what you think. I also feel that maybe this fix should go in a separate PR.
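
A minimal sketch of the permission/reference-count idea described above, assuming hypothetical class and method names: the count is incremented when a transaction obtains the cache, decremented when it finishes, and cleanup removes a cache only when its count is zero.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of the permission/reference-count idea; names are
// hypothetical, not the eventual Druid implementation.
public class RefCountedCacheRegistry
{
  static class DatasourceSegmentCache
  {
    private final AtomicInteger permits = new AtomicInteger(0);

    void acquire()
    {
      permits.incrementAndGet();
    }

    void release()
    {
      permits.decrementAndGet();
    }

    boolean isUnused()
    {
      return permits.get() == 0;
    }
  }

  private final ConcurrentMap<String, DatasourceSegmentCache> datasourceToSegmentCache =
      new ConcurrentHashMap<>();

  DatasourceSegmentCache getCacheForDatasource(String dataSource)
  {
    // Acquire the permit inside compute() so a concurrent cleanup cannot
    // remove the entry between handing it out and the transaction locking it.
    return datasourceToSegmentCache.compute(dataSource, (ds, cache) -> {
      final DatasourceSegmentCache result = (cache == null) ? new DatasourceSegmentCache() : cache;
      result.acquire();
      return result;
    });
  }

  void releaseCache(String dataSource)
  {
    // Called when the transaction releases its read or write lock.
    final DatasourceSegmentCache cache = datasourceToSegmentCache.get(dataSource);
    if (cache != null) {
      cache.release();
    }
  }

  void cleanUpCacheIfUnused(String dataSource)
  {
    // compute() makes the unused check and the removal atomic for this key,
    // so it cannot race with getCacheForDatasource().
    datasourceToSegmentCache.compute(dataSource, (ds, cache) -> {
      if (cache != null && cache.isUnused()) {
        return null; // removing the mapping deletes the cache
      }
      return cache;
    });
  }
}
```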

Review comment (Contributor):

That sort of reference counting approach would work too. Yes, it's OK to do this in a future PR, especially since this area of the code is still evolving and isn't switched on by default yet.

@gianm merged commit b187f3d into apache:master on Mar 27, 2025
75 checks passed