Do not block task actions on Overlord if segment metadata cache is syncing #17824
Conversation
I'm deeply nervous about the concept of addressing performance concerns by bypassing the cache when it's not ready. There are two main reasons:
- It makes debugging more complicated. When debugging an issue in production with metadata operations, we would now need to first determine which code path the metadata operation went down: did it use the cache or did it read directly?
- The cache itself is meant to address performance issues with directly reading from the metadata store. It seems bad that in some cases, we actually have performance issues with the cache that require us to flip back to reading the metadata store. That will bring back the original performance issues with the metadata store. It feels like pushing the problem around rather than truly solving it.
IMO there is a better way to address first-sync performance issues anyway. Currently, Overlords are eligible for leadership even if they haven't completed their first cache sync yet. Let's change that so Overlords aren't eligible for leadership until their first cache sync finishes. That could be accomplished by having the first sync block in a `@LifecycleStart` method, so it happens as part of the start sequence. This way, during a rolling update, we won't have situations where the leader Overlord is unable to perform metadata operations due to syncing its cache.
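For illustration, a minimal sketch of that idea; `SyncingCache` and `awaitFirstSync()` are hypothetical stand-ins, not existing APIs:

```java
import org.apache.druid.java.util.common.lifecycle.LifecycleStart;

// Hypothetical interface standing in for the segment metadata cache.
interface SyncingCache
{
  // Blocks until the first sync with the metadata store has completed.
  void awaitFirstSync();
}

public class CacheWarmupHandler
{
  private final SyncingCache cache;

  public CacheWarmupHandler(SyncingCache cache)
  {
    this.cache = cache;
  }

  // Because this blocks inside @LifecycleStart, the Overlord does not finish
  // starting up (and so cannot become leader) until the cache is warm.
  @LifecycleStart
  public void start()
  {
    cache.awaitFirstSync();
  }
}
```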
Finally, as to the behavior in this patch, I'm OK with it existing as long as it's an option. That means the cache behavior should be a tri-state runtime option, perhaps with values like this for `druid.manager.segments.useCache`:

- `true` or `always`: always use cache
- `false` or `never`: never use cache
- `whenReady`: use cache when ready, otherwise directly use metadata store

This would give operators the ability to try out `always`, and if they run into problems with first sync times, switch to `whenReady` without having to fall all the way back to `never`.

One day, I'd like for `always` to be the only option.
I see your point, @gianm!
docs/configuration/index.md
Outdated
@@ -1175,8 +1175,8 @@ The following properties pertain to segment metadata caching on the Overlord that

|Property|Description|Default|
|--------|-----------|-------|
|`druid.manager.segments.useCache`|If `true`, segment metadata is cached on the Overlord to speed up metadata operations.|false|
|`druid.manager.segments.pollDuration`|Duration (in ISO 8601 format) between successive syncs of the cache with the metadata store. This property is used only when `druid.manager.segments.useCache` is `true`.|`PT1M` (1 minute)|
|`druid.manager.segments.useCache`|Denotes the usage mode of the segment metadata cache. Possible modes are: (a) `NEVER`: Cache is disable (b) `ALWAYS`: Reads are always done from the cache. Service start-up may be blocked until cache has synced with the metadata store at least once. Transactions may block until cache has synced with the metadata store at least once after becoming leader. (c) `IF_SYNCED`: Reads are done from the cache only if it has already synced with the metadata store. This mode does not block service start-up or transactions.|`NEVER`|
Suggested edits:
- "Cache is disabled" (spelling)
- "Service start-up will be blocked" (I believe this first sync will always block service startup, so "will" is more accurate)
- "Transactions will block" (same reason)
- IMO, most runtime property values in Druid are camelCase, so I'd go with `never`, `always`, and `ifSynced`. For an example of a way to do that with an enum, check out `OutputChannelMode`.
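For reference, a minimal sketch of that enum pattern with camelCase JSON values, loosely modeled on `OutputChannelMode`; the exact names below are illustrative:

```java
import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonValue;

public enum UsageMode
{
  ALWAYS("always"),
  NEVER("never"),
  IF_SYNCED("ifSynced");

  private final String name;

  UsageMode(final String name)
  {
    this.name = name;
  }

  // Accepts the camelCase form when the property is deserialized.
  @JsonCreator
  public static UsageMode fromString(final String name)
  {
    for (final UsageMode mode : values()) {
      if (mode.name.equals(name)) {
        return mode;
      }
    }
    throw new IllegalArgumentException("No such usage mode: " + name);
  }

  // Serializes back to the camelCase form.
  @Override
  @JsonValue
  public String toString()
  {
    return name;
  }
}
```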
thanks for the suggestions, fixed.
 * Writes can happen even when the sync is in progress. This is because the
 * {@link #syncWithMetadataStore()} is able to reconcile differences based on
 * the update time of a transaction.
 * Only used segments (excluding num_rows and schema_fingerpring) and
`schema_fingerprint` (spelling)
 * Non-leader Overlords also keep polling the metadata store to keep the cache
 * up-to-date in case leadership changes.
 * <p>
 * Cache usage modes: {@link UsageMode}:
Is this line incomplete? It seems to be missing something.
fixed
 * <p>
 * Cache usage modes: {@link UsageMode}:
 * <p>
 * The map {@link #datasourceToSegmentCache} contains the cache for each datasource.
This seems undesirable, since it means that in a situation where there are a lot of datasources being cycled through existing and not-existing (perhaps a cluster being used for testing purposes), the OL will eventually run out of memory. Could it be solved by making `datasourceToSegmentCache` a regular map (not a concurrent map) and protecting fetches + cleanups with a lock? With that approach, the outer lock should be held very briefly, just long enough to do the accounting and transfer to the inner lock.
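A rough sketch of that suggestion, with illustrative names (the real map value type and cleanup predicate would differ):

```java
import java.util.HashMap;
import java.util.Map;

class SegmentCacheRegistry
{
  // Outer lock guarding the datasource -> cache map. Held only long enough
  // to fetch or remove an entry; per-datasource work happens under each
  // cache's own inner lock.
  private final Object cacheMapLock = new Object();
  private final Map<String, DatasourceCache> datasourceToSegmentCache = new HashMap<>();

  DatasourceCache getCacheForDatasource(String dataSource)
  {
    synchronized (cacheMapLock) {
      return datasourceToSegmentCache.computeIfAbsent(dataSource, ds -> new DatasourceCache());
    }
  }

  void cleanupUnusedCaches()
  {
    synchronized (cacheMapLock) {
      // Removal is safe with respect to fetches, since both paths hold cacheMapLock.
      datasourceToSegmentCache.values().removeIf(DatasourceCache::isUnused);
    }
  }

  // Placeholder for the per-datasource cache with its own inner lock.
  static class DatasourceCache
  {
    boolean isUnused()
    {
      return true; // Illustrative only.
    }
  }
}
```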
Yeah, I had kind of left it for a later PR. But for the sake of completeness of this feature, we should probably address it sooner rather than later.

I don't think `ConcurrentMap` is really the problem. In fact, it helps keep the design fairly clean and highly concurrent. The problem is that we have given out a datasource cache instance but it is yet to be locked by a transaction for read or write. So the cleanup thread doesn't know if a transaction is just about to start working on it, and cannot safely get rid of it.

I had a slightly different approach in mind to solve this problem. It would work with something like permissions.

- When a transaction makes a call to `getCacheForDatasource()`, it gets the cache instance for that datasource. At this point, the permission count of this cache is incremented.
- Then, at some point soon after, this transaction acquires an actual read or write lock on this cache instance and does its thing.
- Finally, the transaction releases the read or write lock, which also decrements the permission count.
- The cache for a datasource can be completely deleted only when its permission count is 0.

This ensures that if a transaction has gotten an instance of a datasource cache, it cannot be cleaned up even if the cache is not currently locked.
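A rough sketch of this permission-count idea (all names are illustrative, not the actual implementation):

```java
import java.util.concurrent.atomic.AtomicInteger;

class DatasourceCache
{
  private final AtomicInteger permits = new AtomicInteger(0);

  // Called by getCacheForDatasource() before the instance is handed out,
  // marking the cache as claimed even though no read/write lock is held yet.
  void acquirePermit()
  {
    permits.incrementAndGet();
  }

  // Called when the transaction releases its read or write lock on the cache.
  void releasePermit()
  {
    permits.decrementAndGet();
  }

  // The cleanup thread may delete this cache only when no transaction holds
  // it or is about to lock it. This check must happen while holding the
  // map-level lock, so that no new permit can be acquired concurrently.
  boolean isSafeToDelete()
  {
    return permits.get() == 0;
  }
}
```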
Please let me know what you think. I also feel that maybe this fix should go in a separate PR.
That sort of reference counting approach would work too. Yes, it's OK to do this in a future PR, especially since this area of the code is still evolving and isn't switched on by default yet.
Description
#17653 introduces a cache for segment metadata on the Overlord.

If the cache is enabled, the leader Overlord must be fully synced with the metadata store before it can start serving task actions or perform any other metadata operation that uses the cache. On clusters with a large number of used segments, the sync can take a long time (several minutes on a cluster with 2M segments), which can cause task actions to remain blocked until the leader is fully synced. This, in turn, leads to ingestion failures and high lag in the case of streaming ingestion.
Changes

- Allow config `druid.manager.segments.useCache` to take the following values:
  - `always`: Reads are always done from cache. Overlord service start-up is blocked until at least one sync with the metadata store has finished. Transactions are blocked until at least one sync has finished after becoming leader.
  - `never`: Cache is disabled completely.
  - `ifSynced`: Reads are done from cache only if the cache has already synced with the metadata store. Service start-up and transactions do not wait for sync to finish. Writes may still go to the cache to reduce sync times.
- Updated `IndexerSQLMetadataStorageCoordinatorTest` to run in all 3 modes.
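For example, an operator who wants reads to use the cache without ever blocking task actions could set something like this in the Overlord's `runtime.properties` (values shown are only an example):

```properties
# Use the segment metadata cache only once it has synced; never block
# service start-up or task actions on the first sync.
druid.manager.segments.useCache=ifSynced
# Sync the cache with the metadata store every minute.
druid.manager.segments.pollDuration=PT1M
```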
Release note
Allow config `druid.manager.segments.useCache` to take the following values:

- `always`: Reads are always done from cache. Overlord service start-up is blocked until at least one sync with the metadata store has finished. Transactions are blocked until at least one sync has finished after becoming leader.
- `never`: Cache is disabled completely.
- `ifSynced`: Reads are done from cache only if the cache has already synced with the metadata store. Service start-up and transactions do not wait for sync to finish. Writes may still go to the cache to reduce sync times.
Emit new metrics:
- `segment/metadataCache/transactions/readWrite`
- `segment/metadataCache/transactions/readOnly`
- `segment/metadataCache/transactions/writeOnly`
These metrics are currently experimental and thus not documented.