split KillUnusedSegmentsTask to processing in smaller chunks #14642
Conversation
@kfaraz @TSFenwick, looking for input from you both on this; currently it's a draft.
I intend to run this on one of our clusters next week. cc @maytasm.
This makes sense, @jasonk000.
Also, I was looking at the Now the
Since we already have the interval, I wonder if we even need to fire multiple DELETE statements, or if a single one with the right interval filter would suffice. Since we are trying to optimize this code path, I think we should explore that angle as well.
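To make the two options concrete, here is a minimal sketch contrasting per-segment deletes with a single interval-bounded delete. This is not code from this PR, and the table and column names are assumptions based on a typical Druid metadata schema.

```java
// Illustrative only: the druid_segments table and its columns are assumed
// here, not taken from this PR.
public class DeleteStatementSketch
{
  // One DELETE per unused segment id: precise, but a large kill task issues
  // a very large number of statements inside the critical section.
  static String perSegmentDelete()
  {
    return "DELETE FROM druid_segments WHERE id = :id";
  }

  // A single DELETE bounded by datasource, interval, and the used flag:
  // far fewer statements, but it assumes every unused segment that falls
  // inside the interval really should be removed.
  static String intervalBoundedDelete()
  {
    return "DELETE FROM druid_segments"
           + " WHERE dataSource = :dataSource"
           + " AND used = false"
           + " AND start >= :intervalStart"
           + " AND \"end\" <= :intervalEnd";
  }

  public static void main(String[] args)
  {
    System.out.println(perSegmentDelete());
    System.out.println(intervalBoundedDelete());
  }
}
```

The single-statement form is what makes the scope question below matter: deleting by interval is only safe if the caller intends to remove everything unused in that interval.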
Druid has a feature where the Coordinator will periodically submit tasks to kill unused segments (this is enabled by a Coordinator configuration). Just some thoughts I have:
Yes, @maytasm, I agree. With the recent improvements made in #14639 and the automatic Coordinator kill being a thing, I am not sure we would really gain a lot of perf benefit here in most cases. But we cannot prevent users from submitting a kill task like this. In my opinion, the best way forward is to check if we can fire a single DELETE statement. cc: @jasonk000
This makes sense to me, though I'm not entirely clear on the segment rules around this. If we did this, it would change scope a bit: we take in an (unordered) Set of segments, and it's not obvious at this layer that we have all the entries in the Interval, so we might delete unintended files. I would think for best results the API needs to change from taking a set of segments to taking an interval.
Yes - today, if we issue a task to "delete 10 million segments", it will lock up the overlord/cluster. Even though S3 and SQL batching make it much faster, that won't solve the cluster stability problem; it just raises the segment count needed to lock things up. The lock being held too long brings most overlord-guided activity -- including ingestion -- to a halt, since ingestion tasks can't get segments allocated and are forced to wait.
I agree, though we can easily imagine a situation where a user doesn't realise the automatic kill task exists and needs to do a catch-up (me!), or a change in requirements to reduce or change storage configuration means a user no longer needs 24 months of data but only 12. In these situations, a user should be able to issue a task and have the system stay stable.
Could be, it doesn't at the moment. Would you use it in
This is an unresolved problem today: the SQL delete happens inside a transaction which is committed before the S3 deletes are issued. So, if any files fail to delete from S3, the segments are still gone from SQL. I'm not sure if there's a reconciliation process.
Makes sense to me too.
Just to be clear, I don't see this as a "performance" thing, but rather a "stability" thing. Issuing a large kill task currently will block the overlord from operating. With this patch, the overlord will continue operating.
After discussion with @maytasm (thanks!), we think it's best to move the batching up from the segment nuke to the KillUnusedSegmentsTask itself.
@maytasm @kfaraz, how do we feel now about this PR? I've focused the change on the one specific problem area and added some extra comments to explain the code a bit more thoroughly. There's an opportunity that I'm ignoring here for the moment, similar to @kfaraz's suggestion, which is that the retrieve-unused-segments step could limit the number retrieved, or divide the work into smaller intervals and iterate over them in smaller blocks. It looks like RetrieveUnusedSegmentsAction is also used by MoveTask, ArchiveTask, and RestoreTask in addition to KillUnusedSegmentsTask.
Processing in smaller chunks allows the task execution to yield the TaskLockbox lock, which allows the overlord to continue being responsive to other tasks and users while this particular kill task is executing.
Thanks for the changes, @jasonk000!
I have left some comments. Could you please also update the PR description to reflect the latest set of changes?
LGTM.
I have some minor comments but am +1 on the design/approach.
I agree with the overall approach given that we want the change to have a small footprint while reducing the lock burden on Overlord.
I have left some minor comments. +1 after those are addressed.
* provide a longer explanation for kill task batchSize parameter
* add logging details for kill batch progress
* javadoc and field name changes
sample logging output during test
Minor logging and doc suggestions.
better description of the task parameters Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
better description of batch size parameter Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
…indexing/common/task/KillUnusedSegmentsTask.java updated druid logging style Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
…indexing/common/task/KillUnusedSegmentsTask.java updated druid logging style Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
…indexing/common/task/KillUnusedSegmentsTask.java updated druid logging style Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
@jasonk000 - can you please add some release notes to the PR description?
@abhishekagarwal87 added some notes to the description.
Related to #14131, #14634, and #14639.
Description
Split the KillUnusedSegmentsTask to process in smaller batches. The segment nuke runs inside a critical section in the TaskLockbox. This blocks most activity in the overlord, including allocation of new segments, while the nuke is running. If the nuke runs for a long time, it can cause issues with cluster stability.
This change relieves pressure on the TaskLockbox by splitting the nuke work into smaller chunks, which allows other tasks to interleave with the lockbox.
Key changed/added classes in this PR
KillUnusedSegmentsTask: partition the segments into batches of size 10000 for the nuke.
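As a rough illustration of the batching described above, here is a minimal sketch (not the actual KillUnusedSegmentsTask code; the names killInBatches, nukeBatch, and SEGMENT_BATCH_SIZE are hypothetical, and only the batch size of 10000 comes from this description):

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

// Minimal sketch of the batching idea only -- not the actual
// KillUnusedSegmentsTask code. The names killInBatches, nukeBatch and
// SEGMENT_BATCH_SIZE are hypothetical; only the batch size of 10000 comes
// from the description above.
public class BatchedNukeSketch
{
  private static final int SEGMENT_BATCH_SIZE = 10_000;

  static <T> void killInBatches(List<T> unusedSegments, Consumer<List<T>> nukeBatch)
  {
    for (int start = 0; start < unusedSegments.size(); start += SEGMENT_BATCH_SIZE) {
      int end = Math.min(start + SEGMENT_BATCH_SIZE, unusedSegments.size());
      // Each batch is nuked on its own, so the lock can be released between
      // batches and other work (such as segment allocation for ingestion
      // tasks) can interleave.
      nukeBatch.accept(unusedSegments.subList(start, end));
    }
  }

  public static void main(String[] args)
  {
    // 25,000 placeholder segment ids -> expect batches of 10,000, 10,000, 5,000.
    List<String> fakeSegmentIds = Collections.nCopies(25_000, "segment-id");
    killInBatches(fakeSegmentIds, batch -> System.out.println("nuked batch of " + batch.size()));
  }
}
```

The effect is that the TaskLockbox critical section is bounded by the batch size rather than by the total number of unused segments, which is what keeps the overlord responsive during a large kill.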
Release notes
A new optional batchSize parameter has been introduced to the kill task definition, which allows the kill to be performed in batches rather than all at once. This should give better cluster stability when issuing larger and longer-running kill tasks.