split KillUnusedSegmentsTask to processing in smaller chunks #14642
Conversation
@kfaraz @TSFenwick, looking for input from you both on this; currently it's a draft.
I intend to run this on one of our clusters next week. cc @maytasm.
This makes sense, @jasonk000.
Also, I was looking at the Now the
Since we already have the interval, I wonder if we even need to fire multiple DELETE statements, or if a single one with the right interval filter would suffice. Since we are trying to optimize this code path, I think we should explore that angle as well.
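To make the two options concrete, here is a minimal sketch contrasting per-segment deletes with a single interval-bounded delete. This is not code from this PR, and the table and column names are assumptions based on a typical Druid metadata schema.

```java
// Illustrative only: the druid_segments table and its columns are assumed
// here, not taken from this PR.
public class DeleteStatementSketch
{
  // One DELETE per unused segment id: precise, but a large kill task issues
  // a very large number of statements inside the critical section.
  static String perSegmentDelete()
  {
    return "DELETE FROM druid_segments WHERE id = :id";
  }

  // A single DELETE bounded by datasource, interval, and the used flag:
  // far fewer statements, but it assumes every unused segment that falls
  // inside the interval really should be removed.
  static String intervalBoundedDelete()
  {
    return "DELETE FROM druid_segments"
           + " WHERE dataSource = :dataSource"
           + " AND used = false"
           + " AND start >= :intervalStart"
           + " AND \"end\" <= :intervalEnd";
  }

  public static void main(String[] args)
  {
    System.out.println(perSegmentDelete());
    System.out.println(intervalBoundedDelete());
  }
}
```

The single-statement form is what makes the scope question below matter: deleting by interval is only safe if the caller intends to remove everything unused in that interval.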
Druid has a feature where the Coordinator will periodically submit tasks to kill unused segments (this is enabled by a Coordinator configuration). Just some thoughts I have:
Yes, @maytasm, I agree. With the recent improvements made in #14639 and the automatic Coordinator kill being a thing, I am not sure we would really gain a lot of perf benefit here in most cases. But we cannot prevent users from submitting a kill task like this. In my opinion, the best way forward is to check if we can fire a single DELETE statement. cc: @jasonk000
This makes sense to me, though I'm not entirely clear on the segment rules around this. If we did this, it would change scope a bit: we take in an (unordered) Set of segments, and it's not obvious at this layer that we have all the entries in the Interval, so we might delete unintended files. I would think for best results the API needs to change from taking a set of segments to taking an interval.
Yes - today, if we issue a task to "delete 10 million segments", it will lock up the overlord/cluster. Even though S3 and SQL batching make it much faster, that won't solve the cluster stability problem; it just raises the segment count needed to lock things up. The lock being held too long brings most overlord-guided activity -- including ingestion -- to a halt, since ingestion tasks can't get segments allocated and are forced to wait.
I agree, though we can easily imagine a situation where a user doesn't realise the automatic kill task exists and needs to do a catch-up (me!), or a change in requirements to reduce or change storage configuration means a user no longer needs 24 months of data but only 12. In these situations, a user should be able to issue a task and have the system stay stable.
Could be, it doesn't at the moment. Would you use it in
This is an unresolved problem today: the SQL delete happens inside a transaction which is committed before the S3 deletes are issued. So, if any files fail to delete from S3, the segments are still gone from SQL. I'm not sure if there's a reconciliation process.
Makes sense to me too.
Just to be clear, I don't see this as a "performance" thing, but rather a "stability" thing. Issuing a large kill task currently will block the overlord from operating. With this patch, the overlord will continue operating.
After discussion with @maytasm (thanks!), we think it's best to move the batching up from the segment nuke to the KillUnusedSegmentsTask itself.
@maytasm @kfaraz, how do we feel now about this PR? I've focused the change on the one specific problem area and added some extra comments to explain the code a bit more thoroughly. There's an opportunity that I'm ignoring here for the moment, similar to @kfaraz's suggestion, which is that the retrieve-unused-segments step could limit the number retrieved, or divide the work into smaller intervals and iterate over them in smaller blocks. It looks like RetrieveUnusedSegmentsAction is also used by MoveTask, ArchiveTask, and RestoreTask in addition to KillUnusedSegmentsTask.
Processing in smaller chunks allows the task execution to yield the TaskLockbox lock, which allows the overlord to continue being responsive to other tasks and users while this particular kill task is executing.
Thanks for the changes, @jasonk000!
I have left some comments. Could you please also update the PR description to reflect the latest set of changes?
LGTM.
I have some minor comments but am +1 on the design/approach.
I agree with the overall approach given that we want the change to have a small footprint while reducing the lock burden on Overlord.
I have left some minor comments. +1 after those are addressed.
* provide a longer explanation for kill task batchSize parameter
* add logging details for kill batch progress
* javadoc and field name changes
sample logging output during test
Minor logging and doc suggestions.
better description of the task parameters Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
better description of batch size parameter Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
…indexing/common/task/KillUnusedSegmentsTask.java updated druid logging style Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
…indexing/common/task/KillUnusedSegmentsTask.java updated druid logging style Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
…indexing/common/task/KillUnusedSegmentsTask.java updated druid logging style Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
@jasonk000 - can you please add some release notes to the PR description?
@abhishekagarwal87 added some notes to the description.
Related to #14131, #14634, and #14639.
Description
Split the KillUnusedSegmentsTask to process in smaller batches. The segment nuke runs inside a critical section in the TaskLockbox. This blocks most activity in the overlord, including allocation of new segments, while the nuke is running. If the nuke runs for a long time, it can cause issues with cluster stability.
This change relieves pressure on the TaskLockbox by splitting the nuke work into smaller chunks, which allows other tasks to interleave with the lockbox.
Key changed/added classes in this PR
KillUnusedSegmentsTask: partition the segments into batches of size 10000 for the nuke.
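As a rough illustration of the batching described above, here is a minimal sketch (not the actual KillUnusedSegmentsTask code; the names killInBatches, nukeBatch, and SEGMENT_BATCH_SIZE are hypothetical, and only the batch size of 10000 comes from this description):

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

// Minimal sketch of the batching idea only -- not the actual
// KillUnusedSegmentsTask code. The names killInBatches, nukeBatch and
// SEGMENT_BATCH_SIZE are hypothetical; only the batch size of 10000 comes
// from the description above.
public class BatchedNukeSketch
{
  private static final int SEGMENT_BATCH_SIZE = 10_000;

  static <T> void killInBatches(List<T> unusedSegments, Consumer<List<T>> nukeBatch)
  {
    for (int start = 0; start < unusedSegments.size(); start += SEGMENT_BATCH_SIZE) {
      int end = Math.min(start + SEGMENT_BATCH_SIZE, unusedSegments.size());
      // Each batch is nuked on its own, so the lock can be released between
      // batches and other work (such as segment allocation for ingestion
      // tasks) can interleave.
      nukeBatch.accept(unusedSegments.subList(start, end));
    }
  }

  public static void main(String[] args)
  {
    // 25,000 placeholder segment ids -> expect batches of 10,000, 10,000, 5,000.
    List<String> fakeSegmentIds = Collections.nCopies(25_000, "segment-id");
    killInBatches(fakeSegmentIds, batch -> System.out.println("nuked batch of " + batch.size()));
  }
}
```

The effect is that the TaskLockbox critical section is bounded by the batch size rather than by the total number of unused segments, which is what keeps the overlord responsive during a large kill.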
Release notes
A new optional batchSize parameter has been introduced to the kill task definition, which allows the kill to be performed in batches rather than all at once. This should give better cluster stability when issuing larger and longer-running kill tasks.