Motivation
As Druid heralds the arrival of its 28.0.0 version, with many improvements to its garbage sweeper, the Kill Task:
I would like to suggest two new features that would improve kill tasks even further.
Proposed changes
Rollback plan for the metadata delete if the task fails
The first change concerns the preparation phase. Currently, after Druid selects the segments that are not in use, it issues the delete query to the metadata storage before initiating the deletion from deep storage itself. If the task fails for any reason during this window, the files remain in deep storage indefinitely with no database record, and removing them becomes a manual operation that is risky for non-Druid specialists.
My suggestion is that, since many types of metadata storage are supported, instead of issuing the delete query beforehand, we issue it only after the task has finished deleting the files from deep storage. If the behavior from PR 14131 becomes the default, a missing key in deep storage should not be considered a fault: this situation may have been unexpected before, but it would be anticipated from that change onwards, and the behavior could still be adjusted for the previous per-key operation.
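As a minimal sketch of the proposed ordering (pseudocode only; the objects and method names below are hypothetical stand-ins, not Druid's actual classes):

```python
# Pseudocode sketch of the proposed kill-task ordering. The deep_storage and
# metadata_store objects and their methods are hypothetical, not Druid's API.
def run_kill_task(unused_segments, deep_storage, metadata_store):
    for segment in unused_segments:
        # 1. Remove the files from deep storage first. A segment whose files
        #    are already gone (e.g. left over from a previous failed attempt)
        #    is treated as success rather than as a fault.
        deep_storage.delete_if_exists(segment)

    # 2. Only now drop the metadata rows, in one batch. If the task dies
    #    before this point, the rows are still there and the kill task can
    #    simply be re-run for the same interval.
    metadata_store.delete_segment_rows(unused_segments)
```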
Other solutions could include a two-phase commit, but that seems like over-engineering for this task. A simpler solution, I believe, could be more effective: instead of removing the metadata rows immediately, move them to another table with a reference to the kill task's ID added. If the task fails, system administrators could easily restore this table back into the segments table and issue the kill task again.
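To illustrate the restore path, here is a rough sketch assuming PostgreSQL metadata storage and a hypothetical druid_segments_kill_staging table shaped like druid_segments plus a kill_task_id column; the table name, column list, and helper are illustrative and may not match a given Druid version's exact schema:

```python
# Illustrative only: put rows staged by a failed kill task back into
# druid_segments. The staging table and kill_task_id column are hypothetical,
# and the column list may differ between Druid versions.
import psycopg2

def restore_staged_segments(dsn: str, kill_task_id: str) -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO druid_segments
                (id, datasource, created_date, start, "end",
                 partitioned, version, used, payload)
            SELECT id, datasource, created_date, start, "end",
                   partitioned, version, used, payload
            FROM druid_segments_kill_staging
            WHERE kill_task_id = %s
            ON CONFLICT (id) DO NOTHING
            """,
            (kill_task_id,),
        )
        # Once restored, the staged copies are no longer needed.
        cur.execute(
            "DELETE FROM druid_segments_kill_staging WHERE kill_task_id = %s",
            (kill_task_id,),
        )
```

After a restore like this, the kill task could simply be submitted again.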
The reason I discovered this is that I initiated a very large kill task (100,000 segments), and the router (or perhaps the broker, I'm not certain) timed out. The task failed, but the database had in fact removed all the unused rows for the task's period. I had to perform a PITR (Point-In-Time Recovery) and insert the missing rows back, as I judged that safer than trying to extract the difference and delete the segments manually from deep storage.
New selection mode
I would like a new selection mode that would delete not all unused segments, but only the overshadowed¹ ones.
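For illustration only, the new mode could be exposed as an extra flag on the kill task spec; the "overshadowedOnly" name below is my assumption and does not exist in Druid today, and the host and datasource are examples:

```python
# Hypothetical example: submit a kill task that only removes overshadowed,
# unused segments. The "overshadowedOnly" flag is an assumption made for this
# proposal, not an existing Druid option.
import requests

task_spec = {
    "type": "kill",
    "dataSource": "my_datasource",
    "interval": "2023-01-01/2023-02-01",
    "overshadowedOnly": True,  # proposed new selection mode
}

resp = requests.post("http://overlord:8090/druid/indexer/v1/task", json=task_spec)
resp.raise_for_status()
print(resp.json()["task"])  # id of the submitted kill task
```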
I used to have a compaction job that ran daily², and it produced many versions of the segments. As they fell out of the loadByPeriod window, issuing a kill task would remove both the historical data that I want to preserve and the older segment versions left behind by the compaction jobs.
Now, to clean up my historicals and keep both the PostgreSQL metadata database light and the deep storage free of unnecessary data, I need to do the following (a rough automation sketch follows the list):
extend the period of loadByPeriod
send a command to mark segments as used by period (e.g. load 3 days, which loads ~9 segments)
wait for the historicals to load, just in case (I think this is unnecessary; the mark-as-used query completing should be enough)
issue the kill task for this period (it kills ~1,000 segments and keeps the 9 segments, as they are used)
mark the segments unused again and repeat for the next period
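A rough sketch of this workaround, assuming a router at http://router:8888 proxying the standard Coordinator and Overlord APIs; hosts, datasource, intervals, and timings are examples and may need adjusting per cluster and Druid version:

```python
# Rough automation of the workaround described above.
import time
import requests

ROUTER = "http://router:8888"
DATASOURCE = "my_datasource"

def kill_interval_keeping_latest(interval: str) -> None:
    # 1. Mark the segments in the interval as used, so the latest
    #    (non-overshadowed) versions I still want are kept.
    requests.post(
        f"{ROUTER}/druid/coordinator/v1/datasources/{DATASOURCE}/markUsed",
        json={"interval": interval},
    ).raise_for_status()

    # 2. (Probably unnecessary) wait a little for the change to settle.
    time.sleep(30)

    # 3. Issue the kill task; it only deletes segments still marked unused.
    resp = requests.post(
        f"{ROUTER}/druid/indexer/v1/task",
        json={"type": "kill", "dataSource": DATASOURCE, "interval": interval},
    )
    resp.raise_for_status()
    task_id = resp.json()["task"]

    # 4. Wait for the kill task to finish before touching the next interval.
    while True:
        status = requests.get(
            f"{ROUTER}/druid/indexer/v1/task/{task_id}/status"
        ).json()["status"]["statusCode"]
        if status in ("SUCCESS", "FAILED"):
            break
        time.sleep(10)

    # 5. Unload: mark the kept segments unused again so they drop off the
    #    historicals, then repeat the whole cycle for the next interval.
    requests.post(
        f"{ROUTER}/druid/coordinator/v1/datasources/{DATASOURCE}/markUnused",
        json={"interval": interval},
    ).raise_for_status()
```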
Thank you,
¹: I believe this is the correct term for segments that are no longer used because they were compacted and their data was merged into a newer segment.
²: I have since changed it to run after a longer delay, once no more late data arrives, so I can reduce the number of compactions that happen; but this impacts query performance, as I now have many more segments during that period of ~21 days.