Kill Task improvements #15312

Open
renatocron opened this issue Nov 2, 2023 · 0 comments
renatocron commented Nov 2, 2023

Motivation

Druid 28.0.0 arrives with many improvements to its unused-segment sweeper, the Kill Task. I would like to suggest two new features that would improve kill tasks even further.

Proposed changes

Delete rollback plan if task fails

The first change concerns the preparation phase. Currently, after Druid selects the segments that are not in use, it issues the DELETE query to the metadata storage before initiating the deletion from deep storage itself. During this window, if the task fails for any reason, the files remain in deep storage indefinitely, with no database record, necessitating a manual and risky operation for non-Druid specialists to remove them.

My suggestion is that, since many types of metadata storage are supported, instead of issuing the delete query beforehand, we issue it only after the task has finished deleting the files from deep storage. If the new PR 14131 becomes the default behavior, keys that are already missing from storage should not be considered a failure. That situation may have been unexpected before this version, but it would be expected from this fix onwards, and the behavior could still be adjusted for the previous per-key operation.
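To make the intended ordering concrete, here is a minimal sketch of the proposed flow. The helper names (find_unused_segments, delete_from_deep_storage, delete_metadata_rows) are hypothetical placeholders for illustration, not Druid's actual internals:

```python
def run_kill_task(datasource, interval):
    # 1. Select the unused segments, but do NOT touch the metadata store yet.
    segments = find_unused_segments(datasource, interval)

    # 2. Delete the files from deep storage first. A key that is already
    #    missing is treated as success rather than failure (in the spirit of PR 14131).
    for segment in segments:
        delete_from_deep_storage(segment)  # idempotent: a missing file is OK

    # 3. Only after every file is gone, remove the rows from the metadata store.
    #    If the task dies before this point, the rows are still there and the
    #    task can simply be re-run.
    delete_metadata_rows(datasource, [s.id for s in segments])
```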

Other solutions could include a two-phase commit, but that seems like overengineering for this task. A simpler solution, I believe, could be more effective: instead of removing segments immediately, move them to another table with a reference to the task name added. If the task fails, system administrators could easily restore this table back into the segments table and issue the Kill Task again.
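As a rough illustration of that simpler alternative, the rows could be parked in a holding table instead of being deleted outright. The table name druid_segments_pending_kill and the abbreviated column list below are purely hypothetical, not the real metadata schema:

```python
# Hypothetical sketch: park the rows keyed by the kill task id instead of deleting them.
MOVE_TO_HOLDING = """
    INSERT INTO druid_segments_pending_kill (task_id, id, datasource, payload, used)
    SELECT %(task_id)s, id, datasource, payload, used
      FROM druid_segments
     WHERE datasource = %(ds)s AND used = false AND start >= %(start)s AND "end" <= %(end)s
"""

# If the kill task fails, an administrator can restore the rows with a single query
# and issue the Kill Task again.
RESTORE_ON_FAILURE = """
    INSERT INTO druid_segments (id, datasource, payload, used)
    SELECT id, datasource, payload, used
      FROM druid_segments_pending_kill
     WHERE task_id = %(task_id)s
"""
```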

The reason I discovered this was that I initiated a very large kill task (about 100,000 segments), and the router (or perhaps the broker, I'm not certain) timed out. The task failed, but the database had in fact removed all the unused rows for the task period. I had to perform a PITR (Point-In-Time Recovery) and insert the missing rows back, as I thought this was safer than attempting to extract the difference and delete the segments manually from deep storage.

New selection mode

I would like a new selection mode that would delete not all unused segments, but only the overshadowed¹ ones.
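To illustrate what I have in mind, the kill task spec could grow a flag along these lines. killOnlyOvershadowed is a hypothetical name for the proposed option, not an existing Druid setting, and the host and datasource are placeholders:

```python
import requests

# Standard kill task payload, plus the proposed selection flag.
kill_spec = {
    "type": "kill",
    "dataSource": "my_datasource",
    "interval": "2023-01-01/2023-02-01",
    "killOnlyOvershadowed": True,  # proposed: skip unused-but-not-overshadowed segments
}

# Submitted to the Overlord like any other task.
requests.post("http://overlord:8090/druid/indexer/v1/task", json=kill_spec)
```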


I used to have a compaction job that ran daily², which produced many versions of the segments. As they fell out of the loadByPeriod window, issuing a Kill Task would delete both the historical data that I want to preserve and the overshadowed segments left behind by the compaction jobs.

Now, to clean up my historical data and keep both the PostgreSQL database light and the deep storage free of unnecessary data, I need to do the following (a sketch of this workflow follows the list):

  • extend the period of loadByPeriod
  • send a command to mark segments as used by period (e.g. load 3 days, which loads ~9 segments)
  • wait for the historicals to load, just in case! (I think this is unnecessary; completing the mark-as-used query should be enough)
  • issue the kill task for that period (it kills ~1,000 segments and keeps the 9 segments, since they are used)
  • unload, then repeat for another period
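Below is a minimal sketch of that workflow against the Coordinator/Overlord HTTP APIs. The host names, datasource, and interval are placeholders, and the endpoints are the ones I understand to be available around Druid 27/28, so they should be checked against the docs:

```python
import requests

COORDINATOR = "http://coordinator:8081"   # placeholder
OVERLORD = "http://overlord:8090"         # placeholder
DS = "my_datasource"                      # placeholder
INTERVAL = "2023-01-01/2023-01-04"        # the ~3-day window from the example

# 1. Mark the segments in this interval as used again, so the kill task keeps them.
requests.post(
    f"{COORDINATOR}/druid/coordinator/v1/datasources/{DS}/markUsed",
    json={"interval": INTERVAL},
)

# 2. (Ideally just wait for the mark-as-used change to take effect;
#    in practice I also wait for the historicals to load, just in case.)

# 3. Issue the kill task for the same period: it removes the ~1,000 unused
#    segments and keeps the ~9 segments that were marked as used.
requests.post(
    f"{OVERLORD}/druid/indexer/v1/task",
    json={"type": "kill", "dataSource": DS, "interval": INTERVAL},
)

# 4. Mark the interval as unused again, then repeat for the next period.
requests.post(
    f"{COORDINATOR}/druid/coordinator/v1/datasources/{DS}/markUnused",
    json={"interval": INTERVAL},
)
```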

Thank you,

¹: I think this is the correct term for segments that are no longer used because they were compacted and their data was merged into a newer segment.
²: I have since changed the job to run after a longer period, once no late data arrives, so I can be sure to reduce the number of compactions that happen; but this impacts query performance, as I now have many more segments during that ~21-day window.
