Motivation
As Druid heralds the arrival of its 28.0.0 version, with many improvements to its garbage sweeper, the Kill Task:
I would like to suggest two new features that would improve kill tasks even further.
Proposed changes
Rollback plan for the metadata delete if the task fails
The first change concerns the preparation phase. Currently, after Druid selects the segments that are not in use, it issues the delete query to the metadata storage before initiating the deletion from deep storage itself. If the task fails for any reason during this window, the files remain in deep storage indefinitely with no database record, and removing them becomes a manual operation that is risky for non-Druid specialists.
My suggestion is that, since many types of metadata storage are supported, instead of issuing the delete query beforehand, we issue it only after the task has finished deleting the files from deep storage. If the behavior from PR 14131 becomes the default, a missing key in deep storage should not be considered a fault: this situation may have been unexpected before, but it would be anticipated from that change onwards, and the behavior could still be adjusted for the previous per-key operation.
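As a minimal sketch of the proposed ordering (pseudocode only; the objects and method names below are hypothetical stand-ins, not Druid's actual classes):

```python
# Pseudocode sketch of the proposed kill-task ordering. The deep_storage and
# metadata_store objects and their methods are hypothetical, not Druid's API.
def run_kill_task(unused_segments, deep_storage, metadata_store):
    for segment in unused_segments:
        # 1. Remove the files from deep storage first. A segment whose files
        #    are already gone (e.g. left over from a previous failed attempt)
        #    is treated as success rather than as a fault.
        deep_storage.delete_if_exists(segment)

    # 2. Only now drop the metadata rows, in one batch. If the task dies
    #    before this point, the rows are still there and the kill task can
    #    simply be re-run for the same interval.
    metadata_store.delete_segment_rows(unused_segments)
```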
Other solutions could include a two-phase commit, but that seems like over-engineering for this task. A simpler solution, I believe, could be more effective: instead of removing the metadata rows immediately, move them to another table with a reference to the kill task's ID added. If the task fails, system administrators could easily restore this table back into the segments table and issue the kill task again.
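To illustrate the restore path, here is a rough sketch assuming PostgreSQL metadata storage and a hypothetical druid_segments_kill_staging table shaped like druid_segments plus a kill_task_id column; the table name, column list, and helper are illustrative and may not match a given Druid version's exact schema:

```python
# Illustrative only: put rows staged by a failed kill task back into
# druid_segments. The staging table and kill_task_id column are hypothetical,
# and the column list may differ between Druid versions.
import psycopg2

def restore_staged_segments(dsn: str, kill_task_id: str) -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO druid_segments
                (id, datasource, created_date, start, "end",
                 partitioned, version, used, payload)
            SELECT id, datasource, created_date, start, "end",
                   partitioned, version, used, payload
            FROM druid_segments_kill_staging
            WHERE kill_task_id = %s
            ON CONFLICT (id) DO NOTHING
            """,
            (kill_task_id,),
        )
        # Once restored, the staged copies are no longer needed.
        cur.execute(
            "DELETE FROM druid_segments_kill_staging WHERE kill_task_id = %s",
            (kill_task_id,),
        )
```

After a restore like this, the kill task could simply be submitted again.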
The reason I discovered this is that I initiated a very large kill task (100,000 segments), and the router (or perhaps the broker, I'm not certain) timed out. The task failed, but the database had in fact removed all the unused rows for the task's period. I had to perform a PITR (Point-In-Time Recovery) and insert the missing rows back, as I judged that safer than trying to extract the difference and delete the segments manually from deep storage.
New selection mode
I would like a new selection mode that would delete not all unused segments, but only the overshadowed¹ ones.
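For illustration only, the new mode could be exposed as an extra flag on the kill task spec; the "overshadowedOnly" name below is my assumption and does not exist in Druid today, and the host and datasource are examples:

```python
# Hypothetical example: submit a kill task that only removes overshadowed,
# unused segments. The "overshadowedOnly" flag is an assumption made for this
# proposal, not an existing Druid option.
import requests

task_spec = {
    "type": "kill",
    "dataSource": "my_datasource",
    "interval": "2023-01-01/2023-02-01",
    "overshadowedOnly": True,  # proposed new selection mode
}

resp = requests.post("http://overlord:8090/druid/indexer/v1/task", json=task_spec)
resp.raise_for_status()
print(resp.json()["task"])  # id of the submitted kill task
```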
I used to have a compaction job that ran daily², and it produced many versions of the segments. As they fell out of the loadByPeriod window, issuing a kill task would remove both the historical data that I want to preserve and the older segment versions left behind by the compaction jobs.
Now, to clean up my historicals and keep both the PostgreSQL metadata database light and the deep storage free of unnecessary data, I need to do the following (a rough automation sketch follows the list):
extend the period of loadByPeriod
send a command to mark segments as used by period (e.g. load 3 days, which loads ~9 segments)
wait for the historicals to load, just in case (I think this is unnecessary; the mark-as-used query completing should be enough)
issue the kill task for this period (it kills ~1,000 segments and keeps the 9 segments, as they are used)
mark the segments unused again and repeat for the next period
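A rough sketch of this workaround, assuming a router at http://router:8888 proxying the standard Coordinator and Overlord APIs; hosts, datasource, intervals, and timings are examples and may need adjusting per cluster and Druid version:

```python
# Rough automation of the workaround described above.
import time
import requests

ROUTER = "http://router:8888"
DATASOURCE = "my_datasource"

def kill_interval_keeping_latest(interval: str) -> None:
    # 1. Mark the segments in the interval as used, so the latest
    #    (non-overshadowed) versions I still want are kept.
    requests.post(
        f"{ROUTER}/druid/coordinator/v1/datasources/{DATASOURCE}/markUsed",
        json={"interval": interval},
    ).raise_for_status()

    # 2. (Probably unnecessary) wait a little for the change to settle.
    time.sleep(30)

    # 3. Issue the kill task; it only deletes segments still marked unused.
    resp = requests.post(
        f"{ROUTER}/druid/indexer/v1/task",
        json={"type": "kill", "dataSource": DATASOURCE, "interval": interval},
    )
    resp.raise_for_status()
    task_id = resp.json()["task"]

    # 4. Wait for the kill task to finish before touching the next interval.
    while True:
        status = requests.get(
            f"{ROUTER}/druid/indexer/v1/task/{task_id}/status"
        ).json()["status"]["statusCode"]
        if status in ("SUCCESS", "FAILED"):
            break
        time.sleep(10)

    # 5. Unload: mark the kept segments unused again so they drop off the
    #    historicals, then repeat the whole cycle for the next interval.
    requests.post(
        f"{ROUTER}/druid/coordinator/v1/datasources/{DATASOURCE}/markUnused",
        json={"interval": interval},
    ).raise_for_status()
```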
Thank you,
¹: I believe this is the correct term for segments that are no longer used because they were compacted and their data was merged into a newer segment.
²: I have since changed it to run after a longer delay, once no more late data arrives, so I can reduce the number of compactions that happen; but this impacts query performance, as I now have many more segments during that period of ~21 days.