-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Set default KillUnusedSegments
duty to coordinator's indexing period & killTaskSlotRatio
to 0.1
#16247
Set default KillUnusedSegments
duty to coordinator's indexing period & killTaskSlotRatio
to 0.1
#16247
Conversation
…dexingPeriod if not set. - Remove the default P1D value for druid.coordinator.kill.period. Instead default druid.coordinator.kill.period to whatever value druid.coordinator.period.indexingPeriod is set to if the former config isn't specified. - If druid.coordinator.kill.period is set, the value will take precedence over druid.coordinator.period.indexingPeriod
server/src/test/java/org/apache/druid/server/coordinator/DruidCoordinatorConfigTest.java
Outdated
Show resolved
Hide resolved
…CoordinatorConfigTest.java
…cubator-druid into revisit_auto_kill_period
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@abhishekrb19 , it makes sense to update default values of configs to improve out-of-the-box Druid behaviour. But I have some concerns.
After this change, the default druid.coordinator.kill.period
effectively changes from 1 day to 30 mins (default value of indexing period). So potentially, there could be a kill
task submitted for every datasource every 30 mins. I wonder if there can be clusters for which this is harmful / wasteful in terms of task resources.
Clusters which need frequent cleanup would have already set a low kill.period
. But now other clusters which didn't want frequent cleanup (due to limited task resources) would have to go and update their configs.
For the linked issue #15911 , have we been able to debug why the kill.period
now seems to be offset by the indexingPeriod
? That seems like a bug rather than an issue with the default value.
server/src/main/java/org/apache/druid/server/coordinator/DruidCoordinatorConfig.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/apache/druid/server/coordinator/duty/KillUnusedSegments.java
Show resolved
Hide resolved
server/src/main/java/org/apache/druid/server/coordinator/duty/KillUnusedSegments.java
Outdated
Show resolved
Hide resolved
…CoordinatorConfig.java Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
@kfaraz - thanks for the swift review! Re:
I was thinking of changing the default Upon reviewing the reasoning in the code here for the current default values of kill task slots, it appears that the unbounded values were chosen to preserve old behavior, which I'd argue was not ideal to start off with. Given that we're changing the kill period's default in this patch, we might as well take an opportunity to adjust adjacent kill configurations to set them to safe values for a better out-of-the-box experience, and highlight the changes in the release notes. What do you think? |
Just as above, would setting the default value of
Yes, there were more details in this community slack thread |
server/src/test/java/org/apache/druid/server/coordinator/DruidCoordinatorConfigTest.java
Show resolved
Hide resolved
server/src/main/java/org/apache/druid/server/coordinator/DruidCoordinatorConfig.java
Show resolved
Hide resolved
Yes, @abhishekrb19 , I think reducing |
KillUnusedSegments
duty to coordinator's indexing period & killTaskSlots
to 0.1
@@ -201,7 +197,7 @@ public void testSerde() throws Exception | |||
1, | |||
2, | |||
whitelist, | |||
1.0, | |||
0.1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This value change from 1.0 to 0.1 is the default change of killTaskSlotRatio
made in this patch.
The test should just use a builder to make it clear, or the assertion should happen in the tests, but that cleanup is for another day.
KillUnusedSegments
duty to coordinator's indexing period & killTaskSlots
to 0.1KillUnusedSegments
duty to coordinator's indexing period & killTaskSlotRatio
to 0.1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Problem:
When
druid.coordinator.kill.period
is set to something other than the coordinator's indexing period, it can cause confusion regarding when the duty will run. Please see #15911 for example. Also, the defaultP1D
may be overly conservative for most users as noted in this comment.Separately, the default value of the dynamic coordinator config
killTaskSlotRatio
was previously1.0
, which means virtually all available slots in a cluster can be used by theKillUnusedSegments
duty, starving any other tasks in the system.Fix:
Fixes #15911. With sufficient guard rails available in the kill duty, the duty can just run regularly at the same time as the coordinator's indexing period.
druid.coordinator.kill.period
is not set explicitly, it will take the value ofdruid.coordinator.period.indexingPeriod
(whether it's the default or an overridden value)druid.coordinator.kill.period
and this will take precedence over the default behavior noted above.killTaskSlotRatio
to 0.1. This means with the default dynamic coordinator config, theKillUnusedSegments
duty will at most take only 10% of the available cluster capacity task slots.We have been successfully running the auto-kill duty in production with
druid.coordinator.kill.period
set todruid.coordinator.period.indexingPeriod
, even atPT15M
. Note that we could deprecatedruid.coordinator.kill.period
, but in the ethos of providing flexibility to operators, this patch doesn't do so.Other related changes
CoordinatorCompactionConfig
DruidCoordinatorConfigTest
Release note
druid.coordinator.kill.period
(if unspecified) has changed fromP1D
to the value ofdruid.coordinator.period.indexingPeriod
. Operators can choose to overridedruid.coordinator.kill.period
and that will take precedence over the default behavior.killTaskSlotRatio
is updated from1.0
to0.1
. This ensures that that kill tasks take up only 1 task slot right out-of-the-box instead of taking up all the task slots.This PR has: