-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor and add tests and metric to KillUnusedSegments duty (auto-kill) #15941
Refactor and add tests and metric to KillUnusedSegments duty (auto-kill) #15941
Conversation
Initial commit with: - Bug fixes - auto-kill can throw NPE when there are no datasources present and defaults mismatch. - Add new stat for candidate segment intervals killed. - Move a couple of debug logs to info logs for improved visibility (should only log once per kill period). - Remove redundant checks for code readability. - Updated tests from using mocks (also the mocks weren't using last updated timestamp) and add more test coverage for different config parameters. - Add a couple of unit tests that are ignored for the eternity case to prove that the kill duty doesn't clean up segments with ALL grain or that end in DateTimes.MAX. - Migrate Druid exception from user to operator persona.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, the changes make the code more readable. I couldn't find the main NPE bugfix amidst all the refactor. If possible, maybe create two separate PRs - one to add the metric and the bugfix and another for all the refactors?
I have left some comments even though the PR is still in draft, so that changes can be incorporated early.
.../src/test/java/org/apache/druid/server/coordinator/simulate/TestSegmentsMetadataManager.java
Outdated
Show resolved
Hide resolved
.../src/test/java/org/apache/druid/server/coordinator/simulate/TestSegmentsMetadataManager.java
Outdated
Show resolved
Hide resolved
.../src/test/java/org/apache/druid/server/coordinator/simulate/TestSegmentsMetadataManager.java
Outdated
Show resolved
Hide resolved
.../src/test/java/org/apache/druid/server/coordinator/simulate/TestSegmentsMetadataManager.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/apache/druid/server/coordinator/duty/KillUnusedSegments.java
Show resolved
Hide resolved
.../src/test/java/org/apache/druid/server/coordinator/simulate/TestSegmentsMetadataManager.java
Outdated
Show resolved
Hide resolved
.../src/test/java/org/apache/druid/server/coordinator/simulate/TestSegmentsMetadataManager.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/apache/druid/server/coordinator/duty/KillUnusedSegmentsTest.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/apache/druid/server/coordinator/duty/KillUnusedSegmentsTest.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/apache/druid/server/coordinator/duty/KillUnusedSegmentsTest.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/apache/druid/server/coordinator/duty/KillUnusedSegmentsTest.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/apache/druid/server/coordinator/duty/KillUnusedSegmentsTest.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/apache/druid/server/coordinator/duty/KillUnusedSegmentsTest.java
Show resolved
Hide resolved
server/src/test/java/org/apache/druid/server/coordinator/duty/KillUnusedSegmentsTest.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/apache/druid/server/coordinator/duty/KillUnusedSegmentsTest.java
Outdated
Show resolved
Hide resolved
Also, rename to disambiguate.
Thanks for the review, @kfaraz! I've addressed your comments and kept the new metric along with other minor functional changes in this patch. But let me know if you'd prefer to keep the changes separate. Re the NPE fix, my bad, it was from existing test code -- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the prompt response, @abhishekrb19 !
Left a few minor comments, rest of the changes look good.
server/src/main/java/org/apache/druid/server/coordinator/duty/KillUnusedSegments.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/apache/druid/server/coordinator/duty/KillUnusedSegments.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/apache/druid/server/coordinator/duty/KillUnusedSegments.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/apache/druid/server/coordinator/duty/KillUnusedSegments.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/apache/druid/server/coordinator/duty/KillUnusedSegments.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/apache/druid/server/coordinator/duty/KillUnusedSegments.java
Outdated
Show resolved
Hide resolved
.../src/test/java/org/apache/druid/server/coordinator/simulate/TestSegmentsMetadataManager.java
Outdated
Show resolved
Hide resolved
.../src/test/java/org/apache/druid/server/coordinator/simulate/TestSegmentsMetadataManager.java
Outdated
Show resolved
Hide resolved
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
- Clarify docs on eligibility. - Add test for multiple segments in the same interval. Clarify comment. - Remove log line from test. - Remove lastUpdatedDate = now.plus(10) from test.
@@ -342,7 +342,7 @@ These metrics are for the Druid Coordinator and are reset each time the Coordina | |||
|`killTask/availableSlot/count`| Number of available task slots that can be used for auto kill tasks in the auto kill run. This is the max number of task slots minus any currently running auto kill tasks. | |Varies| | |||
|`killTask/maxSlot/count`| Maximum number of task slots available for auto kill tasks in the auto kill run. | |Varies| | |||
|`kill/task/count`| Number of tasks issued in the auto kill run. | |Varies| | |||
|`kill/candidateUnusedSegments/count`|Number of candidate unused segments to be deleted from the metadata store in an auto kill run.|`dataSource`|Varies| | |||
|`kill/candidateUnusedSegments/count`|The number of candidate unused segments eligible for deletion from the metadata store during an auto kill run for a datasource.|`dataSource`|Varies| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|`kill/candidateUnusedSegments/count`|The number of candidate unused segments eligible for deletion from the metadata store during an auto kill run for a datasource.|`dataSource`|Varies| | |
|`kill/eligibleUnusedSegments/count`|The number of unused segments of a datasource that are identified as eligible for deletion from the metadata store by the coordinator.|`dataSource`|Varies| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only 1 minor suggestion regarding docs and metric name. But that change can be done later and need not block this PR. It would trigger the whole CI pipeline all over again.
+1
Thanks, @kfaraz! I will add your rename suggestion in a follow-up patch. |
Address review comment from apache#15941 (comment)
Address review comment from apache#15941 (comment)
… & rename metric (#15977) * All segments stored in the same batch have the same created_date entry. In the absence of a group_id column, this metadata would allow us to easily reason about and troubleshoot ingestion-related issues. * Rename metric name and code references to eligibleUnusedSegments. Address review comment from #15941 (comment)
This PR makes a few functional and non-functional changes in the auto-kill
KillUnusedSegments
duty.Functional changes:
kill/eligibleUnusedSegments/count
with dimensiondataSource
that tracks the number of candidate unused segments per datasource per kill cycle. It's candidate because it's the kill task that ultimately kills the unused segments and the number can change.Refactoring changes:
@VisibleForTesting
annotations as the test code now verifies the duty directly instead of mocks + various hooks into the code.Testing changes:
TestDruidCoordinatorConfig.DEFAULT_COORDINATOR_KILL_BUFFER_PERIOD
was incorrectly set to 1 day instead of 90 days. While at it, I changed all the test coordinator defaults to match the same ISO 8601 period format asDruidCoordinatorConfig
, so it's easy to spot check.DateTimes.MAX
(outside the normal Datetime string comparison range): Auto-kill doesn't delete segments outside the range [0000-01-01/10000-01-01) #15951Motivation
In general, I found it difficult to track the progress of the auto-kill coordinator duty. Also, see some discussion in the community slack: https://apachedruidworkspace.slack.com/archives/C0303FDCZEZ/p1707941058448649.
My preliminary conclusion from looking at a couple of clusters is that the auto-kill defaults are overly conservative and need to be revisited as #15912 notes. I would like to address some of the concerns raised in #15911 and #15912 separately once we have some additional data and tests from this patch.
Release note
Add a new metric
kill/eligibleUnusedSegments/count
with dimensiondataSource
that tracks the number of eligible unused segments of a datasource that are eligible for deletion by the coordinator.Key changed/added classes in this PR
KillUnusedSegments.java
KillUnusedSegmentsTest.java
This PR has: