Refactor and add tests and metric to KillUnusedSegments duty (auto-kill) #15941

abhishekrb19 · 2024-02-22T14:11:57Z

This PR makes a few functional and non-functional changes in the auto-kill KillUnusedSegments duty.

Functional changes:

Add a new metric kill/eligibleUnusedSegments/count with dimension dataSource that tracks the number of candidate unused segments per datasource per kill cycle. It's candidate because it's the kill task that ultimately kills the unused segments and the number can change.
Move a debug log to info for improved visibility. This should be fine as it gets logged only once per indexing/kill task period.
Change the kill duty DruidException validations from user to operator persona as they're configured by operators typically.

Refactoring changes:

Remove redundant checks in the code for code readability.
Remove all the unnecessary @VisibleForTesting annotations as the test code now verifies the duty directly instead of mocks + various hooks into the code.
Finalize some parameters and variables to clarify the intent.

Testing changes:

Updated tests from using mocks (the mocks were difficult to use and buggy - for example, last updated timestamp was never used.
Fix a bug where TestDruidCoordinatorConfig.DEFAULT_COORDINATOR_KILL_BUFFER_PERIOD was incorrectly set to 1 day instead of 90 days. While at it, I changed all the test coordinator defaults to match the same ISO 8601 period format as DruidCoordinatorConfig, so it's easy to spot check.
Add more test coverage for the different config parameters.
Add a couple of unit tests that are ignored currently as they fail. The duty doesn't consider segments spanning eternity, or end in DateTimes.MAX (outside the normal Datetime string comparison range): Auto-kill doesn't delete segments outside the range [0000-01-01/10000-01-01) #15951

Motivation

In general, I found it difficult to track the progress of the auto-kill coordinator duty. Also, see some discussion in the community slack: https://apachedruidworkspace.slack.com/archives/C0303FDCZEZ/p1707941058448649.
My preliminary conclusion from looking at a couple of clusters is that the auto-kill defaults are overly conservative and need to be revisited as #15912 notes. I would like to address some of the concerns raised in #15911 and #15912 separately once we have some additional data and tests from this patch.

Release note

Add a new metric kill/eligibleUnusedSegments/count with dimension dataSource that tracks the number of eligible unused segments of a datasource that are eligible for deletion by the coordinator.

Key changed/added classes in this PR

KillUnusedSegments.java
KillUnusedSegmentsTest.java

This PR has:

Initial commit with: - Bug fixes - auto-kill can throw NPE when there are no datasources present and defaults mismatch. - Add new stat for candidate segment intervals killed. - Move a couple of debug logs to info logs for improved visibility (should only log once per kill period). - Remove redundant checks for code readability. - Updated tests from using mocks (also the mocks weren't using last updated timestamp) and add more test coverage for different config parameters. - Add a couple of unit tests that are ignored for the eternity case to prove that the kill duty doesn't clean up segments with ALL grain or that end in DateTimes.MAX. - Migrate Druid exception from user to operator persona.

kfaraz

Overall, the changes make the code more readable. I couldn't find the main NPE bugfix amidst all the refactor. If possible, maybe create two separate PRs - one to add the metric and the bugfix and another for all the refactors?

I have left some comments even though the PR is still in draft, so that changes can be incorporated early.

.../src/test/java/org/apache/druid/server/coordinator/simulate/TestSegmentsMetadataManager.java

server/src/main/java/org/apache/druid/server/coordinator/duty/KillUnusedSegments.java

.../src/test/java/org/apache/druid/server/coordinator/simulate/TestSegmentsMetadataManager.java

server/src/test/java/org/apache/druid/server/coordinator/duty/KillUnusedSegmentsTest.java

Also, rename to disambiguate.

abhishekrb19 · 2024-02-23T05:05:51Z

Thanks for the review, @kfaraz! I've addressed your comments and kept the new metric along with other minor functional changes in this patch. But let me know if you'd prefer to keep the changes separate. Re the NPE fix, my bad, it was from existing test code --TestSegmentsMetadataManager was returning null instead of an empty set for retrieveAllDataSourceNames(). Removed that from the PR description.

kfaraz

Thanks for the prompt response, @abhishekrb19 !
Left a few minor comments, rest of the changes look good.

server/src/main/java/org/apache/druid/server/coordinator/duty/KillUnusedSegments.java

docs/operations/metrics.md

.../src/test/java/org/apache/druid/server/coordinator/simulate/TestSegmentsMetadataManager.java

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>

- Clarify docs on eligibility. - Add test for multiple segments in the same interval. Clarify comment. - Remove log line from test. - Remove lastUpdatedDate = now.plus(10) from test.

kfaraz · 2024-02-27T04:48:11Z

docs/operations/metrics.md

@@ -342,7 +342,7 @@ These metrics are for the Druid Coordinator and are reset each time the Coordina
 |`killTask/availableSlot/count`| Number of available task slots that can be used for auto kill tasks in the auto kill run. This is the max number of task slots minus any currently running auto kill tasks.                                                                                                                                                                                                                                                                                                                     | |Varies|
 |`killTask/maxSlot/count`| Maximum number of task slots available for auto kill tasks in the auto kill run.                                                                                                                                                                                                                                                                                                                                                                                                                | |Varies|
 |`kill/task/count`| Number of tasks issued in the auto kill run.                                                                                                                                                                                                                                                                                                                                                                                                                                                    | |Varies|
-|`kill/candidateUnusedSegments/count`|Number of candidate unused segments to be deleted from the metadata store in an auto kill run.|`dataSource`|Varies|
+|`kill/candidateUnusedSegments/count`|The number of candidate unused segments eligible for deletion from the metadata store during an auto kill run for a datasource.|`dataSource`|Varies|


Suggested change

|`kill/candidateUnusedSegments/count`|The number of candidate unused segments eligible for deletion from the metadata store during an auto kill run for a datasource.|`dataSource`|Varies|

|`kill/eligibleUnusedSegments/count`|The number of unused segments of a datasource that are identified as eligible for deletion from the metadata store by the coordinator.|`dataSource`|Varies|

kfaraz

Only 1 minor suggestion regarding docs and metric name. But that change can be done later and need not block this PR. It would trigger the whole CI pipeline all over again.

+1

abhishekrb19 · 2024-02-27T06:44:34Z

Thanks, @kfaraz! I will add your rename suggestion in a follow-up patch.

Address review comment from apache#15941 (comment)

… & rename metric (#15977) * All segments stored in the same batch have the same created_date entry. In the absence of a group_id column, this metadata would allow us to easily reason about and troubleshoot ingestion-related issues. * Rename metric name and code references to eligibleUnusedSegments. Address review comment from #15941 (comment)

abhishekrb19 added the Bug label Feb 22, 2024

abhishekrb19 marked this pull request as draft February 22, 2024 14:13

abhishekrb19 added the WIP label Feb 22, 2024

abhishekrb19 requested review from zachjsh and kfaraz February 22, 2024 14:18