Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor and add tests and metric to KillUnusedSegments duty (auto-kill) #15941

Merged
merged 13 commits into from
Feb 27, 2024

Conversation

abhishekrb19
Copy link
Contributor

@abhishekrb19 abhishekrb19 commented Feb 22, 2024

This PR makes a few functional and non-functional changes in the auto-kill KillUnusedSegments duty.

Functional changes:

  • Add a new metric kill/eligibleUnusedSegments/count with dimension dataSource that tracks the number of candidate unused segments per datasource per kill cycle. It's candidate because it's the kill task that ultimately kills the unused segments and the number can change.
  • Move a debug log to info for improved visibility. This should be fine as it gets logged only once per indexing/kill task period.
  • Change the kill duty DruidException validations from user to operator persona as they're configured by operators typically.

Refactoring changes:

  • Remove redundant checks in the code for code readability.
  • Remove all the unnecessary @VisibleForTesting annotations as the test code now verifies the duty directly instead of mocks + various hooks into the code.
  • Finalize some parameters and variables to clarify the intent.

Testing changes:

  • Updated tests from using mocks (the mocks were difficult to use and buggy - for example, last updated timestamp was never used.
  • Fix a bug where TestDruidCoordinatorConfig.DEFAULT_COORDINATOR_KILL_BUFFER_PERIOD was incorrectly set to 1 day instead of 90 days. While at it, I changed all the test coordinator defaults to match the same ISO 8601 period format as DruidCoordinatorConfig, so it's easy to spot check.
  • Add more test coverage for the different config parameters.
  • Add a couple of unit tests that are ignored currently as they fail. The duty doesn't consider segments spanning eternity, or end in DateTimes.MAX (outside the normal Datetime string comparison range): Auto-kill doesn't delete segments outside the range [0000-01-01/10000-01-01) #15951

Motivation

In general, I found it difficult to track the progress of the auto-kill coordinator duty. Also, see some discussion in the community slack: https://apachedruidworkspace.slack.com/archives/C0303FDCZEZ/p1707941058448649.
My preliminary conclusion from looking at a couple of clusters is that the auto-kill defaults are overly conservative and need to be revisited as #15912 notes. I would like to address some of the concerns raised in #15911 and #15912 separately once we have some additional data and tests from this patch.

Release note

Add a new metric kill/eligibleUnusedSegments/count with dimension dataSource that tracks the number of eligible unused segments of a datasource that are eligible for deletion by the coordinator.


Key changed/added classes in this PR
  • KillUnusedSegments.java
  • KillUnusedSegmentsTest.java

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Initial commit with:
- Bug fixes - auto-kill can throw NPE when there are no datasources present and defaults mismatch.
- Add new stat for candidate segment intervals killed.
- Move a couple of debug logs to info logs for improved visibility (should only log once per kill period).
- Remove redundant checks for code readability.
- Updated tests from using mocks (also the mocks weren't using last updated timestamp) and
  add more test coverage for different config parameters.
- Add a couple of unit tests that are ignored for the eternity case to prove that
  the kill duty doesn't clean up segments with ALL grain or that end in DateTimes.MAX.
- Migrate Druid exception from user to operator persona.
@abhishekrb19 abhishekrb19 marked this pull request as draft February 22, 2024 14:13
Copy link
Contributor

@kfaraz kfaraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, the changes make the code more readable. I couldn't find the main NPE bugfix amidst all the refactor. If possible, maybe create two separate PRs - one to add the metric and the bugfix and another for all the refactors?

I have left some comments even though the PR is still in draft, so that changes can be incorporated early.

@abhishekrb19 abhishekrb19 changed the title WIP: Auto-kill bug fixes and improvements Refactor and add tests and metric to auto-kill Feb 23, 2024
@abhishekrb19 abhishekrb19 changed the title Refactor and add tests and metric to auto-kill Refactor and add tests and metric to KillUnusedSegments duty (auto-kill) Feb 23, 2024
@abhishekrb19 abhishekrb19 marked this pull request as ready for review February 23, 2024 04:58
@abhishekrb19
Copy link
Contributor Author

Thanks for the review, @kfaraz! I've addressed your comments and kept the new metric along with other minor functional changes in this patch. But let me know if you'd prefer to keep the changes separate. Re the NPE fix, my bad, it was from existing test code --TestSegmentsMetadataManager was returning null instead of an empty set for retrieveAllDataSourceNames(). Removed that from the PR description.

Copy link
Contributor

@kfaraz kfaraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the prompt response, @abhishekrb19 !
Left a few minor comments, rest of the changes look good.

abhishekrb19 and others added 2 commits February 25, 2024 11:45
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
- Clarify docs on eligibility.
- Add test for multiple segments in the same interval. Clarify comment.
- Remove log line from test.
- Remove lastUpdatedDate = now.plus(10) from test.
@@ -342,7 +342,7 @@ These metrics are for the Druid Coordinator and are reset each time the Coordina
|`killTask/availableSlot/count`| Number of available task slots that can be used for auto kill tasks in the auto kill run. This is the max number of task slots minus any currently running auto kill tasks. | |Varies|
|`killTask/maxSlot/count`| Maximum number of task slots available for auto kill tasks in the auto kill run. | |Varies|
|`kill/task/count`| Number of tasks issued in the auto kill run. | |Varies|
|`kill/candidateUnusedSegments/count`|Number of candidate unused segments to be deleted from the metadata store in an auto kill run.|`dataSource`|Varies|
|`kill/candidateUnusedSegments/count`|The number of candidate unused segments eligible for deletion from the metadata store during an auto kill run for a datasource.|`dataSource`|Varies|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
|`kill/candidateUnusedSegments/count`|The number of candidate unused segments eligible for deletion from the metadata store during an auto kill run for a datasource.|`dataSource`|Varies|
|`kill/eligibleUnusedSegments/count`|The number of unused segments of a datasource that are identified as eligible for deletion from the metadata store by the coordinator.|`dataSource`|Varies|

Copy link
Contributor

@kfaraz kfaraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only 1 minor suggestion regarding docs and metric name. But that change can be done later and need not block this PR. It would trigger the whole CI pipeline all over again.

+1

@abhishekrb19
Copy link
Contributor Author

Thanks, @kfaraz! I will add your rename suggestion in a follow-up patch.

@abhishekrb19 abhishekrb19 merged commit 38ecf98 into apache:master Feb 27, 2024
83 checks passed
@abhishekrb19 abhishekrb19 deleted the auto_kill_enhancements branch February 27, 2024 06:44
abhishekrb19 added a commit to abhishekrb19/incubator-druid that referenced this pull request Feb 27, 2024
abhishekrb19 added a commit to abhishekrb19/incubator-druid that referenced this pull request Feb 27, 2024
abhishekrb19 added a commit that referenced this pull request Feb 27, 2024
… & rename metric (#15977)

* All segments stored in the same batch have the same created_date entry.

In the absence of a group_id column, this metadata would allow us to easily
reason about and troubleshoot ingestion-related issues.

* Rename metric name and code references to eligibleUnusedSegments.

Address review comment from #15941 (comment)
@adarshsanjeev adarshsanjeev added this to the 30.0.0 milestone May 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants