CSPL-3354: REBASED Add Lifecycle Hooks and Configurable Termination Grace Period to Splunk Operator #1450

Open: wants to merge 19 commits into base: develop


Collaborator

@patrykw-splunk patrykw-splunk commented Feb 21, 2025

Overview

This Pull Request introduces enhancements to the Splunk Operator by integrating Lifecycle Hooks and allowing customers to configure the Termination Grace Period via the Custom Resource (Common Spec). These changes aim to ensure graceful shutdowns of Splunk pods, thereby maintaining data integrity and improving the reliability of Splunk deployments on Kubernetes.

Problem Statement

Customers running Splunk on Kubernetes have reported issues related to abrupt pod terminations, especially during node recycling or maintenance operations. Without proper shutdown procedures, Splunk instances may not decommission gracefully, leading to potential data loss and increased operational churn. Additionally, the lack of configurable grace periods limits customers' ability to tailor shutdown behaviors to their specific environments and requirements.

Proposed Solution

  1. Integrate Lifecycle Hooks:

    • preStop Hook: Executes splunk offline and splunk stop commands before the pod is terminated. This ensures that Splunk instances decommission gracefully, preventing data corruption and loss.
  2. Configurable Termination Grace Period:

    • Custom Resource Update: Introduce a new field in the Common Spec of the Splunk Operator’s Custom Resource to allow customers to specify terminationGracePeriodSeconds.
    • Default Value: If not specified by the customer, a sensible default (e.g., 60 seconds) is applied to ensure sufficient time for graceful shutdowns.
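
The defaulting behaviour described above could look roughly like the following. This is a minimal sketch, not the operator's actual code: the `CommonSpec` type, the constant name, and both helper functions are hypothetical stand-ins, and falling back to the default on a negative value is an assumption.

```go
package main

import "encoding/json"

// defaultTerminationGracePeriodSeconds mirrors the 60-second example default
// mentioned in this description. The constant name is hypothetical.
const defaultTerminationGracePeriodSeconds int64 = 60

// CommonSpec is a hypothetical, trimmed-down stand-in for the operator's
// Common Spec. A pointer distinguishes "field omitted" from an explicit value.
type CommonSpec struct {
	TerminationGracePeriodSeconds *int64 `json:"terminationGracePeriodSeconds,omitempty"`
}

// effectiveGracePeriod returns the customer-supplied value when it is set and
// non-negative, and the default otherwise (the negative-value fallback is an
// assumption). An explicit 0 is passed through unchanged.
func effectiveGracePeriod(spec CommonSpec) int64 {
	if spec.TerminationGracePeriodSeconds == nil || *spec.TerminationGracePeriodSeconds < 0 {
		return defaultTerminationGracePeriodSeconds
	}
	return *spec.TerminationGracePeriodSeconds
}

// parseCommonSpec decodes the relevant part of a Custom Resource spec from JSON.
func parseCommonSpec(raw []byte) (CommonSpec, error) {
	var spec CommonSpec
	err := json.Unmarshal(raw, &spec)
	return spec, err
}
```

Using a pointer field keeps existing Custom Resources, which omit the field entirely, on the default without any migration.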

Changes Made

  • Custom Resource Definition:

    • Added terminationGracePeriodSeconds under the commonSpec section to allow customization.
    apiVersion: enterprise.splunk.com/v4
    kind: IndexerCluster
    metadata:
      name: indexer-splunk
    spec:
      terminationGracePeriodSeconds: 120 # Customizable grace period in seconds
      # ... other common specifications
      # ... other cluster specifications
  • StatefulSet Template Update:

    • Modified the StatefulSet templates generated by the Splunk Operator to include the lifecycle section with the preStop hook.
    • Incorporated the terminationGracePeriodSeconds value from the Common Spec.
    spec:
      terminationGracePeriodSeconds: {{ .Spec.TerminationGracePeriodSeconds | default 60 }}
      containers:
        - name: splunk
          image: splunk/splunk:latest
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "splunk offline && splunk stop"]
          # ... other container configurations
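
As a rough illustration of the StatefulSet change, the sketch below sets the grace period and attaches the preStop hook to a pod spec. To stay self-contained it uses local types that only mirror the shape of the Kubernetes `PodSpec`, `Container`, `Lifecycle`, `LifecycleHandler`, and `ExecAction` fields; the function name `applyGracefulShutdown` is hypothetical and is not taken from the operator.

```go
package main

// The types below are local stand-ins that mirror the field layout of the
// corresponding Kubernetes core/v1 types; they are not the real API types.
type ExecAction struct {
	Command []string `json:"command"`
}

type LifecycleHandler struct {
	Exec *ExecAction `json:"exec,omitempty"`
}

type Lifecycle struct {
	PreStop *LifecycleHandler `json:"preStop,omitempty"`
}

type Container struct {
	Name      string     `json:"name"`
	Lifecycle *Lifecycle `json:"lifecycle,omitempty"`
}

type PodSpec struct {
	TerminationGracePeriodSeconds *int64      `json:"terminationGracePeriodSeconds,omitempty"`
	Containers                    []Container `json:"containers"`
}

// applyGracefulShutdown (hypothetical name) sets the pod-level grace period
// and attaches the preStop hook from this description to the named container.
// It reports whether that container was found; the grace period is set either way.
func applyGracefulShutdown(pod *PodSpec, containerName string, graceSeconds int64) bool {
	pod.TerminationGracePeriodSeconds = &graceSeconds
	for i := range pod.Containers {
		if pod.Containers[i].Name != containerName {
			continue
		}
		pod.Containers[i].Lifecycle = &Lifecycle{
			PreStop: &LifecycleHandler{
				Exec: &ExecAction{
					Command: []string{"/bin/sh", "-c", "splunk offline && splunk stop"},
				},
			},
		}
		return true
	}
	return false
}
```

Note that Kubernetes counts the time spent in the preStop hook against the termination grace period, so the configured value needs to be long enough for both commands to finish.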

Benefits

  • Graceful Shutdowns: Ensures that Splunk pods decommission properly, maintaining data integrity and reducing the risk of corruption.
  • Customization: Empowers customers to define their own termination grace periods based on their operational needs and Splunk’s shutdown requirements.
  • Improved Reliability: Minimizes unexpected downtime and operational issues related to abrupt pod terminations.
  • Kubernetes Best Practices: Aligns Splunk deployments with Kubernetes lifecycle management best practices, enhancing overall deployment robustness.

Related Issues

  • Closes #CSPL-3354: Implement lifecycle hooks for graceful pod shutdowns. Add configurable termination grace period to Splunk Operator Custom Resource.

Testing Performed

  1. Unit Tests:

    • Verified that the terminationGracePeriodSeconds from the Custom Resource is correctly applied to the StatefulSet.
    • Ensured that the preStop lifecycle hook executes the appropriate Splunk commands.
  2. Integration Tests:

    • Deployed the updated Splunk Operator in a staging environment.
    • Simulated pod terminations and confirmed that splunk offline and splunk stop commands were executed before termination.
    • Tested with different terminationGracePeriodSeconds values to ensure flexibility and correctness.
  3. Manual Testing:

    • Conducted node recycling operations to observe the behavior of Splunk pods during graceful shutdowns.
    • Verified that no data loss or corruption occurred during pod recycling.

Documentation Updates

  • Operator README:

    • Added sections detailing the new terminationGracePeriodSeconds field in the Custom Resource.
    • Provided examples demonstrating how to configure lifecycle hooks and grace periods.
  • Configuration Guides:

    • Updated guides to include best practices for setting terminationGracePeriodSeconds based on different deployment scenarios.

How to Test

  1. Update Custom Resource:

    • Modify the terminationGracePeriodSeconds in your Splunk Operator Custom Resource.
  2. Deploy or Update Splunk Cluster:

    • Apply the updated Custom Resource to deploy or update your Splunk cluster.
  3. Verify StatefulSet Configuration:

    • Ensure that the StatefulSet includes the preStop lifecycle hook and the correct terminationGracePeriodSeconds.
  4. Simulate Pod Termination:

    • Manually delete a Splunk pod and observe the execution of the preStop hook.
    • Confirm that Splunk gracefully shuts down before the pod is terminated.
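
Step 3 can also be checked programmatically. The sketch below decodes the JSON form of a StatefulSet and verifies the two settings; it assumes the input comes from something like `kubectl get statefulset <name> -o json`, and the `statefulSet` type and `checkStatefulSet` function are hypothetical helpers that decode only the fields needed here.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// statefulSet decodes only the fields relevant to this check.
type statefulSet struct {
	Spec struct {
		Template struct {
			Spec struct {
				TerminationGracePeriodSeconds *int64 `json:"terminationGracePeriodSeconds"`
				Containers                    []struct {
					Name      string `json:"name"`
					Lifecycle *struct {
						PreStop *struct {
							Exec *struct {
								Command []string `json:"command"`
							} `json:"exec"`
						} `json:"preStop"`
					} `json:"lifecycle"`
				} `json:"containers"`
			} `json:"spec"`
		} `json:"template"`
	} `json:"spec"`
}

// checkStatefulSet returns nil when the pod template carries the expected
// grace period and at least one container defines a preStop exec hook.
func checkStatefulSet(raw []byte, wantGraceSeconds int64) error {
	var sts statefulSet
	if err := json.Unmarshal(raw, &sts); err != nil {
		return fmt.Errorf("decode StatefulSet: %w", err)
	}
	podSpec := sts.Spec.Template.Spec
	if podSpec.TerminationGracePeriodSeconds == nil {
		return errors.New("terminationGracePeriodSeconds is not set")
	}
	if got := *podSpec.TerminationGracePeriodSeconds; got != wantGraceSeconds {
		return fmt.Errorf("terminationGracePeriodSeconds = %d, want %d", got, wantGraceSeconds)
	}
	for _, c := range podSpec.Containers {
		if c.Lifecycle != nil && c.Lifecycle.PreStop != nil &&
			c.Lifecycle.PreStop.Exec != nil && len(c.Lifecycle.PreStop.Exec.Command) > 0 {
			return nil
		}
	}
	return errors.New("no container defines a preStop exec hook")
}
```

This only confirms the configuration is present; whether Splunk actually shuts down cleanly still has to be observed by deleting a pod as described in step 4.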

Future Considerations

  • Enhanced Shutdown Commands: Explore the possibility of using splunk decommission if it provides more comprehensive shutdown procedures compared to splunk offline and splunk stop.
  • Dynamic Configuration: Allow for dynamic updates to the terminationGracePeriodSeconds without requiring full cluster redeployments.
  • Monitoring and Alerts: Integrate monitoring to track the execution and success of lifecycle hooks, providing alerts in case of failures.

Reviewer Notes

  • Backward Compatibility: Ensure that existing deployments without the terminationGracePeriodSeconds field continue to operate with the default grace period.
  • Security Considerations: Validate that the execution of shutdown commands does not introduce security vulnerabilities or expose sensitive information.
  • Performance Impact: Assess any potential performance implications of the added lifecycle hooks during pod terminations.

Pull Request Checklist:

  • Code changes adhere to the project's coding standards.
  • Relevant unit and integration tests are included.
  • Documentation has been updated accordingly.
  • All tests pass locally.
  • The PR description follows the project's guidelines.

@coveralls
Collaborator

Pull Request Test Coverage Report for Build 13568057412

Details

  • 52 of 54 (96.3%) changed or added relevant lines in 2 files are covered.
  • 41 unchanged lines in 5 files lost coverage.
  • Overall coverage decreased (-0.3%) to 86.289%

Changes Missing Coverage                  Covered Lines   Changed/Added Lines   %
pkg/splunk/controller/util.go             39              41                    95.12%

Files with Coverage Reduction             New Missed Lines   %
pkg/splunk/enterprise/afwscheduler.go     1                  92.93%
pkg/splunk/enterprise/clustermaster.go    2                  76.59%
pkg/splunk/enterprise/licensemanager.go   6                  73.29%
pkg/splunk/enterprise/licensemaster.go    16                 66.67%
pkg/splunk/enterprise/standalone.go       16                 64.15%

Totals
Change from base Build 13313195382: -0.3%
Covered Lines: 10554
Relevant Lines: 12231

💛 - Coveralls

@coveralls
Collaborator

coveralls commented Feb 27, 2025

Pull Request Test Coverage Report for Build 14198533480

Details

  • 52 of 54 (96.3%) changed or added relevant lines in 2 files are covered.
  • 123 unchanged lines in 6 files lost coverage.
  • Overall coverage decreased (-0.3%) to 86.319%

Changes Missing Coverage                  Covered Lines   Changed/Added Lines   %
pkg/splunk/controller/util.go             39              41                    95.12%

Files with Coverage Reduction             New Missed Lines   %
pkg/splunk/enterprise/cp.go               1                  33.33%
pkg/splunk/enterprise/clustermaster.go    2                  76.59%
pkg/splunk/enterprise/licensemanager.go   6                  73.29%
pkg/splunk/enterprise/licensemaster.go    16                 66.67%
pkg/splunk/enterprise/standalone.go       16                 64.15%
pkg/splunk/enterprise/afwscheduler.go     82                 92.96%

Totals
Change from base Build 13946362371: -0.3%
Covered Lines: 10556
Relevant Lines: 12229

💛 - Coveralls

@Igor-splunk Igor-splunk changed the title from "[Draft] CSPL-3354: REBASED Add Lifecycle Hooks and Configurable Termination Grace Period to Splunk Operator" to "CSPL-3354: REBASED Add Lifecycle Hooks and Configurable Termination Grace Period to Splunk Operator" on Mar 3, 2025
Collaborator

@vivekr-splunk vivekr-splunk left a comment

What happened to the old PR? Also, we need a description in the PR.

4 participants