Implement Abandoned Job Detection and Recovery #30367

Open
jgambarios opened this issue Oct 16, 2024 · 0 comments
Parent Issue

#29474

Task

We need to enhance our job queue system to handle abandoned jobs: jobs that were interrupted by server crashes, network failures, or other unexpected issues and were left in an inconsistent state.

Objective:

Implement mechanisms to detect abandoned jobs and provide recovery strategies to ensure system reliability and data consistency.

Proposed Strategies:

  1. Job Heartbeats:

    • Implement periodic heartbeat updates for running jobs
    • Create a background process to identify jobs with stale heartbeats
  2. Timeout Mechanisms:

    • Add a max_execution_time field to job configurations
    • Implement a background process to check for jobs exceeding their maximum execution time
  3. Recovery Procedures:

    • Develop a recovery process for jobs detected as abandoned
    • Identify jobs in inconsistent states and apply appropriate recovery actions
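
As a rough illustration of how the heartbeat and timeout checks could fit together, here is a minimal Python sketch. Only max_execution_time comes from this issue; the Job class, its other fields, the state names, and the heartbeat_timeout parameter are assumptions made for the example and do not reflect the existing job queue code.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Job:
    # Hypothetical job record; field names other than max_execution_time are assumed.
    job_id: str
    state: str                  # e.g. "RUNNING", "COMPLETED"
    started_at: float           # seconds since epoch
    last_heartbeat: float       # refreshed periodically by the worker
    max_execution_time: Optional[float] = None  # seconds; None means no limit


def record_heartbeat(job: Job, now: float) -> None:
    """Called periodically by the worker while the job is running."""
    job.last_heartbeat = now


def find_abandoned(jobs: List[Job], now: float,
                   heartbeat_timeout: float) -> List[Tuple[str, str]]:
    """Return (job_id, reason) pairs for running jobs that look abandoned."""
    abandoned = []
    for job in jobs:
        if job.state != "RUNNING":
            continue  # only running jobs can be abandoned
        if now - job.last_heartbeat > heartbeat_timeout:
            abandoned.append((job.job_id, "stale_heartbeat"))
        elif (job.max_execution_time is not None
              and now - job.started_at > job.max_execution_time):
            abandoned.append((job.job_id, "timeout"))
    return abandoned
```

A background process would call find_abandoned on a fixed interval and hand each result to the recovery process. Checking the heartbeat first means a crashed worker is reported as a stale heartbeat rather than as a timeout.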

Additional Considerations:

  • Ensure that abandoned job recovery doesn't conflict with distributed locking mechanisms
  • Consider the impact on job queue performance and optimize where necessary
  • Evaluate and document any changes to the system's fault tolerance and high availability characteristics
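
One possible way to keep recovery from racing with other nodes or with existing locks is to claim an abandoned job with a single conditional update, so that only one recovery process succeeds. The sketch below uses Python's sqlite3 module purely for illustration; the jobs table, its columns, and the RECOVERING state are assumptions, not the project's actual schema.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical table layout used only for this example.
    conn.execute(
        "CREATE TABLE jobs ("
        "id TEXT PRIMARY KEY, state TEXT, owner TEXT, last_heartbeat REAL)"
    )


def claim_abandoned_job(conn: sqlite3.Connection, job_id: str,
                        stale_before: float, claimed_by: str) -> bool:
    """Atomically move a stale RUNNING job to RECOVERING.

    Returns True only for the caller whose UPDATE matched the row, so two
    recovery processes cannot both claim the same job.
    """
    cur = conn.execute(
        "UPDATE jobs SET state = 'RECOVERING', owner = ? "
        "WHERE id = ? AND state = 'RUNNING' AND last_heartbeat < ?",
        (claimed_by, job_id, stale_before),
    )
    conn.commit()
    return cur.rowcount == 1
```

Because the state check and the state change happen in one statement, a job whose worker resumes and refreshes its heartbeat before the claim is left alone.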

Proposed Objective

Core Features

Proposed Priority

Priority 2 - Important

Acceptance Criteria

  1. System can detect jobs that have been abandoned due to server crashes or other issues
  2. Abandoned jobs are automatically handled according to configured recovery strategies
  3. All new functionality is covered by appropriate tests
  4. System performance is not significantly impacted by new abandoned job handling processes
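
To make "configured recovery strategies" in criterion 2 more concrete, here is a small Python sketch of dispatching an abandoned job to a named strategy. The strategy names ("retry", "fail"), the state names, and the max_retries default are hypothetical and would need to be agreed on as part of this task.

```python
from dataclasses import dataclass


@dataclass
class AbandonedJob:
    # Hypothetical record for a job already flagged as abandoned.
    job_id: str
    state: str = "ABANDONED"
    retries: int = 0


def retry_strategy(job: AbandonedJob, max_retries: int = 3) -> None:
    """Requeue the job until the retry budget is used up, then fail it."""
    if job.retries < max_retries:
        job.retries += 1
        job.state = "QUEUED"
    else:
        job.state = "FAILED"


def fail_strategy(job: AbandonedJob) -> None:
    """Mark the job as failed without retrying."""
    job.state = "FAILED"


# Assumed mapping from a configured strategy name to its handler.
STRATEGIES = {"retry": retry_strategy, "fail": fail_strategy}


def recover(job: AbandonedJob, strategy_name: str) -> str:
    """Apply the configured strategy and return the job's new state."""
    if strategy_name not in STRATEGIES:
        raise ValueError(f"unknown recovery strategy: {strategy_name}")
    STRATEGIES[strategy_name](job)
    return job.state
```

A test for criterion 2 could then assert the resulting state for each configured strategy.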