Implement Abandoned Job Detection and Recovery #30367

jgambarios · 2024-10-16T17:39:02Z

Parent Issue

Task

We need to enhance our job queue system to handle abandoned jobs. These are jobs that may have been interrupted due to server crashes, network failures, or other unexpected issues, leaving them in an inconsistent state.

Objective:

Implement mechanisms to detect abandoned jobs and provide recovery strategies to ensure system reliability and data consistency.

Proposed Strategies:

Job Heartbeats:
- Implement periodic heartbeat updates for running jobs
- Create a background process to identify jobs with stale heartbeats

Timeout Mechanisms:
- Add a max_execution_time field to job configurations
- Implement a background process to check for jobs exceeding their maximum execution time

Recovery Procedures:
- Develop a recovery process
- Identify jobs in inconsistent states and apply appropriate recovery actions
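
To make the three strategies above concrete, here is a minimal sketch of one detector pass, not a proposal for the actual implementation. Only `max_execution_time` comes from this issue. The `Job` class, the state names, `HEARTBEAT_TIMEOUT`, the retry budget, and all function names are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional

# Assumed threshold: a running job whose last heartbeat is older than this is stale.
HEARTBEAT_TIMEOUT = timedelta(seconds=60)


@dataclass
class Job:
    job_id: str
    state: str                     # hypothetical states: "RUNNING", "PENDING", "FAILED"
    started_at: datetime
    last_heartbeat: datetime
    max_execution_time: timedelta  # the field proposed in this issue
    retry_count: int = 0
    max_retries: int = 3           # assumed retry budget


def abandonment_reason(job: Job, now: datetime) -> Optional[str]:
    """Return why a job looks abandoned, or None if it looks healthy."""
    if job.state != "RUNNING":
        return None
    if now - job.last_heartbeat > HEARTBEAT_TIMEOUT:
        return "stale_heartbeat"
    if now - job.started_at > job.max_execution_time:
        return "exceeded_max_execution_time"
    return None


def recover(job: Job) -> None:
    """One possible recovery action: re-queue until the retry budget is spent, then fail."""
    if job.retry_count < job.max_retries:
        job.retry_count += 1
        job.state = "PENDING"
    else:
        job.state = "FAILED"


def sweep(jobs: List[Job], now: datetime) -> Dict[str, str]:
    """Single pass of the background detector; returns {job_id: reason} for recovered jobs."""
    recovered: Dict[str, str] = {}
    for job in jobs:
        reason = abandonment_reason(job, now)
        if reason is not None:
            recover(job)
            recovered[job.job_id] = reason
    return recovered
```

In this sketch the heartbeat check runs before the timeout check, so a job that is both silent and overdue is reported as `stale_heartbeat`. Whether re-queueing is a safe recovery action depends on the job being idempotent, which the sketch does not verify.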

Additional Considerations:

- Ensure that abandoned job recovery doesn't conflict with distributed locking mechanisms
- Consider the impact on job queue performance and optimize where necessary
- Evaluate and document any changes to the system's fault tolerance and high availability characteristics
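
On the first consideration, one way to keep two detector instances from recovering the same job is a versioned compare-and-set on the job row. This is optimistic concurrency, not a distributed lock, and the sketch below is only an illustration: `JobStore`, its methods, and the state names are hypothetical, and a `threading.Lock` stands in for the atomicity a database conditional update would provide.

```python
import threading
from typing import Dict, Tuple


class JobStore:
    """In-memory stand-in for the job table (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        # job_id -> (state, version)
        self._rows: Dict[str, Tuple[str, int]] = {}

    def add(self, job_id: str, state: str) -> None:
        with self._lock:
            self._rows[job_id] = (state, 0)

    def get(self, job_id: str) -> Tuple[str, int]:
        with self._lock:
            return self._rows[job_id]

    def compare_and_set(self, job_id: str, expected_state: str,
                        expected_version: int, new_state: str) -> bool:
        """Change the state only if the row is unchanged since it was read."""
        with self._lock:
            state, version = self._rows[job_id]
            if state != expected_state or version != expected_version:
                return False  # another worker or detector got there first
            self._rows[job_id] = (new_state, version + 1)
            return True


def try_recover(store: JobStore, job_id: str) -> bool:
    """Attempt to re-queue a running job; returns True only for the single winner."""
    state, version = store.get(job_id)
    if state != "RUNNING":
        return False
    return store.compare_and_set(job_id, "RUNNING", version, "PENDING")
```

If two detectors read the same version, only the first `compare_and_set` succeeds and the second becomes a no-op, so recovery never needs to take over whatever lock the abandoned job held.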

Proposed Objective

Core Features

Proposed Priority

Priority 2 - Important

Acceptance Criteria

- System can detect jobs that have been abandoned due to server crashes or other issues
- Abandoned jobs are automatically handled according to configured recovery strategies
- All new functionality is covered by appropriate tests
- System performance is not significantly impacted by new abandoned job handling processes

jgambarios added the Team : Scout, Triage, and Type : Task labels on Oct 16, 2024
