Resolve failures pauseless ingestion (no reingestion) #14853
Closed
9aman wants to merge 30 commits into apache:master
1. Changing FSM 2. Changing the 3 steps performed during the commit protocol to update ZK and Ideal state
1. Changes in the commit protocol to start segment commit before the build 2. Changes in the BaseTableDataManager to ensure that the locally built segment is replaced by a downloaded one only when the CRC is present in the ZK Metadata 3. Changes in the download segment method to allow waited download in case of pauseless consumption
…segment commit end metadata call Refactoring code for readability
… ingestion by moving it out of streamConfigMap
…auseless ingestion in RealtimeSegmentValidationManager
…d by RealtimeSegmentValidationManager to fix commit protocol failures
…g commit protocol
…ption is enabled or not
…eepstore path with fallbacks
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #14853 +/- ##
============================================
+ Coverage 61.75% 63.72% +1.97%
- Complexity 207 1612 +1405
============================================
Files 2436 2710 +274
Lines 133233 151476 +18243
Branches 20636 23379 +2743
============================================
+ Hits 82274 96534 +14260
- Misses 44911 47687 +2776
- Partials 6048 7255 +1207
Pauseless Ingestion Failure Resolution
Please refer to PR #14741 for the happy path. This PR covers only the failure scenarios. Once that PR is merged, a cleaner diff covering only the failure handling will be visible.
For the time being, to view only the diff covering the failure scenarios, refer to:
Summary
This PR provides ways to resolve the failure scenarios that can occur during pauseless ingestion. The detailed list of failure scenarios can be found here: link, and the failure handling strategies here: link
The following sequence diagrams summarize the failure scenarios and their resolution.


Failure Scenarios & Resolution Approaches
Failures encountered during the commit protocol can be categorized into two types: recoverable and unrecoverable failures.
Recoverable failures are those in which at least one of the servers retains the segment on disk.
Unrecoverable failures occur when none of the servers have the segment on disk.
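The distinction can be illustrated with a minimal Java sketch. It assumes the controller can learn, per server, whether the built segment is still on local disk; the class, enum, and method names below are hypothetical and are not taken from the Pinot codebase.

```java
import java.util.Map;

// Hypothetical sketch: these names are illustrative, not Pinot classes.
final class FailureClassifier {
  enum FailureType { RECOVERABLE, UNRECOVERABLE }

  // segmentOnDiskByServer maps a server instance name to whether that server
  // still has the segment on local disk (assumed to be known to the caller).
  static FailureType classify(Map<String, Boolean> segmentOnDiskByServer) {
    boolean anyServerHasSegment =
        segmentOnDiskByServer.values().stream().anyMatch(Boolean::booleanValue);
    // Recoverable as long as at least one server retains the segment.
    return anyServerHasSegment ? FailureType.RECOVERABLE : FailureType.UNRECOVERABLE;
  }

  public static void main(String[] args) {
    System.out.println(classify(Map.of("server1", false, "server2", true)));
    System.out.println(classify(Map.of("server1", false, "server2", false)));
  }
}
```

Under this sketch, a table with no replicas reporting the segment is treated the same as one where every replica lost it.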
Recoverable Failures
Recoverable failures will be addressed through RealtimeSegmentValidationManager. This approach will handle scenarios such as upload failures and incomplete commit protocol executions.
The controller or server can fail between any of the steps of the commit protocol listed below:
Request Type: COMMIT_START
Request Type: COMMIT_END_METADATA
4. Update Segment ZK metadata for the committing segment (seg__0__0):
- Change status to DONE.
- Update deepstore url.
- Any additional metadata.
The RealtimeSegmentValidationManager figures out which step of the commit protocol failed and how it can be fixed. This is very similar to how commit protocol failures were handled before, with some minor changes.
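As a rough illustration of step 4 and of how an incomplete commit might be detected, here is a minimal Java sketch. It is not the PR's implementation: the types are hypothetical, the COMMITTING status name is an assumption, and treating a missing deepstore URL as "incomplete" is an assumed rule chosen only to mirror the fields listed in step 4.

```java
// Hypothetical sketch of the step 4 metadata update and a completeness check.
// None of these types are Pinot classes; they only mirror the fields named in step 4.
final class CommitEndSketch {
  // DONE comes from the step description; COMMITTING is an assumed name for
  // a segment whose commit has started but not yet finished.
  enum Status { COMMITTING, DONE }

  static final class SegmentZkMetadata {
    final String segmentName;
    final Status status;
    final String deepstoreUrl; // null until the segment has been uploaded

    SegmentZkMetadata(String segmentName, Status status, String deepstoreUrl) {
      this.segmentName = segmentName;
      this.status = status;
      this.deepstoreUrl = deepstoreUrl;
    }
  }

  // Step 4: mark the committing segment DONE and record its deepstore URL.
  static SegmentZkMetadata applyCommitEndMetadata(SegmentZkMetadata current, String deepstoreUrl) {
    return new SegmentZkMetadata(current.segmentName, Status.DONE, deepstoreUrl);
  }

  // Assumed rule: the commit is incomplete if step 4 never ran (status is not
  // DONE) or ran without a deepstore URL being recorded.
  static boolean isCommitIncomplete(SegmentZkMetadata metadata) {
    return metadata.status != Status.DONE
        || metadata.deepstoreUrl == null
        || metadata.deepstoreUrl.isEmpty();
  }

  public static void main(String[] args) {
    SegmentZkMetadata before = new SegmentZkMetadata("seg__0__0", Status.COMMITTING, null);
    System.out.println(isCommitIncomplete(before));
    SegmentZkMetadata after = applyCommitEndMetadata(before, "s3://example-bucket/seg__0__0");
    System.out.println(isCommitIncomplete(after));
  }
}
```

A periodic validation task could run a check of this kind over segments and trigger a repair only for those that are flagged, which keeps the repair path separate from the normal commit path.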
Unrecoverable Failures (will be covered in a separate PR)
These failures require ingesting the segment again from upstream, followed by build, upload and ZK metadata update.