
Resolve failures pauseless ingestion (no reingestion) #14853

Closed
9aman wants to merge 30 commits into apache:master from 9aman:resolve-failures-pauseless-ingestion-no-reingestion

9aman commented Jan 21, 2025

Pauseless Ingestion Failure Resolution

Please refer to PR #14741 for the happy path. This PR covers only the failure scenarios. Once #14741 is merged, a cleaner diff covering only the failure handling will be visible here.

For the time being, to view only the diff covering the failure scenarios, refer to:

Summary

This PR provides ways to resolve the failure scenarios that can occur during pauseless ingestion. The detailed list of failure scenarios can be found here: link, along with the failure handling strategies: link

The following sequence diagrams summarize the failure scenarios and their resolution.
[Sequence diagram: Screenshot 2025-01-03 at 2 53 46 PM]
[Sequence diagram: Screenshot 2025-01-03 at 2 54 45 PM]

Failure Scenarios & Resolution Approaches

Failures encountered during the commit protocol can be categorized into two types: recoverable and unrecoverable failures.

Recoverable failures are those in which at least one of the servers retains the segment on disk.

Unrecoverable failures occur when none of the servers have the segment on disk.
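
The distinction above can be sketched as a small predicate. This is only an illustration: `CommitFailureClassifier`, `FailureType`, and the input map are hypothetical names, not types from the Pinot codebase, and the sketch assumes the caller already knows which replicas still hold the built segment locally.

```java
import java.util.Map;

// Hypothetical helper for illustration only; not part of Pinot.
final class CommitFailureClassifier {
    enum FailureType { RECOVERABLE, UNRECOVERABLE }

    // serverHasSegmentOnDisk maps a server instance id to whether that
    // replica still has the committing segment on its local disk.
    static FailureType classify(Map<String, Boolean> serverHasSegmentOnDisk) {
        boolean anyCopy = serverHasSegmentOnDisk.values().stream()
                .anyMatch(Boolean::booleanValue);
        // One surviving on-disk copy is enough to recover without reingestion.
        return anyCopy ? FailureType.RECOVERABLE : FailureType.UNRECOVERABLE;
    }

    public static void main(String[] args) {
        // prints RECOVERABLE
        System.out.println(classify(Map.of("Server_1", false, "Server_2", true)));
        // prints UNRECOVERABLE
        System.out.println(classify(Map.of("Server_1", false, "Server_2", false)));
    }
}
```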

Recoverable Failures

Recoverable failures will be addressed through RealtimeSegmentValidationManager. This approach will handle scenarios such as upload failures and incomplete commit protocol executions.

The controller or server can fail between any two steps of the commit protocol, which are listed below:

Request Type: COMMIT_START

  1. Update the Segment ZK metadata for the committing segment (seg__0__0)
    • Change status to COMMITTING
    • Set endOffset
  2. Create Segment ZK metadata for the new segment (seg__0__1) with status IN_PROGRESS
  3. Update the Ideal State for the:
    • Committing segment (seg__0__0) to ONLINE
    • New/consuming segment (seg__0__1) to CONSUMING

Request Type: COMMIT_END_METADATA

  4. Update Segment ZK metadata for the committing segment (seg__0__0)
    • Change status to DONE
    • Update deepstore URL
    • Any additional metadata
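
The four steps above can be modeled with a minimal in-memory sketch. Everything here is an assumption made for illustration: `PauselessCommitSketch` and its members are hypothetical names, ZooKeeper and Helix are replaced by plain maps, and the ideal state is flattened to one state per segment instead of one state per segment and instance. It is not the controller's actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of the commit protocol; not Pinot code.
final class PauselessCommitSketch {
    enum Status { IN_PROGRESS, COMMITTING, DONE }

    static final class SegmentZkMetadata {
        Status status;
        long endOffset = -1L;   // unset until COMMIT_START
        String downloadUrl;     // unset until COMMIT_END_METADATA
        SegmentZkMetadata(Status status) { this.status = status; }
    }

    // Stand-ins for segment ZK metadata and the Helix ideal state.
    final Map<String, SegmentZkMetadata> zkMetadata = new HashMap<>();
    final Map<String, String> idealState = new HashMap<>();

    // COMMIT_START: steps 1 to 3. A crash between any two statements
    // leaves the partial state that the validation manager must repair.
    void commitStart(String committing, String next, long endOffset) {
        SegmentZkMetadata meta = zkMetadata.get(committing);
        meta.status = Status.COMMITTING;                                  // step 1
        meta.endOffset = endOffset;                                       // step 1
        zkMetadata.put(next, new SegmentZkMetadata(Status.IN_PROGRESS));  // step 2
        idealState.put(committing, "ONLINE");                             // step 3
        idealState.put(next, "CONSUMING");                                // step 3
    }

    // COMMIT_END_METADATA: step 4, after the segment is built and uploaded.
    void commitEndMetadata(String committing, String downloadUrl) {
        SegmentZkMetadata meta = zkMetadata.get(committing);
        meta.status = Status.DONE;
        meta.downloadUrl = downloadUrl;
    }

    public static void main(String[] args) {
        PauselessCommitSketch sketch = new PauselessCommitSketch();
        sketch.zkMetadata.put("seg__0__0", new SegmentZkMetadata(Status.IN_PROGRESS));
        sketch.idealState.put("seg__0__0", "CONSUMING");

        sketch.commitStart("seg__0__0", "seg__0__1", 100L);
        sketch.commitEndMetadata("seg__0__0", "deepstore://example/seg__0__0");

        System.out.println(sketch.zkMetadata.get("seg__0__0").status);
        System.out.println(sketch.idealState);
    }
}
```

In this model the new segment starts consuming as soon as `commitStart` returns, before the committing segment is built and uploaded, which is why a failure after step 3 still leaves a segment in COMMITTING that needs repair.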

The RealtimeSegmentValidationManager determines which step of the commit protocol failed and how it can be fixed. This is very similar to how commit protocol failures were handled before, with some minor changes.
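
As a rough sketch of that diagnosis, the last completed step can be inferred from the state left behind in ZK and the Ideal State. The class, enum values, and boolean inputs below are hypothetical and chosen for illustration; the actual checks in RealtimeSegmentValidationManager are in the diff and may differ.

```java
// Hypothetical diagnosis helper for illustration only; not Pinot code.
final class CommitRepairSketch {
    enum Status { IN_PROGRESS, COMMITTING, DONE }

    enum Repair {
        NONE_COMMIT_NOT_STARTED,      // step 1 never ran
        CREATE_NEW_SEGMENT_METADATA,  // failed after step 1
        UPDATE_IDEAL_STATE,           // failed after step 2
        COMPLETE_COMMIT_END_METADATA, // failed after step 3
        NONE_COMMIT_COMPLETE          // step 4 finished
    }

    // committingStatus:   ZK status of the committing segment (seg__0__0)
    // nextMetadataExists: whether ZK metadata exists for the new segment (seg__0__1)
    // idealStateUpdated:  whether the Ideal State shows ONLINE / CONSUMING for the pair
    static Repair diagnose(Status committingStatus,
                           boolean nextMetadataExists,
                           boolean idealStateUpdated) {
        if (committingStatus == Status.IN_PROGRESS) {
            return Repair.NONE_COMMIT_NOT_STARTED;
        }
        if (!nextMetadataExists) {
            return Repair.CREATE_NEW_SEGMENT_METADATA;
        }
        if (!idealStateUpdated) {
            return Repair.UPDATE_IDEAL_STATE;
        }
        if (committingStatus == Status.COMMITTING) {
            return Repair.COMPLETE_COMMIT_END_METADATA;
        }
        return Repair.NONE_COMMIT_COMPLETE;
    }

    public static void main(String[] args) {
        // Controller failed right after step 2: metadata exists, Ideal State is stale.
        System.out.println(diagnose(Status.COMMITTING, true, false));
    }
}
```

The checks run in protocol order, so the first missing piece of state identifies the step to resume from.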

Unrecoverable Failures (will be covered in a separate PR)

These failures require ingesting the segment again from upstream, followed by build, upload and ZK metadata update.

9aman and others added 30 commits January 2, 2025 16:57
1. Changing FSM
2. Changing the 3 steps performed during the commit protocol to update ZK and Ideal state
1. Changes in the commit protocol to start segment commit before the build
2. Changes in the BaseTableDataManager to ensure that the locally built segment is replaced by a downloaded one
   only when the CRC is present in the ZK Metadata
3. Changes in the download segment method to allow waited download in case of pauseless consumption
…segment commit end metadata call

Refactoring code for readability
… ingestion by moving it out of streamConfigMap
…auseless ingestion in RealtimeSegmentValidationManager
…d by RealtimeSegmentValidationManager to fix commit protocol failures

codecov-commenter commented Jan 21, 2025

Codecov Report

Attention: Patch coverage is 38.26087% with 142 lines in your changes missing coverage. Please review.

Project coverage is 63.72%. Comparing base (59551e4) to head (58082a2).
Report is 1600 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...che/pinot/server/api/resources/TablesResource.java | 0.00% | 70 Missing ⚠️ |
| .../core/realtime/PinotLLCRealtimeSegmentManager.java | 55.00% | 40 Missing and 5 partials ⚠️ |
| ...e/pinot/common/utils/FileUploadDownloadClient.java | 0.00% | 10 Missing ⚠️ |
| ...ommon/metadata/segment/SegmentZKMetadataUtils.java | 75.00% | 3 Missing and 3 partials ⚠️ |
| ...r/validation/RealtimeSegmentValidationManager.java | 0.00% | 3 Missing ⚠️ |
| ...troller/helix/core/util/FailureInjectionUtils.java | 50.00% | 1 Missing and 1 partial ⚠️ |
| ...ata/manager/realtime/RealtimeTableDataManager.java | 0.00% | 2 Missing ⚠️ |
| ...altime/ServerSegmentCompletionProtocolHandler.java | 0.00% | 2 Missing ⚠️ |
| ...ntroller/helix/core/PinotHelixResourceManager.java | 0.00% | 1 Missing ⚠️ |
| .../pinot/core/data/manager/BaseTableDataManager.java | 88.88% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #14853      +/-   ##
============================================
+ Coverage     61.75%   63.72%   +1.97%     
- Complexity      207     1612    +1405     
============================================
  Files          2436     2710     +274     
  Lines        133233   151476   +18243     
  Branches      20636    23379    +2743     
============================================
+ Hits          82274    96534   +14260     
- Misses        44911    47687    +2776     
- Partials       6048     7255    +1207     

| Flag | Coverage Δ |
| --- | --- |
| custom-integration1 | 100.00% <ø> (+99.99%) ⬆️ |
| integration | 100.00% <ø> (+99.99%) ⬆️ |
| integration1 | 100.00% <ø> (+99.99%) ⬆️ |
| integration2 | 0.00% <ø> (ø) |
| java-11 | 63.65% <38.26%> (+1.94%) ⬆️ |
| java-21 | 63.62% <38.26%> (+1.99%) ⬆️ |
| skip-bytebuffers-false | 63.67% <38.26%> (+1.92%) ⬆️ |
| skip-bytebuffers-true | 63.60% <38.26%> (+35.87%) ⬆️ |
| temurin | 63.72% <38.26%> (+1.97%) ⬆️ |
| unittests | 63.72% <38.26%> (+1.97%) ⬆️ |
| unittests1 | 56.30% <59.61%> (+9.41%) ⬆️ |
| unittests2 | 34.05% <26.95%> (+6.32%) ⬆️ |


9aman closed this Jan 23, 2025