Skip to content

Conversation

rdhabalia
Copy link
Contributor

@rdhabalia rdhabalia commented Mar 21, 2023

Motivation

If due to any operation reason if schema ledger metadata gets deleted then broker is not able to handle that non-recoverable error and not able to load the topic. So, if one of the schema ledger is broken/deleted or ledger metadata is missing , then broker logs below exception because ledger without metadata gives NoSuchLedgerExistsOnMetadataServerException which is not considered as non-recoverable error.

2023-03-21T10:40:17,930-0700 [main-EventThread] ERROR org.apache.pulsar.broker.service.schema.SchemaRegistryServiceImpl - [prop1/ns1/t2] Put schema failed
java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.schema.exceptions.SchemaException: No such ledger exists on Metadata Server -  ledger=8 - operation=Failed to open ledger
        at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:368) ~[?:?]
        at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:377) ~[?:?]
        at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:1097) ~[?:?]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162) ~[?:?]
        at org.apache.pulsar.broker.service.schema.SchemaRegistryServiceImpl.lambda$trimDeletedSchemaAndGetList$30(SchemaRegistryServiceImpl.java:553) ~[pulsar-broker-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:934) ~[?:?]
        at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:911) ~[?:?]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162) ~[?:?]
        at org.apache.pulsar.broker.service.schema.BookkeeperSchemaStorage.lambda$openLedger$41(BookkeeperSchemaStorage.java:603) ~[pulsar-broker-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.apache.bookkeeper.client.LedgerOpenOp.openComplete(LedgerOpenOp.java:267) ~[bookkeeper-server-4.15.4.jar:4.15.4]
        at org.apache.bookkeeper.client.LedgerOpenOp.lambda$initiate$0(LedgerOpenOp.java:118) ~[bookkeeper-server-4.15.4.jar:4.15.4]
        at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:990) ~[?:?]
        at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:974) ~[?:?]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162) ~[?:?]
        at org.apache.pulsar.metadata.bookkeeper.PulsarLedgerManager.lambda$readLedgerMetadata$2(PulsarLedgerManager.java:208) ~[pulsar-metadata-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:718) ~[?:?]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
        at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147) ~[?:?]
        at org.apache.pulsar.metadata.impl.ZKMetadataStore.handleGetResult(ZKMetadataStore.java:269) ~[pulsar-metadata-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.apache.pulsar.metadata.impl.ZKMetadataStore.lambda$batchOperation$5(ZKMetadataStore.java:219) ~[pulsar-metadata-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.apache.pulsar.metadata.impl.PulsarZooKeeperClient$3$1.processResult(PulsarZooKeeperClient.java:489) ~[pulsar-metadata-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]

Modifications

Handle NoSuchLedgerExistsOnMetadataServerException non-recoverable error and successfully load such topic with broken schema ledger.

Verifying this change

  • Make sure that the change passes the CI checks.

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (10MB)
  • Extended integration test for recovery after broker failure

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository:

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@poorbarcode
Copy link
Contributor

/pulsarbot rerun-failure-checks

@cbornet cbornet closed this Apr 11, 2023
@cbornet cbornet reopened this Apr 11, 2023
@cbornet cbornet modified the milestones: 3.0.0, 3.1.0 Apr 11, 2023
@Denovo1998
Copy link
Contributor

The PR of type fix should be deferred to 3.0.1.

@codecov-commenter
Copy link

codecov-commenter commented Apr 21, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 69.48%. Comparing base (1e44ba1) to head (1bd2776).
Report is 1729 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@              Coverage Diff              @@
##             master   #19882       +/-   ##
=============================================
+ Coverage     24.42%   69.48%   +45.06%     
- Complexity      291     3652     +3361     
=============================================
  Files          1603     1873      +270     
  Lines        124343   152550    +28207     
  Branches      13571    18653     +5082     
=============================================
+ Hits          30369   106004    +75635     
+ Misses        89490    37612    -51878     
- Partials       4484     8934     +4450     
Flag Coverage Δ
inttests 25.77% <ø> (+1.35%) ⬆️
systests 26.66% <ø> (?)
unittests 71.75% <ø> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...broker/service/schema/BookkeeperSchemaStorage.java 78.72% <ø> (+17.32%) ⬆️

... and 1591 files with indirect coverage changes

@codelipenghui
Copy link
Contributor

/pulsarbot run-failure-checks

Copy link
Contributor

@poorbarcode poorbarcode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two questions:

  1. Should we make it will only work if enabled autoSkipNonRecoverableData?
  2. This PR is the same as #18010. I do not expect to use 18010 instead current one, but suggestion copying the test which In #18010 into the current PR, and then I can close PR #18010

@Denovo1998
Copy link
Contributor

On the first question, I asked the same question in PR 18010. Whether it should be at the time of enable autoSkipNonRecoverableData, to get NoSuchLedgerExistsOnMetadataServerException as a recoverable error.
#18010 (comment)

@codelipenghui
Copy link
Contributor

@rdhabalia Could you please take a look at @poorbarcode 's comment?

@github-actions
Copy link

github-actions bot commented Jun 8, 2023

The pr had no activity for 30 days, mark with Stale label.

@github-actions github-actions bot added the Stale label Jun 8, 2023
@Technoboy- Technoboy- modified the milestones: 3.1.0, 3.2.0 Jul 31, 2023
@Technoboy- Technoboy- removed this from the 3.2.0 milestone Dec 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants