Skip to content

Conversation

@yyj8
Copy link
Contributor

@yyj8 yyj8 commented Jul 10, 2024

Fixes #23020

Main Issue: #xyz

PIP: #xyz

Motivation

The bookie process is running normally and there is corresponding bookie information in the metadata. Therefore, the broker's cache should also have corresponding bookie information. But the broker's cache does not have corresponding bookie information and causing broker to misjudge that the bookie is unavailable.

The broker prints the following exception information:

2024-06-11T12:26:24,569+0800 [pulsar-io-18-1] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.172.240.12:54152][persistent://public/default/writeCKDeadLetter-partition-0] Creating producer. producerId=1
2024-06-11T12:26:24,569+0800 [pulsar-io-18-1] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Opening managed ledger public/default/persistent/writeCKDeadLetter-partition-0
2024-06-11T12:26:24,570+0800 [main-EventThread] INFO  org.apache.bookkeeper.client.DefaultBookieAddressResolver - Cannot resolve 127.0.0.1:3181, bookie is unknown org.apache.bookkeeper.client.BKException$BKBookieHandleNotAvailableException: Bookie handle is not available
2024-06-11T12:26:24,570+0800 [main-EventThread] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Cannot connect to 127.0.0.1:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$BookieIdNotResolvedException: Cannot resolve bookieId 127.0.0.1:3181, bookie does not exist or it is not running
2024-06-11T12:26:24,570+0800 [BookKeeperClientWorker-OrderedExecutor-0-0] ERROR org.apache.bookkeeper.client.ReadLastConfirmedOp - While readLastConfirmed ledger: 31003 did not hear success responses from all quorums, QuorumCoverage(e:1,w:1,a:1) = [-8]
2024-06-11T12:26:24,570+0800 [BookKeeperClientWorker-OrderedExecutor-0-0] ERROR org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [public/default/persistent/writeCKDeadLetter-partition-0] Failed to open ledger 31003: Error while recovering ledger
2024-06-11T12:26:24,570+0800 [BookKeeperClientWorker-OrderedExecutor-0-0] ERROR org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl - [public/default/persistent/writeCKDeadLetter-partition-0] Failed to initialize managed ledger: Error while recovering ledger
2024-06-11T12:26:24,570+0800 [BookKeeperClientWorker-OrderedExecutor-0-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [public/default/persistent/writeCKDeadLetter-partition-0] Closing managed ledger

Modifications

pulsar-metadata/src/main/java/org/apache/pulsar/metadata/bookkeeper/PulsarRegistrationClient.java#getBookieServiceInfo

yyj8@c52069f#diff-7a5305b98183695c3d8246b9e9ccafa68180d7f9e8df5534019d6bbcc59a90f6

Verifying this change

  • Make sure that the change passes the CI checks.

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (10MB)
  • Extended integration test for recovery after broker failure

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository:

…readOnlyBookieInfo cache update fail causing broker to misjudge that the bookie is unavailable.
@github-actions
Copy link

@yyj8 Please add the following content to your PR description and select a checkbox:

- [ ] `doc` <!-- Your PR contains doc changes -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [ ] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

@github-actions github-actions bot added doc-not-needed Your PR changes do not impact docs and removed doc-label-missing labels Jul 10, 2024
@lhotari lhotari requested review from merlimat and rdhabalia October 9, 2024 13:33
@lhotari lhotari added the release/blocker Indicate the PR or issue that should block the release until it gets resolved label Oct 9, 2024
@lhotari lhotari requested a review from hangc0276 May 14, 2025 08:45
if ((info = writableBookieInfo.get(bookieId)) == null) {
log.warn("Bookie {} not found from all cache, now update writableBookieInfo cache "
+ "and then try to get from writableBookieInfo cache.", bookieId);
readBookieInfoAsWritableBookie(bookieId);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

readBookieInfoAsWritableBookie returns a future so this doesn't seem to be the correct approach here. It would be necessary to use asynchronous callback chaining using CompletableFuture's .then* methods.

@lhotari
Copy link
Member

lhotari commented May 14, 2025

It seems that #20642 attempted to fix some issues in this area. Useful for more context.

@lhotari lhotari requested a review from mattisonchao May 14, 2025 08:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

doc-not-needed Your PR changes do not impact docs ready-to-test release/blocker Indicate the PR or issue that should block the release until it gets resolved

Projects

None yet

2 participants