Skip to content

Conversation

Technoboy-
Copy link
Contributor

Motivation

We're facing data loss during an cluster upgrade. The issue manifests as three distinct error types post-upgrade: BadVersion errors, "No Such Ledger" errors, and ledger authentication failures. The root cause appears to be a metadata synchronization issue where the Pulsar metadata store implementation lacks proper sync calls before listing ledgers.

As bk side, there is a sync call before getChildren
https://github.com/apache/bookkeeper/blob/2789316c18e12cbb6d17fa4a023410dbad6593a0/bookkeeper-server/src/main/java/org/apache/bookkeeper/util/ZkUtils.java#L284-L311

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

@Technoboy- Technoboy- self-assigned this Aug 26, 2025
@Technoboy- Technoboy- added this to the 4.1.0 milestone Aug 26, 2025
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Aug 26, 2025
Copy link
Member

@nodece nodece left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@codecov-commenter
Copy link

codecov-commenter commented Aug 26, 2025

Codecov Report

❌ Patch coverage is 75.00000% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.26%. Comparing base (da0d116) to head (ae88232).
⚠️ Report is 8 commits behind head on master.

Files with missing lines Patch % Lines
...sar/metadata/impl/FaultInjectionMetadataStore.java 0.00% 4 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #24665      +/-   ##
============================================
+ Coverage     74.25%   74.26%   +0.01%     
+ Complexity    33174    32843     -331     
============================================
  Files          1884     1885       +1     
  Lines        146943   146963      +20     
  Branches      16882    16929      +47     
============================================
+ Hits         109110   109143      +33     
+ Misses        29163    29126      -37     
- Partials       8670     8694      +24     
Flag Coverage Δ
inttests 26.71% <25.00%> (+0.04%) ⬆️
systests 22.67% <25.00%> (-0.63%) ⬇️
unittests 73.76% <75.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
.../org/apache/pulsar/metadata/api/MetadataStore.java 82.35% <ø> (+5.88%) ⬆️
.../bookkeeper/AbstractHierarchicalLedgerManager.java 77.41% <100.00%> (+0.75%) ⬆️
...kkeeper/LegacyHierarchicalLedgerRangeIterator.java 75.90% <100.00%> (+3.75%) ⬆️
...ookkeeper/LongHierarchicalLedgerRangeIterator.java 81.66% <100.00%> (ø)
...he/pulsar/metadata/impl/AbstractMetadataStore.java 85.96% <ø> (+0.70%) ⬆️
...che/pulsar/metadata/impl/RocksdbMetadataStore.java 70.50% <ø> (-0.34%) ⬇️
...ta/impl/batching/AbstractBatchedMetadataStore.java 95.69% <ø> (ø)
...e/pulsar/metadata/impl/oxia/OxiaMetadataStore.java 87.83% <ø> (ø)
...sar/metadata/impl/FaultInjectionMetadataStore.java 74.35% <0.00%> (-4.02%) ⬇️

... and 109 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@lhotari
Copy link
Member

lhotari commented Aug 26, 2025

It's great that we are finally resolving this issue.

Workaround for ZooKeeper, by-passing the Pulsar Metadata driver for BookKeeper client and BookKeeper server-side components:
I faced this a few years ago. As a workaround at that time I introduced a way to by-pass the use of Pulsar Metadata Drivers in BookKeepers with this change: #17834. That prevents loading the drivers. In addition to that, it's necessary to configure BookKeeper to use direct ZK access without Pulsar Metadata Drivers. On the Broker (BookKeeper client), this requires setting bookkeeperMetadataServiceUri: "zk+hierarchical://.." and on BookKeeper components it's possible to set metadataServiceUri: "zk+hierarchical://... (and set the legacy values zkServers and zkLedgersRootPath to empty values since metadataServiceUri is the recommended way after BP-29).

In pulsar-helm-chart this is configurable and implemented this way:
broker side:
https://github.com/apache/pulsar-helm-chart/blob/ca43a3f1d1c160d0bb0127f3bafefcdffa7f422c/charts/pulsar/templates/broker-configmap.yaml#L40-L44
bookkeeper side:
https://github.com/apache/pulsar-helm-chart/blob/ca43a3f1d1c160d0bb0127f3bafefcdffa7f422c/charts/pulsar/templates/_bookkeeper.tpl#L134-L141

Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More changes are needed to resolve the issue since store.getChildren(path) will use a cache that is updated asynchronously.

Another problem could be the potential performance implications. We might need to make changes on BookKeeper side so that caching would be safe under certain conditions if good performance relies on caching. Many environments have become reliant on the high level of caching, at least this is my understanding of the matter.

@Technoboy- Technoboy- changed the title [fix][metadata] Fix missing call sync method causing data lost [fix][metadata] Use getChildrenFromStore to avoid lost data Aug 27, 2025
@Technoboy- Technoboy- changed the title [fix][metadata] Use getChildrenFromStore to avoid lost data [fix][metadata] Use getChildrenFromStore to read children data to avoid lost data Aug 27, 2025
@Technoboy- Technoboy- changed the title [fix][metadata] Use getChildrenFromStore to read children data to avoid lost data [fix][meta] Use getChildrenFromStore to read children data to avoid lost data Aug 27, 2025
Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@codelipenghui
Copy link
Contributor

/pulsarbot run-failure-checks

@Technoboy- Technoboy- merged commit 90a70db into apache:master Aug 28, 2025
96 of 98 checks passed
lhotari pushed a commit that referenced this pull request Aug 28, 2025
lhotari pushed a commit that referenced this pull request Aug 28, 2025
manas-ctds pushed a commit to datastax/pulsar that referenced this pull request Aug 28, 2025
… lost data (apache#24665)

(cherry picked from commit 90a70db)
(cherry picked from commit e4431bc)
Technoboy- added a commit to Technoboy-/pulsar that referenced this pull request Sep 10, 2025
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Sep 12, 2025
… lost data (apache#24665)

(cherry picked from commit 90a70db)
(cherry picked from commit e4431bc)
nborisov pushed a commit to nborisov/pulsar that referenced this pull request Sep 12, 2025
KannarFr pushed a commit to CleverCloud/pulsar that referenced this pull request Sep 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants