
Conversation

@jvrao (Contributor) commented Aug 3, 2018

The writer owns the metadata of the current ensemble only.
Previous ensemble segments of the same ledger can be freely modified
by the replication worker.

In the current code, the write ledger handle, which also allows
reads, is blindsided by the non-disruptive ensemble changes performed
by the replication worker. This could potentially direct readers
to the wrong destination, leading to unsuccessful reads.

Fix this problem by placing a watcher on the zk node, just like
readOnlyLedgerHandle does. When new metadata is received, take the
older (non-current) ensemble segment information and the version
number from the new metadata.

Signed-off-by: Venkateswararao Jujjuri (JV) <vjujjuri@salesforce.com>






@eolivelli (Contributor) left a comment

Very interesting case.
Nice fix.
+1

@ivankelly shall we merge this change before you proceed with your refactor on immutable metadata?

@eolivelli (Contributor)

@jvrao there are checkstyle issues, we must fix them before merging to master

2018-08-03T04:37:29.104 [INFO] Starting audit...
[ERROR] /home/jenkins/jenkins-slave/workspace/bookkeeper_precommit_pullrequest_validation/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/LedgerHandle.java:62:8: Unused import: org.apache.bookkeeper.client.ReadOnlyLedgerHandle.MetadataUpdater. [UnusedImports]
[ERROR] /home/jenkins/jenkins-slave/workspace/bookkeeper_precommit_pullrequest_validation/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/ReadOnlyLedgerHandle.java:34:8: Unused import: org.apache.bookkeeper.proto.BookkeeperInternalCallbacks.LedgerMetadataListener. [UnusedImports]
[ERROR] /home/jenkins/jenkins-slave/workspace/bookkeeper_precommit_pullrequest_validation/bookkeeper-server/src/test/java/org/apache/bookkeeper/client/BookieWriteLedgerTest.java:58:8: Unused import: org.apache.bookkeeper.common.util.OrderedScheduler. [UnusedImports]
[ERROR] /home/jenkins/jenkins-slave/workspace/bookkeeper_precommit_pullrequest_validation/bookkeeper-server/src/test/java/org/apache/bookkeeper/client/BookieWriteLedgerTest.java:250: Line is longer than 120 characters (found 139). [LineLength]
[ERROR] /home/jenkins/jenkins-slave/workspace/bookkeeper_precommit_pullrequest_validation/bookkeeper-server/src/test/java/org/apache/bookkeeper/client/BookieWriteLedgerTest.java:251: Line is longer than 120 characters (found 132). [LineLength]
[ERROR] /home/jenkins/jenkins-slave/workspace/bookkeeper_precommit_pullrequest_validation/bookkeeper-server/src/test/java/org/apache/bookkeeper/client/BookieWriteLedgerTest.java:274: Line is longer than 120 characters (found 142). [LineLength]
[ERROR] /home/jenkins/jenkins-slave/workspace/bookkeeper_precommit_pullrequest_validation/bookkeeper-server/src/test/java/org/apache/bookkeeper/client/BookieWriteLedgerTest.java:284: Line is longer than 120 characters (found 130). [LineLength]

@eolivelli (Contributor)

@jvrao there are also compilation errors; maybe rebasing onto current master is enough

2018-08-03T04:35:51.103 [ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.7.0:testCompile (default-testCompile) on project bookkeeper-server: Compilation failure: Compilation failure:
2018-08-03T04:35:51.103 [ERROR] /home/jenkins/jenkins-slave/workspace/bookkeeper_precommit_remaining_tests/bookkeeper-server/src/test/java/org/apache/bookkeeper/client/BookieWriteLedgerTest.java:[230,70] incompatible types: capture#1 of ? extends java.util.List<org.apache.bookkeeper.net.BookieSocketAddress> cannot be converted to java.util.ArrayList<org.apache.bookkeeper.net.BookieSocketAddress>
2018-08-03T04:35:51.103 [ERROR] /home/jenkins/jenkins-slave/workspace/bookkeeper_precommit_remaining_tests/bookkeeper-server/src/test/java/org/apache/bookkeeper/client/BookieWriteLedgerTest.java:[243,43] incompatible types: java.util.List<org.apache.bookkeeper.net.BookieSocketAddress> cannot be converted to java.util.ArrayList<org.apache.bookkeeper.net.BookieSocketAddress>
2018-08-03T04:35:51.103 [ERROR] /home/jenkins/jenkins-slave/workspace/bookkeeper_precommit_remaining_tests/bookkeeper-server/src/test/java/org/apache/bookkeeper/client/BookieWriteLedgerTest.java:[248,97] incompatible types: capture#2 of ? extends java.util.List<org.apache.bookkeeper.net.BookieSocketAddress> cannot be converted to java.util.ArrayList<org.apache.bookkeeper.net.BookieSocketAddress>
2018-08-03T04:35:51.103 [ERROR] /home/jenkins/jenkins-slave/workspace/bookkeeper_precommit_remaining_tests/bookkeeper-server/src/test/java/org/apache/bookkeeper/client/BookieWriteLedgerTest.java:[272,70] incompatible types: capture#3 of ? extends java.util.List<org.apache.bookkeeper.net.BookieSocketAddress> cannot be converted to java.util.ArrayList<org.apache.bookkeeper.net.BookieSocketAddress>
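These "capture#N of ? extends List" errors come from a Java wildcard-capture rule: a value typed `? extends List<T>` can be assigned to a `List<T>` variable but never to an `ArrayList<T>` one. The sketch below (not BookKeeper code; the getter and class names are illustrative) shows the pattern — declaring the variable as the `List` interface resolves it:

```java
import java.util.Arrays;
import java.util.List;

// Minimal illustration of the capture-conversion failures above.
public class CaptureExample {
    // Stand-in for a metadata getter returning a wildcard-typed list of ensembles.
    static List<? extends List<String>> getEnsembles() {
        return Arrays.asList(Arrays.asList("bookie1:3181", "bookie2:3181"));
    }

    public static void main(String[] args) {
        // ArrayList<String> e = getEnsembles().get(0);  // does NOT compile: capture of
        //                                               // ? extends List<String> is not ArrayList<String>
        List<String> ensemble = getEnsembles().get(0);   // compiles: declare against the interface
        System.out.println(ensemble.size());
    }
}
```

This is the same change suggested in the inline review below ("ArrayList -> List").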

@ivankelly (Contributor) left a comment

I disagree that the writer owns the current ensemble. The consistent metadata store owns it; the writer can only make suggestions to change it. The writer should never use the metadata until that exact copy has been stored on the metadata store, which is why having mutable metadata is so dangerous. Anyhow, a philosophical point, and unrelated to whether this change is good.

One thing to note is that this will trigger more load on zookeeper. For example, with Pulsar, you may have 100,000s of topics each with a ledger open; that is 100,000s of new watches. So I would add a parameter to ClientConfiguration to make this optional, and off by default (to not add new load to unsuspecting users). @merlimat @sijie Opinions on this?
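Gating the writer-side watch behind a client setting, as suggested above, could look roughly like this sketch (the setting name `enableWriterMetadataWatch` and the simplified classes are hypothetical, not BookKeeper's actual ClientConfiguration API):

```java
import java.util.Properties;

// Hypothetical sketch: a client-side flag, defaulting to off, so that
// existing deployments see no extra ZooKeeper watch load.
class ClientConfigurationSketch {
    private final Properties props = new Properties();

    ClientConfigurationSketch setEnableWriterMetadataWatch(boolean enabled) {
        props.setProperty("enableWriterMetadataWatch", Boolean.toString(enabled));
        return this;
    }

    boolean getEnableWriterMetadataWatch() {
        // Default false: do not register new watches for unsuspecting users.
        return Boolean.parseBoolean(
                props.getProperty("enableWriterMetadataWatch", "false"));
    }
}

class WriterSketch {
    void maybeRegisterWatch(ClientConfigurationSketch conf) {
        if (conf.getEnableWriterMetadataWatch()) {
            System.out.println("registering metadata watch");
        } else {
            System.out.println("skipping metadata watch");
        }
    }
}
```

The design point is only that the default preserves current behavior; users who need writer-side metadata tracking opt in explicitly.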

    }
    if (Version.Occurred.BEFORE == occurred) { // the metadata is updated
        try {
            bk.getMainWorkerPool().executeOrdered(ledgerId, new MetadataMerger(newMetadata));
Contributor

There's no need to merge. The metadata read from zookeeper must have the same last ensemble as the metadata currently being used, or else we're violating a whole load of properties. So in theory, you should be able to assign newMetadata to metadata. In practice it can be tricky with all the mutation that occurs while handling bookie failure. So leave the merge for now, but I will remove it once the ledger immutable metadata changes are in (I should have the remaining patches up today or Monday/Tuesday next week)
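The merge the commit message describes — adopt the replicated (non-current) segments and the version from ZooKeeper, keep the writer's own current ensemble — can be sketched roughly as follows (simplified types, not BookKeeper's actual LedgerMetadata/MetadataMerger API):

```java
import java.util.List;
import java.util.TreeMap;

// Rough sketch of the writer-side merge: non-current ensemble segments and
// the version come from the freshly read metadata, while the current (last)
// ensemble segment stays the writer's own.
class MetadataSketch {
    TreeMap<Long, List<String>> ensembles = new TreeMap<>(); // firstEntryId -> bookies
    long version;

    void mergeFrom(MetadataSketch newMeta) {
        long currentKey = ensembles.lastKey();              // writer-owned segment
        List<String> currentEnsemble = ensembles.get(currentKey);

        ensembles.clear();
        ensembles.putAll(newMeta.ensembles);                // adopt replicated segments
        ensembles.put(currentKey, currentEnsemble);         // keep writer's current ensemble
        version = newMeta.version;                          // adopt the newer version
    }
}
```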

Contributor Author

@ivankelly where are we now with the whole set of immutable changes?
There is another problem with this patch: the metadata is being accessed both with and without the lock in the code, and that needs to be corrected too; it may be covered as part of the immutable changes. Also, we need to stop the writer as we discussed.

Contributor

@jvrao the stack is blocked, waiting on #1589, but otherwise the change is pretty much ready.

}

// Shutdown a bookie in the last ensemble and continue writing
ArrayList<BookieSocketAddress> ensemble1, ensemble2, ensemble1n;
Contributor

ArrayList -> List // I changed this a couple of days ago

@jvrao (Contributor Author) commented Aug 3, 2018

Anyhow, a philosophical point, and unrelated to whether this change is good.

You are absolutely right. I will change my commit message.

@jvrao jvrao requested review from athanatos and sijie August 3, 2018 16:45
@jvrao (Contributor Author) commented Aug 4, 2018

retest please

@sijie (Member) commented Aug 4, 2018

One thing to note is that this will trigger more load on zookeeper. For example, with Pulsar, you may have 100,000s topics each with a ledger open, that 100,000s new watches. So I would add a parameter to ClientConfiguration to make this optional, and off by default (to not add new load to unsuspecting users). @merlimat @sijie Opinions on this?

I think it is a good idea to have this behavior controlled by a flag.

@eolivelli (Contributor)

I agree we should make this optional, at least in the first version.
Maybe in the future we can activate it by default.

These days I am checking the reader part. Currently we add a watch for each reader; this is a waste of resources in some use cases, and it slows down reads. But this is another story; I will start a separate thread.

There is a stale patch on ZK about 'persistent recursive watches' which would greatly reduce the resource cost in cases like ours, but it has not found enough support/consensus to be accepted. Maybe some of you could take a look.

apache/zookeeper#136

@sijie (Member) commented Aug 6, 2018

Currently we add a watch for each reader, this is a waste of resources in some use case and it slows down reads.

I am not sure it is a waste of resources, or why it would slow down reads.

The readers have to be notified of ensemble changes; otherwise the readers will get stuck while tailing.

There is a stale patch on ZK about 'persistent recursive watches' which will reduce a lot the expense of resources in cases like ours,

I am not sure recursive watches will help in this case, because a reader only cares about the ledgers it is interested in; it doesn't care about all the ledgers. That is to say, if you have thousands of clients and each reader cares about only one ledger, how would "persistent recursive watches" help this situation?

@eolivelli (Contributor)

@sijie this is a very different use case from your experience with huge systems.

The case in which that watch is not very useful is this:

  • You have only 1 or 2 bookies in the cluster.
  • You are using BK for storing blobs for a very long time.

So you are never hitting ensemble changes, and there is no automatic re-replication (only manual, and in case of bad errors).
Ledger metadata is very cold; it never changes, so that watch will not be useful.
If you are randomly opening many ledgers and then closing them, you will create a lot of useless watches.

You can make many optimizations, like keeping open ledger handles in a cache, but those watches still have a cost and can be saved.

For the case in which there is an ensemble change, the reader ledger handle can be reopened so that the metadata is reread.

I will start another email thread; this is a bit off topic here.

@sijie (Member) commented Aug 6, 2018

@sijie this is a very different use case from your experience with huge systems.

@eolivelli sorry, please open a thread for that. But my point is that the existing behavior on readonly has its reason for being there.

@eolivelli (Contributor)

@sijie

the existing behavior on readonly has its reason being there.

100% agree! Regular BK usage needs that watch.

@eolivelli (Contributor)

retest this please


@eolivelli (Contributor)

This patch needs a rebase.
It is an important change.

@jvrao (Contributor Author) commented Jan 3, 2019

retest this please

@eolivelli (Contributor)

retest this please

@sijie (Member) commented Feb 21, 2019

@jvrao are you still working on this?

@eolivelli (Contributor)

@jvrao have you committed this patch into the Salesforce fork?

cc @dlg99

@athanatos athanatos removed their request for review June 21, 2022 21:32
@StevenLuMT (Member)

Fix old workflow; please see #3455 for details.

@hezhangjian (Member)

Closed due to no updates.

@hezhangjian hezhangjian closed this May 2, 2024