-
Notifications
You must be signed in to change notification settings - Fork 963
(@bug W-5100764@) Write Ledger Handle listens for metadata changes #1580
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Writer owns the metadata of the current ensemble ensemble only. Previous ensemble segments of the same ledger can freely modified by the replication worker. In the current code, write ledger handle, which allows reads also are blind sided by the non-disruptive ensemble changes performed by the replication worker. This could potentially direct readers to wrong destination, leading to unsuccessful reads. Fix this problem by placing a watcher on the zk node just like readOnlyLedgerHandle. When new metadata is received take the older (non-current) ensemble segment information and version number from the new metadata. Signed-off-by: Venkateswararao Jujjuri (JV) <vjujjuri@salesforce.com>
eolivelli
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very interesting case.
Nice fix.
+1
@ivankelly shall we merge this change before you proceed with your refactor on immutable metadata ?
|
@jvrao there are checkstyle issues, we must fix them before merging to master 2018-08-03T04:37:29.104 [INFO] Starting audit... |
|
@jvrao there are also compilation errors, maybe it is just enough to rebase to current master 2018-08-03\T\04:35:51.103 [ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.7.0:testCompile (default-testCompile) on project bookkeeper-server: Compilation failure: Compilation failure: |
ivankelly
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree that the write owns the current ensemble. The consistent metadata store owns it, the writer can only make suggestions to change it. The writer should never use the metadata until that exact copy has been stored on the metadata store, which is why having mutable metadata is so dangerous. Anyhow, a philosophical point, and unrelated to whether this change is good.
One thing to note is that this will trigger more load on zookeeper. For example, with Pulsar, you may have 100,000s topics each with a ledger open, that 100,000s new watches. So I would add a parameter to ClientConfiguration to make this optional, and off by default (to not add new load to unsuspecting users). @merlimat @sijie Opinions on this?
| } | ||
| if (Version.Occurred.BEFORE == occurred) { // the metadata is updated | ||
| try { | ||
| bk.getMainWorkerPool().executeOrdered(ledgerId, new MetadataMerger(newMetadata)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no need to merge. The metadata read from zookeeper must have the same last ensemble as the metadata currently being used, or else we're violating a whole load of properties. So in theory, you should be able to assign newMetadata to metadata. In practice it can be tricky with all the mutation that occurs while handling bookie failure. So leave the merge for now, but I will remove it once the ledger immutable metadata changes are in (I should have remaining patches up today or monday/tuesday next week)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ivankelly where are we now with the whole set of immutable changes?
There is another problem with this patch, the metadata is being accessed with and without lock in the code and that needs to be corrected too; may be covered as part of immutable changes. Also, we need to stop the writer as we discussed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| } | ||
|
|
||
| // Shutdown a bookie in the last ensemble and continue writing | ||
| ArrayList<BookieSocketAddress> ensemble1, ensemble2, ensemble1n; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ArrayList -> List // I changed this a couple of days ago
You are absolutely right. I will change my commit message. |
|
retest please |
I think it is a good idea to have this behavior controlled by a flag. |
|
I agree we should make this optional, at least in a first version. These days I am checking the reader part. Currently we add a watch for each reader, this is a waste of resources in some use case and it slows down reads. But this is another story, I will start a separate thread. There is a stale patch on ZK about 'persistent recursive watches' which will reduce a lot the expense of resources in cases like ours, but it is not finding enough support/consensus to be accepted. Maybe some of you could take a look. |
I am not sure it is a waste of resources and why it would slow down reads. the readers have to be notified with the ensemble changes, otherwise the readers will be stuck at tailing.
I am not sure a |
|
@sijie this is a very different use case from your experience with huge systems. The case in which that watch is not very useful is this:
So you are never hitting ensemble changes, no automatic re-replication (only manual and in case of bad errors). You can make many optimizations, like keeping open ledger handles in a cache, but those watches will have a cost and can be saved. For the case in which there is an ensemble change the reader ledger handle can be reopened so that metadata are reread. I will start another email thread, this is a bit off topic here |
@eolivelli sorry. please open a thread for that. but my point is the existing behavior on readonly has its reason being there. |
100% agree ! Regular BK usage needs that watch. |
|
retest this please |
1 similar comment
|
retest this please |
|
This patch needs a rebase. |
|
retest this please |
1 similar comment
|
retest this please |
|
@jvrao are you still working on this? |
|
fix old workflow,please see #3455 for detail |
|
closed by no updates. |
Writer owns the metadata of the current ensemble ensemble only.
Previous ensemble segments of the same ledger can freely modified
by the replication worker.
In the current code, write ledger handle, which allows reads also
are blind sided by the non-disruptive ensemble changes performed
by the replication worker. This could potentially direct readers
to wrong destination, leading to unsuccessful reads.
Fix this problem by placing a watcher on the zk node just like
readOnlyLedgerHandle. When new metadata is received take the
older (non-current) ensemble segment information and version
number from the new metadata.
Signed-off-by: Venkateswararao Jujjuri (JV) vjujjuri@salesforce.com
Descriptions of the changes in this PR:
Motivation
(Explain: why you're making that change, what is the problem you're trying to solve)
Changes
(Describe: what changes you have made)
Master Issue: #