-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce config to skip non-recoverable data-ledger #1046
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, just a couple of suggestions
@@ -51,6 +51,7 @@ | |||
private double throttleMarkDelete = 0; | |||
private long retentionTimeMs = 0; | |||
private long retentionSizeInMB = 0; | |||
private boolean skipNonRecoverableLedger; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about naming it automaticallySkipNonRecoverableData
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or abbreviated into auto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, I will change it.
@@ -1956,7 +1960,7 @@ boolean ledgerExists(long ledgerId) { | |||
return ledgers.get(ledgerId) != null; | |||
} | |||
|
|||
long getNextValidLedger(long ledgerId) { | |||
Long getNextValidLedger(long ledgerId) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why changing type? Do we need null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, because long
was causing NPE if we try to get next ledger after very last ledger in the list. at that time ledgers.ceilingKey(ledgerId + 1)
returns null and it gives NPE if we try to cast it in long so, changed the data type to Long and handling at caller.
switch (rc) { | ||
case BKException.Code.NoSuchLedgerExistsException: | ||
case BKException.Code.ReadException: | ||
case BKException.Code.LedgerRecoveryException: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be temporary, if multiple bookies are down and recovery is now failing, though it could succeed later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this may make sense to consider as isBkErrorNotRecoverable
for ManagedCursor and not for ManagedLedger.? I think we can create two separate methods isBkErrorNotRecoverable
for ManagedCursor and ManagedLedger as they may need different flexibility level?
public static boolean isBkErrorNotRecoverable(int rc) { | ||
switch (rc) { | ||
case BKException.Code.NoSuchLedgerExistsException: | ||
case BKException.Code.ReadException: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This I'm not 100% sure is non-recoverable, we need to check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I think we can remove it. let me verify in which case we can see ReadException
and we can remove it if we don't need it.
@merlimat addressed comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
334fcd2
to
2dbda65
Compare
…or (#4818) ### Motivation In #1046, we have added a flag (`autoSkipNonRecoverableData`) and mechanism to recover cursor if ledger data is deleted. However, expiery-monitor doesn't use that flag and it gets stuck when it finds non-recoverable ml-error while cleaning up expired message. ### Modification Expiry-monitor can skip non-recoverable managed-ledger exception (eg: data/ledger doesn't exist anymore) when `autoSkipNonRecoverableData` flag is enabled.
Motivation
Recently, we have seen that due to unexpected ledger deletion at BK, Broker was seeing NoSuchEntryException/NoLedgerException while reading data-ledger.
In this case, it requires manual cleanup to delete those ledgers from managed-ledger list and performing it for large number of topics across many brokers will be very complicated. In this case, those ledgers are not recoverable so, there should be a mechanism to skip non-recoverable data-ledgers from managed-ledger list.
Modifications
Result
Broker will be more resilient when it is not able to read non-recoverable data-ledger and it can avoid manual cleanup in cluster.