BookKeeper switched to readonly after FileInfoDeletedException #1919

Open
aluccaroni opened this issue Jan 30, 2019 · 12 comments

Comments

@aluccaroni
Contributor

BUG REPORT

  1. Please describe the issue you observed:
    During normal operation of a cluster of 3 bookies, one switched to READONLY mode. We usually see this kind of error when a disk becomes full, but this time we found a "FileInfoDeletedException" in the logs. We restarted the bookie and everything returned to normal.

Apache BookKeeper 4.7.3
Java 11.0.2+7

  • What did you do?
    n/a

  • What did you expect to see?
    no error/no readonly mode

  • What did you see instead?
    The Bookie switched to readonly mode
    See stacktrace inside org.apache.bookkeeper.bookie.SortedLedgerStorage

19-01-30-09-48-40 org.apache.bookkeeper.bookie.FileInfo$FileInfoDeletedException: FileInfo already deleted
org.apache.bookkeeper.bookie.FileInfo$FileInfoDeletedException: FileInfo already deleted
    at org.apache.bookkeeper.bookie.FileInfo.checkOpen(FileInfo.java:248)
    at org.apache.bookkeeper.bookie.FileInfo.checkOpen(FileInfo.java:242)
    at org.apache.bookkeeper.bookie.FileInfo.size(FileInfo.java:342)
    at org.apache.bookkeeper.bookie.IndexPersistenceMgr.updatePage(IndexPersistenceMgr.java:643)
    at org.apache.bookkeeper.bookie.IndexInMemPageMgr.grabLedgerEntryPage(IndexInMemPageMgr.java:470)
    at org.apache.bookkeeper.bookie.IndexInMemPageMgr.getLedgerEntryPage(IndexInMemPageMgr.java:435)
    at org.apache.bookkeeper.bookie.IndexInMemPageMgr.putEntryOffset(IndexInMemPageMgr.java:594)
    at org.apache.bookkeeper.bookie.LedgerCacheImpl.putEntryOffset(LedgerCacheImpl.java:96)
    at org.apache.bookkeeper.bookie.InterleavedLedgerStorage.processEntry(InterleavedLedgerStorage.java:433)
    at org.apache.bookkeeper.bookie.SortedLedgerStorage.process(SortedLedgerStorage.java:184)
    at org.apache.bookkeeper.bookie.EntryMemTable.flushSnapshot(EntryMemTable.java:251)
    at org.apache.bookkeeper.bookie.EntryMemTable.flush(EntryMemTable.java:205)
    at org.apache.bookkeeper.bookie.SortedLedgerStorage$1.run(SortedLedgerStorage.java:213)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)

@eolivelli
Contributor

eolivelli commented Jan 31, 2019

@athanatos @jvrao @reddycharan I think that this exception comes from one of your recent commits. Have you already seen this error?

Cc @sijie

@athanatos

@eolivelli Haven't seen it. FileInfoDeletedException was added with the local consistency checker to indicate a call on a FileInfo after the gc had removed the ledger. In this case, I think SortedLedgerStorage is flushing entries for a ledger which no longer exists. What exception should be propagated in that case?

@eolivelli
Contributor

Thank you for checking.

The main problem is that the bookie switched to readonly mode.

We are using the official 4.7.3 release.

@athanatos

@eolivelli Yeah, I was wrong on that one. I apparently added that bit to address a different race condition and later reused it. I think the bug is that grabCleanPage shouldn't be able to obtain a page in that state. I'm taking a look.

@eolivelli
Contributor

@athanatos thank you

@athanatos

@eolivelli How many times has this occurred?

@eolivelli
Contributor

I'll leave the answer to @aluccaroni.
I don't know for sure, but I think they saw it only once.

@athanatos

Looks like EntryMemTable.flushSnapshot will tolerate a NoLedgerException. Prior to my patch, I think this race would have resulted in a ChannelClosedException, in which case the bookie would still have transitioned to RO. I think the right answer is for putEntryOffset to translate the FileInfoDeletedException into a NoLedgerException.
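
For illustration only, here is a minimal, self-contained sketch of that translation idea. It is not the BookKeeper code: `FlushRaceSketch`, `FileInfoStub`, and the simplified `putEntryOffset` / `flushSnapshot` signatures are hypothetical stand-ins, and the two exception classes are local copies that only borrow the names used in this thread. It assumes that the flush loop skips entries whose ledger is gone and treats any other failure as fatal.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these classes are the real BookKeeper classes.
final class FlushRaceSketch {

    /** Stand-in for "the bookie no longer knows this ledger". */
    static class NoLedgerException extends IOException {
        NoLedgerException(long ledgerId) {
            super("Ledger " + ledgerId + " not found");
        }
    }

    /** Stand-in for "the index file was used after gc deleted it". */
    static class FileInfoDeletedException extends IOException {
        FileInfoDeletedException() {
            super("FileInfo already deleted");
        }
    }

    /** Simplified index file handle: every access after delete() fails. */
    static class FileInfoStub {
        private boolean deleted;

        void delete() {
            deleted = true;
        }

        void checkOpen() throws FileInfoDeletedException {
            if (deleted) {
                throw new FileInfoDeletedException();
            }
        }
    }

    /** Index write path: report a deleted file as a missing ledger. */
    static void putEntryOffset(Map<Long, FileInfoStub> index, long ledgerId, long entryId)
            throws NoLedgerException {
        FileInfoStub fi = index.get(ledgerId);
        if (fi == null) {
            throw new NoLedgerException(ledgerId);
        }
        try {
            fi.checkOpen();
            // A real index would record the offset of (ledgerId, entryId) here.
        } catch (FileInfoDeletedException e) {
            // The ledger was garbage collected while its entries were still buffered.
            throw new NoLedgerException(ledgerId);
        }
    }

    /** Flush loop: skip entries of deleted ledgers, count the ones that were indexed. */
    static int flushSnapshot(Map<Long, FileInfoStub> index, long[][] entries) {
        int flushed = 0;
        for (long[] entry : entries) {
            try {
                putEntryOffset(index, entry[0], entry[1]);
                flushed++;
            } catch (NoLedgerException ignored) {
                // Tolerated: nothing to persist for a ledger that no longer exists.
            }
        }
        return flushed;
    }

    public static void main(String[] args) {
        Map<Long, FileInfoStub> index = new HashMap<>();
        index.put(1L, new FileInfoStub());
        FileInfoStub collected = new FileInfoStub();
        collected.delete(); // simulate gc removing ledger 2 before the flush
        index.put(2L, collected);

        long[][] entries = { {1L, 0L}, {2L, 0L}, {1L, 1L} };
        System.out.println("flushed=" + flushSnapshot(index, entries)); // prints flushed=2
    }
}
```

In this sketch the flush completes and simply drops the entry for the collected ledger, so nothing propagates far enough to force a read-only transition. Whether the real flush path should behave exactly this way is a question for the actual patch.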

@aluccaroni
Contributor Author

@athanatos we have seen it only once in the 3 weeks since we put v4.7.3 into production.

@ivankelly
Contributor

@athanatos are you working on a fix for this?

@athanatos

Yeah, sorry, was on vacation last week. I've got a patch I'll put up today.

@athanatos

#1950
