-
Notifications
You must be signed in to change notification settings - Fork 3.4k
HBASE-28801 WALs are not cleaned even after all entries are flushed #6179
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: branch-2
Are you sure you want to change the base?
Conversation
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
@Apache9 this is for 2.x versions. Would you like to take a look? |
Mind explaining more here? The old logic seems correct, only if there are no un flushed entries, we can clean the WAL file... |
@Apache9 I have observed in production WAL files not being cleaned for days from the roll over. This was due to the WALs not being marked for close when they had unflushed entries.
During clean up in cleanOldLogs
The above log line was emitted for wal files that were rolled over days ago. Once its not marked close we will not be able to clean up there by leading lot of files to be processed during SCP .
My understanding is it is safe to mark them closed even with unflushed entries as there is check in cleanup which will not let the wal file to be cleaned up. |
What I mean is that, the logic here is correct, so in general, if everything works as expected, you can not archive the file, even if you mark it as closed right? The later sequence id check should prevent you from deleting the file. If not, as you described here, the WAL file can be deleted after you mark it as closed, then there must be something wrong, as there is some unflushed entries but the sequence id accounting tells us there is nothing unflushed? Thanks. |
@Apache9 Lets consider this scenario |
@Apache9 can you please review |
So this is a problem when refactoring? IIRC we abstract some common logic to AbstractFSWAL on branch-2, but the implementation is still a bit different from master and branch-3. We need to check the refactoring patch to see if it changed some semantics... |
OK, in the old time, if there are unflushed entries, we will throw IOException out and abort the region server... When optimizing the close logic, we finally changed the code to the current situation. I think the intention here is that, if there are still unflushed entries after closing writer, there should be an exception thrown out which aborts the region server. But looking at the implementation of getUnflushedEntriesCount, since we do not block others threads from adding new entries to the ring buffer, the getUnflushedEntriesCount could be greater than 0 even if there are no errors, so I think we need to take a look at the whole logic again, the isUnflushedEntries is not what we want now I'm afraid... |
@@ -390,10 +390,10 @@ protected void doReplaceWriter(Path oldPath, Path newPath, Writer nextWriter) th | |||
try { | |||
closeWriter(this.writer, oldPath, true); | |||
} finally { | |||
// closing this as there is no other chance we can set close to true | |||
// during clean up we check for unflushed entries | |||
markClosedAndClean(oldPath); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Take a look at the code again, the isUnflushedEntries
should work as expected. The highestUnsyncedTxid
will only be increased in appendEntry method, so in the doReplaceWriter
, after arriving the safe point, the highestUnsyncedTxid
will not be changed.
I think there is another problem that in closeWriter, we will also check isUnflushedEntries and also increase the closeErrorCount, which may affect the logic here. But for normal case, closeWriter is executed in a background thread pool, so it is not safe to call isUnflushedEntries as it does not reflect the real state before closing... And why we put the close in background is that, even if it fails, it is not a big deal as we can make sure that all entries have already been flushed, a close failure only means we may fail to write the trailer.
I think here, we should only increase the closeErrorCount when closing in foreground, and also, when calling closeWriter, we should pass the result of isUnflushedEntries in, instead of calling it everytime when we want to check this state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Apache9 Thanks for reviewing.
The highestUnsyncedTxid will only be increased in appendEntry method, so in the doReplaceWriter, after arriving the safe point, the highestUnsyncedTxid will not be changed.
Is it safe to assume that the other threads (calling appendWrite) won't increment highestUnsyncedTxid till the close happens
I think here, we should only increase the closeErrorCount when closing in foreground, and also, when calling closeWriter, we should pass the result of isUnflushedEntries in, instead of calling it everytime when we want to check this state.
Yes that is correct accessing in the isUnflushedEntries in the background will cause issues
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you plan to fix the background close issue or you just want to add some logs in this issue and file new issues for addressing the background close issue?
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
@@ -393,6 +393,8 @@ protected void doReplaceWriter(Path oldPath, Path newPath, Writer nextWriter) th | |||
inflightWALClosures.remove(oldPath.getName()); | |||
if (!isUnflushedEntries()) { | |||
markClosedAndClean(oldPath); | |||
} else { | |||
LOG.debug("WAL has unflushed entries path: " + oldPath); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should you add some metrics to track this event? Is this PR about adding logs to help debugging? is there more to the fix?
No description provided.