-
Notifications
You must be signed in to change notification settings - Fork 3.4k
HBASE-28584 RS SIGSEGV under heavy replication load #6124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Deep clone the cells that are used to apply mutations on the local cluster. Some operations may still be in flight even as we fail to apply some other in-flight mutations and trigger failure handling including a release of the buffer underlying the cellScanner that is sourcing the cells.
🎊 +1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's get this in to fix the crash issue first.
IIRC, we have a similar problem when async wal is enabled. The rpc call will finish before we actually finish the WAL writing and cause we write out corrupt wal entries.
Let me check how we deal with the problem there. I guess the same trick can be used here too.
Thanks @apurtell for the analyzing.
OK, in ServerCall class, we have a retainByWAL method, where we will count the extra references of the ServerCall, mainly the CellScanners. I think we can just change the method to retain, which means we want to retain it for other usage even after the rpc call is done, and also use it here. Anyway, since the current PR has been well tested in producation, I think we can apply it first, and open another issue for optimizing. Thanks. |
🎊 +1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One minor nit though
* Deep clones the given cell if the cell supports deep cloning | ||
* @param cell the cell to be cloned | ||
* @return the cloned cell |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: javadoc to include @throws
Any updates here? Thanks. |
I think we should fix this in newer releases. If you are all OK, I could try to implement the reference counting way to solve the problem. @apurtell @virajjasani Thoughts? Thanks. |
I thought refcounting would be complex but am not opposed to it as a different solution. When and if we have that we could remove the copying. |
We would not need this change if #6263 solves the problem instead. |
Fixed by #6263 |
Clone the cells that are used to apply mutations on the local cluster. Some operations may still be in flight even as we fail to apply some other in-flight mutations and trigger failure handling including a release of the buffer underlying the cellScanner that is sourcing the cells.