HBASE-24833: Bootstrap should not delete the META table directory if … #2237
@@ -915,6 +917,11 @@ private void finishActiveMasterInitialization(MonitoredTask status)
      this.tableDescriptors.getAll();
    }

    // check cluster Id stored in ZNode before, and use it to indicate if a cluster has been
    // restarted with an existing Zookeeper quorum.
    isClusterRestartWithExistingZNodes =
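For orientation, a minimal sketch of how such a flag could be derived from the clusterId znode using the plain ZooKeeper client; the znode path and the helper name here are illustrative assumptions, not the exact HBase internals:

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public final class ClusterRestartCheck {
  // Assumed default clusterId znode path; the real path depends on zookeeper.znode.parent.
  private static final String CLUSTER_ID_ZNODE = "/hbase/hbaseid";

  /**
   * Returns true if a clusterId znode already exists, i.e. the master is starting
   * against an existing ZooKeeper quorum rather than bootstrapping a new cluster.
   */
  public static boolean isRestartWithExistingZNodes(ZooKeeper zk)
      throws KeeperException, InterruptedException {
    return zk.exists(CLUSTER_ID_ZNODE, false) != null;
  }
}
```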
Hm... This check alone is not enough, I fear.. Say there is a cluster recreate over existing data and so no zk node for clusterId. We will get true here. In the next lines, we will create and write the zk node. Assume that immediately after that, the HM restarted. Now after restart, when we come here, we have zk data for clusterId, and we will end up deleting the Meta dir !!
After a cluster recreate we will come to know this boolean value before creation of the zk node. So we need to record that info somehow, somewhere, so that it survives even if there is a restart before we reach the Meta bootstrap. That info should be there in InitMetaProcedure also.. If there is an HM restart after the submit of this proc, the next time around we still know the boolean status.
Comments?
Thanks Anoop. If I get your suggestion correctly, should we add a master procedure (or a state in InitMetaProcedure) which records this boolean? This carries the assumption that the WAL still exists when the cluster is recreated.
Ya, at proc level.. But before that itself, one more thing.
Say there is a cluster recreate over existing data and so no zk node for clusterId. We will get true for 'isClusterRestartWithExistingZNodes'. In the next lines, we will create and write the zk node for clusterId. Now assume that after executing those lines, the HM restarted. So the zk node was created but the InitMetaProc was NOT submitted. Now after restart, when we come here, we have zk data for clusterId, so 'isClusterRestartWithExistingZNodes' will become false. This time the InitMetaProc gets started and as part of that we will end up deleting the Meta dir.
So this says we need to keep this boolean info somewhere once we find it, even before creating the zk node for ClusterId. Am I making the concern clear this time?
So this says we need to keep this boolean info somewhere once we find it, even before creating the zk node for ClusterId. Am I making the concern clear this time?

ack, and I got the concern.
but before I move to implement the proc-level information, let's wait a bit on @Apache9 for the question on META's tableInfo and partial meta. The ideal case is that we may be able to use FSTableDescriptors.getTableDescriptorFromFs(): if it is found and can be read, that indicates meta is not partial.
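A rough sketch of what such a tableinfo-based check might look like; it assumes the FSTableDescriptors.getTableDescriptorFromFs(fs, tableDir) overload and simply treats a readable descriptor as "not partial":

```java
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.util.FSTableDescriptors;

public final class MetaDescriptorCheckSketch {
  /**
   * Returns true if the meta table directory already has a readable tableinfo,
   * i.e. the earlier bootstrap most likely completed and meta is not "partial".
   */
  public static boolean hasReadableMetaDescriptor(FileSystem fs, Path metaTableDir) {
    try {
      TableDescriptor td = FSTableDescriptors.getTableDescriptorFromFs(fs, metaTableDir);
      return td != null;
    } catch (IOException e) {
      // Missing or unreadable tableinfo: treat meta as partial.
      return false;
    }
  }
}
```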
I still feel a bit strange that we should deal with this in the normal code path... Can we add a new HBCK option for this?
Thanks @Apache9, I want to agree with you on having an HBCK option, but one concern keeps me struggling about making this automated instead of an HBCK option. If an HBase cluster has hundreds of tables with thousands of regions, how would the operator recover the cluster? Does he/she (offline/online) repair the meta table by scanning the storage of each region? (Instead, could we just load the meta without rebuilding it?) Tbh, I felt bad bringing up this meta table issue because a normal HBase cluster does not assume Zookeeper (and WAL) could be gone after the cluster starts and restarts. [updated] For this PR/JIRA, mainly, I'm questioning what a "partial" meta means.
Ya, the biggest issue now is the META FS dir delete which happens after such a cluster recreate. An HBCK kind of tool to recreate the META data is too much IMO. Also we have a lot of other metadata in META too. So IMHO we should somehow avoid this META delete. After a cluster recreate this happens because in zk we don't have a META location and also there is no SCP in MasterProcWal targeting an RS which was holding META. So ya, the same Q.. What is the partial META we refer to? As part of META bootstrap we create the META tableInfo. That is an atomic op. So how can it be partial here? Sorry if I am missing something.. Did not check the code in detail.
I think for the scenario here, we just need to write the cluster id and other things to zookeeper? Just make sure that the current code in HBase will not consider us as a fresh new cluster. We do not need to rebuild meta?
first of all, thanks Duo again.
So, let me confirm your suggestion: that means if we add one more field in the ZNode, e.g. a boolean, as a follow-up, if ZK ZNode data is used to determine whether this is a fresh new cluster, can we skip deleting the meta directory if
do you mean if we step back to just think about the definition of

So, again, IMO data content in a table is sensitive ([updated] sorry if you guys think data in the meta table is not sensitive); I'm proposing not to delete the meta directory if possible (otherwise it's also like running an hbck to delete and rebuild). Based on our discussion here, IMO we have two proposals mentioned to define

seems 2) + *) should be the simplest solution, what do you guys think?
I do not think you fully understand what I proposed from the beginning of the related topic. Honestly I do not know what you are going to fix here; what is the ZNode you want to add a boolean to? My suggestion has always been to introduce a new HBCK option to reconstruct the content on zookeeper, so we know that this is not a fresh new cluster, and we will not schedule InitMetaProcedure. So for InitMetaProcedure itself, we do not need to consider whether the meta region is partial or not. You can see the code in finishActiveMasterInitialization and also the code in AssignmentManager.start. In AssignmentManager.start, we will try to load the state of the meta region from zookeeper, and if there is none, we will create a region node in offline state. And in finishActiveMasterInitialization, if we find that the state of the meta region node is offline, we will schedule InitMetaProcedure. Is this clear enough? Thanks.
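To make the flow described here easier to follow, a condensed, self-contained sketch; the class and field names are illustrative assumptions, not the actual HBase master code:

```java
import java.util.Optional;

// Condensed illustration of the startup logic described above:
// AssignmentManager.start() loads the meta region state from ZooKeeper (creating
// an OFFLINE node if there is none), and finishActiveMasterInitialization()
// schedules InitMetaProcedure only when that state is OFFLINE.
enum MetaRegionState { OFFLINE, OPEN }

final class StartupFlowSketch {
  /** Stand-in for the meta-region state znode; empty means "no data on ZK". */
  private Optional<MetaRegionState> metaStateZnode = Optional.empty();

  MetaRegionState loadOrCreateMetaRegionNode() {
    if (metaStateZnode.isEmpty()) {
      // No znode: from the master's point of view this looks like a fresh cluster.
      metaStateZnode = Optional.of(MetaRegionState.OFFLINE);
    }
    return metaStateZnode.get();
  }

  boolean shouldScheduleInitMeta() {
    return loadOrCreateMetaRegionNode() == MetaRegionState.OFFLINE;
  }
}
```

This is why wiping the ZooKeeper data makes the master treat an existing cluster as brand new and re-run the meta bootstrap.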
Thanks Duo.. I was about to write down the flow of why we do InitMeta.. In the cluster-recreate case @taklwu describes, the WAL data itself is not there, as well as zk. So as you said, if one has to do this, then they have to do 2 things
To your Q on the status on zk for clusterId @taklwu: the clusterId stored in zk is itself a PB'ed message, so we can always add an optional boolean field there, and change it later too. So no need for new nodes or anything.. But let's wait for a conclusion on the direction. We still don't have consensus on whether to add this as a top-class HBase thing itself (bootstrapping the cluster over existing data). If it is a top-class feature, we don't have to do many tricks to satisfy such needs. IMHO that is worth adding, as everyone can benefit.
OK, so your point is to add a feature where we could start a cluster with no data on zk? I would vote -1 here, as I do not think it is possible. And even if we could make it work for now, we cannot force all later developers to keep in mind that all data on zk are not stable and that you should not depend on them... It does not make sense. And on the InitMetaProcedure, maybe we could avoid deleting the whole meta if there are HFiles in it. This is a possible improvement, to avoid damaging the cluster if there is something wrong with zk. As for InitMetaProcedure, if we are at the first state, we can make sure that only directories have been created and there is no actual data? Anyway, even if we add this check, I think we should fail the procedure and fail the startup, as we know that there is something wrong. Users should use HBCK or other tools to fix the inconsistency before moving forward. Thanks.
The biggest issue now is the META dir delete.. In one direction we are trying NOT to keep any persisted data in zk, no? Till 2.1.7, even for the META table, we never had an issue even if that data was not there.. The bootstrap code was assigning the existing META table to a new RS. Some way or the other, I don't mind which, we need a solution for this. As per the existing code itself, I am not sure why we need to delete the entire dir under the assumption of a partial bootstrap. There is a dir creation step and then the tableInfo create.. The tableInfo file create is atomic too. Every step we check if the dir/file exists. Correct? If this META delete was not there, things would have been easy. We have supported cluster drop and create with versions <2.1.6.. But from 2.1.7 this is very different because of this META delete.
For 2.2+ we have a completely different way to do master initialization. This is exactly what I said above: it is not easy to force all the developers to follow the rule that the data on zookeeper are not stable. Maybe it can work for now, but maybe it will be broken in the future...
And for the problem here, it is why we introduced the InitMetaProcedure. For InitMetaProcedure itself, the assumption is that it will do the initialization work; it would be very strange that we schedule an 'InitMeta' procedure but we already have a meta in place. Why not just not schedule the InitMetaProcedure in the first place?
InitMetaProcedure was there before also.. Only in 2.1.7 was this delete of the META dir added. I am asking about that only, @Apache9. That is the major concern. The init proc previously did the FS init only if the meta dir was not yet there, and then initialized the assign of META in the cluster. Now it will do both, whatever the current status of the meta dir. The request is a change to that. Is that something you veto?
I do not see why we need to change this. Consider the usage of InitMetaProcedure: if there is a meta directory there, it must be 'partial'. And even if it is not partial, we are still safe to delete it, as there should be no data in it. If this is not the case, we must have an inconsistency in our cluster. For safety, I think we could skip the deletion of the meta directory, but we should still fail the procedure and thus fail the initialization of the master. Users should fix the inconsistency before starting again.
Anyway, let's get back to the initial goal? I think it is to support restarting a cluster with nothing on zookeeper? Let's have an overall design first. As I said above, it is not easy to force all developers to follow this rule. If we want to do this, we need to find a suitable way to make later developers aware that they should not depend on the data on zookeeper...
I agree with you that we may not be ready to totally stop relying on the data stored on zookeeper; that's definitely a broader discussion on what HBase currently depends on Zookeeper for (branch-2 and master), especially if data on Zookeeper could be ephemeral or removed. (I thought we're in the process of moving data into the ROOT region, aren't we? e.g. Proc-v2) Also, my initial goal is that the meta data/directory should not be deleted if possible, and we're trying to provide a persisted condition to not always delete meta if it's not
sorry that I may be a newbie on proc-v2 and zk data; should we start a thread on the dev@ list to discuss the following? (my goal is to find a consensus on how we can move this PR further)
I think we'd better start from a high-level design on how we could start a cluster with no data on zookeeper. My proposal is still to introduce a new option in HBCK2 to reconstruct the data on zookeeper before you actually start the cluster. If you guys think it is possible to do this in the normal code path, please give an overall design. Open a google doc? So others could comment on it? For deleting meta, I agree that we should not delete it if it is not 'partial'. But as I said above many times, if there is no inconsistency, scheduling an InitMetaProcedure implies that we do not have a meta table yet, so in InitMetaProcedure it is always safe to delete the meta directory. So even if we add a check in InitMetaProcedure to determine whether the meta table is 'really' partial, the result should be an ERROR log telling users something critical has happened, please fix the inconsistency before moving forward. I do not think this is what you guys want here? As you want to start the HMaster and recover from the inconsistency?
IMO the title of the google design doc may be about the cloud use cases that have been restarting on just HFiles without WAL and without Zookeeper (but where all the user tables are flushed and disabled before terminating the cluster). I know that may not be mentioned in the book/tutorial, and it may be a good time to clarify how those cases actually work; some users have been relying on them in HBase 1.4.x and maybe HBase before 2.1.7. Then we can see what the gaps may be now in [updated] branch-2.4+ (we don't need that in 2.3.0 and 2.2.5 because
what does 'inconsistency' mean here?
The inconsistency is between the data on the filesystem and zookeeper. In general, after the meta table has been initialized, it should have a directory on the file system, and also a znode to indicate its state on zookeeper. The problem here is that, in your scenario, the data on zookeeper are all gone. This is what I call 'inconsistency'. And I've also mentioned several times that the fact it works for several versions does not mean it will work for all versions. They are just different implementation choices, and you cannot force developers not to use the data on zookeeper to determine what to do next. So I do not think it is worth asking on the dev list about 'whether' InitMetaProcedure should delete meta without checking 'partial' or not. The problem here is that you delete the state znode for the meta region on zookeeper, not a problem in InitMetaProcedure. Either you add it back through HBCK2, or you should have an overall design on how we could fix the problem without using HBCK2.
I apologize for the dev@ email, but I was thinking differently overnight about your suggestion (sorry that I reread it many times until I found the gap this morning).
Didn't the coming-up master region that stores the meta location in HBASE-24408 and PR#1746 solve our conflict of interests, in that we don't need to rely on ZK for getting the server name (old host) for the meta region? Such that even if we don't have the ZK, we can move on and not submit the InitMetaProcedure, because the state of the meta region is not
If you confirm the above, I may say that bringing this PR and keeping the zookeeper discussion highlighted was my mistake, and I should have learnt about the master region ahead of this PR. (Then we just need to move to the coming-up version, and we can still restart for the cloud use cases.) [update] I will follow your suggestion and throw an exception if we find the meta table is not empty/partial.
HBASE-24388 is only on a feature branch and we are still discussing whether to go this way... There is another solution to introduce a general root table; then we will have an InitRootProcedure and the state of the root region will still be on zookeeper. Of course I'm a sponsor of HBASE-24388 as it is implemented by me...
Yes, let's do this first to prevent damaging a cluster. And in general, I do not think it is very hard to introduce a tool for your scenario? Just putting a dummy meta state node on zookeeper can solve the problem. And for the regions, as said before, you'd better disable them all before shutting down the old cluster, and after starting the new cluster, enable them. Even if you do not do this, I think you could just scan the meta regions to find out all the recorded region servers and schedule HBCKSCP for them to get the cluster back.
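As a rough illustration of the "scan meta for the recorded region servers" step mentioned here, a hedged sketch using the public client API; it assumes the info:server column carries the last known location and leaves the actual HBCKSCP scheduling out:

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public final class RecordedServersSketch {
  /** Collect region server names recorded in hbase:meta (info:server). */
  public static Set<String> recordedServers(Configuration conf) throws IOException {
    byte[] info = Bytes.toBytes("info");
    byte[] server = Bytes.toBytes("server");
    Set<String> servers = new HashSet<>();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table meta = conn.getTable(TableName.META_TABLE_NAME)) {
      Scan scan = new Scan().addColumn(info, server);
      for (Result r : meta.getScanner(scan)) {
        byte[] value = r.getValue(info, server);
        if (value != null) {
          servers.add(Bytes.toString(value));
        }
      }
    }
    return servers;
  }
}
```

Each server name found this way could then be handed to an HBCK2-style repair that schedules server crash recovery for those hosts, as suggested above.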
if (!isPartial) {
  throw new IOException("Meta table is not partial, please sideline this meta directory "
    + "or run HBCK to fix this meta table, e.g. rebuild the server hostname in ZNode for the "
    + "meta region");
}
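For context, a minimal sketch of how a flag like isPartial above could be derived by looking for data files under the meta table directory; the recursive listing and the name-based filter are simplifying assumptions, not the exact check used in this PR:

```java
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public final class MetaDirCheckSketch {
  /** Meta is treated as "partial" only if its directory holds no data files yet. */
  public static boolean isMetaDirPartial(FileSystem fs, Path metaTableDir) throws IOException {
    if (!fs.exists(metaTableDir)) {
      return true; // nothing bootstrapped yet
    }
    RemoteIterator<LocatedFileStatus> files = fs.listFiles(metaTableDir, true);
    while (files.hasNext()) {
      LocatedFileStatus f = files.next();
      // Ignore bookkeeping entries such as .tableinfo/.regioninfo; anything else
      // under the table directory is assumed to be a store file (HFile).
      if (f.isFile() && !f.getPath().getName().startsWith(".")) {
        return false; // found what looks like data -> not partial
      }
    }
    return true;
  }
}
```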
@Apache9, we're failing the InitMetaProcedure with an IOException, and HMaster will fail the master startup if InitMetaProcedure is FAILED with an exception.
Still, alternatively, we could continue the bootstrap without throwing (but this is not good, as you recommended).
So, do you think this change aligns with your comments?
In general the approach is fine. The only concern is the implementation. I do not think InitMetaProcedure supports rollback, so what will happen if we throw an exception here? You will just see an ERROR log in the output saying that the procedure does not support rollback?
[updated] Yeah, I saw the UnsupportedOperationException and the timeout in the unit test logs, and the master didn't stop, so I added a section in HMaster to catch this exception and fail the master startup with another IOException.
Did I do it wrong? Or any idea how I can fail the procedure right?
// wait meta to be initialized after we start procedure executor
if (initMetaProc != null) {
  initMetaProc.await();
  if (initMetaProc.isFailed() && initMetaProc.hasException()) {
    throw new IOException("Failed to initialize meta table", initMetaProc.getException());
  }
}
I would worry that the ProcedureExecutor is going to resubmit the IMP again. You might have to, instead, let IMP finish without an error and determine via some side-effect (e.g. set a variable in HMaster?) whether we bailed out of IMP, and then fail at that level. Truthfully, I don't remember what happens by default (i.e. whether all procedures which throw an exception are implicitly retried).
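A hypothetical sketch of the "side-effect" approach floated here, letting the procedure finish cleanly and failing at the master level instead; none of these class or field names come from the actual code:

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

// The procedure completes normally (so the executor never tries to roll it back
// or retry it), but records that it bailed out on an existing meta directory;
// the master then checks the flag after awaiting the procedure.
final class MetaBootstrapOutcome {
  final AtomicBoolean bailedOutOnExistingMeta = new AtomicBoolean(false);
}

final class MasterStartupCheckSketch {
  void afterInitMetaCompletes(MetaBootstrapOutcome outcome) throws IOException {
    if (outcome.bailedOutOnExistingMeta.get()) {
      // Fail startup here rather than inside the procedure itself.
      throw new IOException("Meta directory already contains data; sideline it "
        + "or repair the cluster before restarting the master");
    }
  }
}
```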
I have been watching this for a while, and I still don't have good clarity on the following case.
@Apache9 Can you explain how a partial meta table will damage a cluster? I have been looking at the code for InitMetaProcedure and the only thing I can see that is not idempotent is the directory creation for the NS (and the delete of the meta table, but that is being used as the fix for making things idempotent). If InitMetaProcedure is idempotent, there should be no need to delete the directory. If we restart a cluster and schedule an InitMetaProcedure:
After that case, if there is any inconsistency in the meta table, it can be handled via HBCK as usual. A couple of tenets we should always have:
Anyway, it is better than just removing meta with data. Let's do this first.
We can see how to improve this in the future.
I am currently -1 on the implementation pending answers to my questions. I have not been able to find any valid failure modes that will cause data loss or corrupting actions if hbase:meta is onlined. If this is true, we can make InitMetaProcedure idempotent (see above) and we can handle any InitMetaProcedure failure condition. In my investigation I was actually looking at the master branch which contains an additional state for INIT_META_CREATE_NAMESPACES. There are additional bugs associated with this approach that I will open JIRAs for. I will not bring the additional complexity of folding the namespace table into the meta table into this discussion.
so..... how can we get consensus on this PR or this series of idempotency issues (for InitMetaProcedure)? I don't mind breaking them into more tasks, as Zach has created those follow-up bugs (HBASE-24922 and HBASE-24923), but I don't see clear agreement among everyone on whether we should continue the bootstrap or fail hard on the bootstrap when we find the meta table in InitMetaProcedure. How do we move further?
@z-york Is a clumsy operator deleting the meta location znode by mistake a valid failure mode? Throw in a Master restart soon after (Will IMP run? If so, double assign of meta if the rest of the cluster was up at the time, which can make for data loss as the two meta Regions fight over what hfiles make up the meta Region). The data loss will be worse though if we just blanket delete the meta dir if it exists already when IMP runs.
bq. ....but I don't see a clear agreement among everyone from whether we should continue the bootstrap or fail hard on the bootstrap when we find the meta table in InitMetaProcedure.
Dunno. It's called IMP. When it runs, there is supposed to be no meta. If there is, then something is not right: i.e. see above clumsy operator. Shouldn't remove the meta dir though if it exists already? Fail the master startup? HBCK2 time? Could do Zach's idea of making it idempotent, but IMP scope does not cover writing the location in zk so can't have this as a 'step' in IMP. What about adding an extra step before assign where we wait, asking the Master a question about the cluster state, such as whether any of the RSs that are checking in have Regions on them; i.e. if Regions are already assigned, is it an already 'up' cluster? Would that help? You fellows don't want to have to run a script beforehand? ZK is up and just put an empty location up, or ask Master or hbck2 to do it for you? You just want the cluster to start up over hfiles? Thanks.
sorry for the delayed response.
no, this is a special case that we have been supporting, where the HBase cluster freshly restarts on top of only flushed HFiles and does not come with WAL or ZK. And we admit that it's a bit different from the community standpoint that WAL and ZK must both pre-exist when the master and/or RSs start on existing HFiles to resume the states left from any procedures.
having an extra step to check if RSs have any regions assigned may help, but I don't know if we can do that before the server manager finds any region server online.
I think HBCK/HBCK2 performs online repair; there are a few concerns we're having
Personally, I'm fine with throwing an exception as Duo suggested, and on our side we need to find a way to continue if we see this exception. Then we can improve it in the future when we need to completely get rid of the extra hbck step. So, for this PR, if we don't hear any other critical suggestions, maybe I will leave it "closed" as unresolved, do you guys agree?
thanks Duo for your understanding and comments ;) and happy holidays! @z-york your point on idempotence is good, but at this point will you agree that we create a follow-up: first add the exception, and discuss/handle the data loss issues in a different JIRA? Will you reconsider your -1 vote on this particular PR? (btw I can rebase it after we agree on this change)
I think we can agree to disagree on the automatic/idempotent IMP; we can propose a plan for that if I have more time to devote to it in the future. For now we can keep that as a patch. I think that this PR can still go in with throwing an exception instead of deleting meta (I don't see any reason to submit a separate PR, let's keep this discussion). I think we should fail instead of allowing an automatic delete of the meta directory (or at least have an option to fail and not delete) since we are continuing to add more metadata into meta and it will become more and more costly to rebuild. I have seen many cases where operators clear out ZK nodes + restart master to unblock some assignment issues, but admittedly that is on 1.x versions of HBase; I think the recovery options might be better in 2.x. We already have offline meta repair that I believe should be able to solve these issues if an exception is thrown. btw @saintstack for the double-assigned meta scenario, wouldn't the IO fencing/lease on the meta WAL handle that? Or will it try to write to a unique WAL each assignment? Just curious, not blocking this PR.
I have fixed the conflicts; I will probably push it in two days and see if you guys have any comments.
} finally {
  if (!isPartial) {
    throw new IOException("Meta table is not partial, please sideline this meta directory "
      + "or run HBCK to fix this meta table, e.g. rebuild the server hostname in ZNode for the "
Should we mention the HBCK sub command to help fix this? (Is this offline meta repair?) Since we're adding a new failure mode, it should be clear how to fix this case.
Instead of specific commands to run, what about referencing something in the book (which could be updated if "what to do" changes across release versions)?
good point on documentation, I need more time to figure out if there are existing command(s)/option(s) other than manually sidelining the meta directory to a different location. Let's mark it as a blocker/requirement before I merge this PR.
https://issues.apache.org/jira/browse/HBASE-24833?focusedCommentId=17409167&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17409167 is a rough sketch of what I was able to do to fix it.
…it's not partial

Add a check on meta bootstrap to skip removing the meta table directory when entering state INIT_META_WRITE_FS_LAYOUT. Here, if any valid HFiles are found in the meta table directory, it is not partial and an Exception is thrown to fail the Master startup.
+1 this looks reasonable
@joshelser should we wait till we add the documentation in this commit? Or should we have a follow-up PR?
merged, and if we need new documentation we can have a follow-up JIRA
…it's not partial

Add a check on meta bootstrap to skip removing the meta table directory
if ZK data does not exist when the HMaster restarts. Here, the existence
of the clusterID in ZK indicates whether the meta is partial when we hit
INIT_META_WRITE_FS_LAYOUT in InitMetaProcedure.