HBASE-24833: Bootstrap should not delete the META table directory if … #2237
@@ -915,6 +917,11 @@ private void finishActiveMasterInitialization(MonitoredTask status)
      this.tableDescriptors.getAll();
    }

    // check cluster Id stored in ZNode before, and use it to indicate if a cluster has been
    // restarted with an existing Zookeeper quorum.
    isClusterRestartWithExistingZNodes =
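For orientation, a minimal sketch of how such a flag could be derived from the clusterId znode using the plain ZooKeeper client; the znode path and the helper name here are illustrative assumptions, not the exact HBase internals:

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public final class ClusterRestartCheck {
  // Assumed default clusterId znode path; the real path depends on zookeeper.znode.parent.
  private static final String CLUSTER_ID_ZNODE = "/hbase/hbaseid";

  /**
   * Returns true if a clusterId znode already exists, i.e. the master is starting
   * against an existing ZooKeeper quorum rather than bootstrapping a new cluster.
   */
  public static boolean isRestartWithExistingZNodes(ZooKeeper zk)
      throws KeeperException, InterruptedException {
    return zk.exists(CLUSTER_ID_ZNODE, false) != null;
  }
}
```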
Hm... This check alone is not enough, I fear.. Say there is a cluster recreate over existing data and so no zk node for clusterId. We will get true here. In the next lines, we will create and write the zk node. Assume that immediately after that, the HM restarted. Now after restart, when we come here, we have zk data for clusterId, and we will end up deleting the Meta dir !!
After a cluster recreate we will come to know this boolean value before creation of the zk node. So we need to record that info somehow, somewhere, so that it survives even if there is a restart before we reach the Meta bootstrap. That info should be there in InitMetaProcedure also.. If there is an HM restart after the submit of this proc, the next time around we still know the boolean status.
Comments?
Thanks Anoop. If I get your suggestion correctly, should we add a master procedure (or a state in InitMetaProcedure) which records this boolean? This carries the assumption that the WAL still exists when the cluster is recreated.
Ya, at proc level.. But before that itself, one more thing.
Say there is a cluster recreate over existing data and so no zk node for clusterId. We will get true for 'isClusterRestartWithExistingZNodes'. In the next lines, we will create and write the zk node for clusterId. Now assume that after executing those lines, the HM restarted. So the zk node was created but the InitMetaProc was NOT submitted. Now after restart, when we come here, we have zk data for clusterId, so 'isClusterRestartWithExistingZNodes' will become false. This time the InitMetaProc gets started and as part of that we will end up deleting the Meta dir.
So this says we need to keep this boolean info somewhere once we find it, even before creating the zk node for ClusterId. Am I making the concern clear this time?
So this says we need to keep this boolean info somewhere once we find it, even before creating the zk node for ClusterId. Am I making the concern clear this time?

ack, and I got the concern.
but before I move to implement the proc-level information, let's wait a bit on @Apache9 for the question on META's tableInfo and partial meta. The ideal case is that we may be able to use FSTableDescriptors.getTableDescriptorFromFs(): if it is found and can be read, that indicates meta is not partial.
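A rough sketch of what such a tableinfo-based check might look like; it assumes the FSTableDescriptors.getTableDescriptorFromFs(fs, tableDir) overload and simply treats a readable descriptor as "not partial":

```java
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.util.FSTableDescriptors;

public final class MetaDescriptorCheckSketch {
  /**
   * Returns true if the meta table directory already has a readable tableinfo,
   * i.e. the earlier bootstrap most likely completed and meta is not "partial".
   */
  public static boolean hasReadableMetaDescriptor(FileSystem fs, Path metaTableDir) {
    try {
      TableDescriptor td = FSTableDescriptors.getTableDescriptorFromFs(fs, metaTableDir);
      return td != null;
    } catch (IOException e) {
      // Missing or unreadable tableinfo: treat meta as partial.
      return false;
    }
  }
}
```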
I still feel a bit strange that we should deal with this in the normal code path... Can we add a new HBCK option for this?
Thanks @Apache9, I want to agree with you on having an HBCK option, but one concern keeps me struggling about making this automated instead of an HBCK option. If an HBase cluster has hundreds of tables with thousands of regions, how would the operator recover the cluster? Does he/she (offline/online) repair the meta table by scanning the storage of each region? (Instead, could we just load the meta without rebuilding it?) Tbh, I felt bad bringing up this meta table issue because a normal HBase cluster does not assume Zookeeper (and WAL) could be gone after the cluster starts and restarts. [updated] For this PR/JIRA, mainly, I'm questioning what a "partial" meta means.
Ya, the biggest issue now is the META FS dir delete which happens after such a cluster recreate. An HBCK kind of tool to recreate the META data is too much IMO. Also we have a lot of other metadata in META too. So IMHO we should somehow avoid this META delete. After a cluster recreate this happens because in zk we don't have a META location and also there is no SCP in MasterProcWal targeting an RS which was holding META. So ya, the same Q.. What is the partial META we refer to? As part of META bootstrap we create the META tableInfo. That is an atomic op. So how can it be partial here? Sorry if I am missing something.. Did not check the code in detail.
I think for the scenario here, we just need to write the cluster id and other things to zookeeper? Just make sure that the current code in HBase will not consider us as a fresh new cluster. We do not need to rebuild meta?
first of all, thanks Duo again.
So, let me confirm your suggestion: that means if we add one more field in the ZNode, e.g. a boolean, as a follow-up, if ZK ZNode data is used to determine whether this is a fresh new cluster, can we skip deleting the meta directory if
do you mean if we step back to just think about the definition of

So, again, IMO data content in a table is sensitive ([updated] sorry if you guys think data in the meta table is not sensitive); I'm proposing not to delete the meta directory if possible (otherwise it's also like running an hbck to delete and rebuild). Based on our discussion here, IMO we have two proposals mentioned to define

seems 2) + *) should be the simplest solution, what do you guys think?
I do not think you fully understand what I proposed from the beginning of the related topic. Honestly I do not know what you are going to fix here; what is the ZNode you want to add a boolean to? My suggestion has always been to introduce a new HBCK option to reconstruct the content on zookeeper, so we know that this is not a fresh new cluster, and we will not schedule InitMetaProcedure. So for InitMetaProcedure itself, we do not need to consider whether the meta region is partial or not. You can see the code in finishActiveMasterInitialization and also the code in AssignmentManager.start. In AssignmentManager.start, we will try to load the state of the meta region from zookeeper, and if there is none, we will create a region node in offline state. And in finishActiveMasterInitialization, if we find that the state of the meta region node is offline, we will schedule InitMetaProcedure. Is this clear enough? Thanks.
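To make the flow described here easier to follow, a condensed, self-contained sketch; the class and field names are illustrative assumptions, not the actual HBase master code:

```java
import java.util.Optional;

// Condensed illustration of the startup logic described above:
// AssignmentManager.start() loads the meta region state from ZooKeeper (creating
// an OFFLINE node if there is none), and finishActiveMasterInitialization()
// schedules InitMetaProcedure only when that state is OFFLINE.
enum MetaRegionState { OFFLINE, OPEN }

final class StartupFlowSketch {
  /** Stand-in for the meta-region state znode; empty means "no data on ZK". */
  private Optional<MetaRegionState> metaStateZnode = Optional.empty();

  MetaRegionState loadOrCreateMetaRegionNode() {
    if (metaStateZnode.isEmpty()) {
      // No znode: from the master's point of view this looks like a fresh cluster.
      metaStateZnode = Optional.of(MetaRegionState.OFFLINE);
    }
    return metaStateZnode.get();
  }

  boolean shouldScheduleInitMeta() {
    return loadOrCreateMetaRegionNode() == MetaRegionState.OFFLINE;
  }
}
```

This is why wiping the ZooKeeper data makes the master treat an existing cluster as brand new and re-run the meta bootstrap.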
Thanks Duo.. I was about to write down the flow of why we do InitMeta.. In the cluster-recreate case @taklwu describes, the WAL data itself is not there, as well as zk. So as you said, if one has to do this, then they have to do 2 things
To your Q on the status on zk for clusterId @taklwu: the clusterId stored in zk is itself a PB'ed message, so we can always add an optional boolean field there, and change it later too. So no need for new nodes or anything.. But let's wait for a conclusion on the direction. We still don't have consensus on whether to add this as a top-class HBase thing itself (bootstrapping the cluster over existing data). If it is a top-class feature, we don't have to do many tricks to satisfy such needs. IMHO that is worth adding, as everyone can benefit.
OK, so your point is to add a feature where we could start a cluster with no data on zk? I would vote -1 here, as I do not think it is possible. And even if we could make it work for now, we cannot force all later developers to keep in mind that all data on zk are not stable and that you should not depend on them... It does not make sense. And on the InitMetaProcedure, maybe we could avoid deleting the whole meta if there are HFiles in it. This is a possible improvement, to avoid damaging the cluster if there is something wrong with zk. As for InitMetaProcedure, if we are at the first state, we can make sure that only directories have been created and there is no actual data? Anyway, even if we add this check, I think we should fail the procedure and fail the startup, as we know that there is something wrong. Users should use HBCK or other tools to fix the inconsistency before moving forward. Thanks.
The biggest issue now is the META dir delete.. In one direction we are trying NOT to keep any persisted data in zk, no? Till 2.1.7, even for the META table, we never had an issue even if that data was not there.. The bootstrap code was assigning the existing META table to a new RS. Some way or the other, I don't mind which, we need a solution for this. As per the existing code itself, I am not sure why we need to delete the entire dir under the assumption of a partial bootstrap. There is a dir creation step and then the tableInfo create.. The tableInfo file create is atomic too. Every step we check if the dir/file exists. Correct? If this META delete was not there, things would have been easy. We have supported cluster drop and create with versions <2.1.6.. But from 2.1.7 this is very different because of this META delete.
For 2.2+ we have a completely different way to do master initialization. This is exactly what I said above: it is not easy to force all the developers to follow the rule that the data on zookeeper are not stable. Maybe it can work for now, but maybe it will be broken in the future...
And for the problem here, it is why we introduced the InitMetaProcedure. For InitMetaProcedure itself, the assumption is that it will do the initialization work; it would be very strange that we schedule an 'InitMeta' procedure but we already have a meta in place. Why not just not schedule the InitMetaProcedure in the first place?
InitMetaProcedure was there before also.. Only in 2.1.7 was this delete of the META dir added. I am asking about that only, @Apache9. That is the major concern. The init proc previously did the FS init only if the meta dir was not yet there, and then initialized the assign of META in the cluster. Now it will do both, whatever the current status of the meta dir. The request is a change to that. Is that something you veto?
I do not see why we need to change this. Consider the usage of InitMetaProcedure: if there is a meta directory there, it must be 'partial'. And even if it is not partial, we are still safe to delete it, as there should be no data in it. If this is not the case, we must have an inconsistency in our cluster. For safety, I think we could skip the deletion of the meta directory, but we should still fail the procedure and thus fail the initialization of the master. Users should fix the inconsistency before starting again.
Anyway, let's get back to the initial goal? I think it is to support restarting a cluster with nothing on zookeeper? Let's have an overall design first. As I said above, it is not easy to force all developers to follow this rule. If we want to do this, we need to find a suitable way to make later developers aware that they should not depend on the data on zookeeper...
I agree with you that we may not be ready to totally stop relying on the data stored on zookeeper; that's definitely a broader discussion on what HBase currently depends on Zookeeper for (branch-2 and master), especially if data on Zookeeper could be ephemeral or removed. (I thought we're in the process of moving data into the ROOT region, aren't we? e.g. Proc-v2) Also, my initial goal is that the meta data/directory should not be deleted if possible, and we're trying to provide a persisted condition to not always delete meta if it's not
sorry that I may be a newbie on proc-v2 and zk data; should we start a thread on the dev@ list to discuss the following? (my goal is to find a consensus on how we can move this PR further)
I think we'd better start from a high-level design on how we could start a cluster with no data on zookeeper. My proposal is still to introduce a new option in HBCK2 to reconstruct the data on zookeeper before you actually start the cluster. If you guys think it is possible to do this in the normal code path, please give an overall design. Open a google doc? So others could comment on it? For deleting meta, I agree that we should not delete it if it is not 'partial'. But as I said above many times, if there is no inconsistency, scheduling an InitMetaProcedure implies that we do not have a meta table yet, so in InitMetaProcedure it is always safe to delete the meta directory. So even if we add a check in InitMetaProcedure to determine whether the meta table is 'really' partial, the result should be an ERROR log telling users something critical has happened, please fix the inconsistency before moving forward. I do not think this is what you guys want here? As you want to start the HMaster and recover from the inconsistency?
IMO the title of the google design doc may be about the cloud use cases that have been restarting on just HFiles without WAL and without Zookeeper (but where all the user tables are flushed and disabled before terminating the cluster). I know that may not be mentioned in the book/tutorial, and it may be a good time to clarify how those cases actually work; some users have been relying on them in HBase 1.4.x and maybe HBase before 2.1.7. Then we can see what the gaps may be now in [updated] branch-2.4+ (we don't need that in 2.3.0 and 2.2.5 because
what does 'inconsistency' mean here?
The inconsistency is between the data on the filesystem and zookeeper. In general, after the meta table has been initialized, it should have a directory on the file system, and also a znode to indicate its state on zookeeper. The problem here is that, in your scenario, the data on zookeeper are all gone. This is what I call 'inconsistency'. And I've also mentioned several times that the fact it works for several versions does not mean it will work for all versions. They are just different implementation choices, and you cannot force developers not to use the data on zookeeper to determine what to do next. So I do not think it is worth asking on the dev list about 'whether' InitMetaProcedure should delete meta without checking 'partial' or not. The problem here is that you delete the state znode for the meta region on zookeeper, not a problem in InitMetaProcedure. Either you add it back through HBCK2, or you should have an overall design on how we could fix the problem without using HBCK2.
I apologize for the dev@ email, but I was thinking differently overnight about your suggestion (sorry that I reread it many times until I found the gap this morning).
Didn't the coming-up master region that stores the meta location in HBASE-24408 and PR#1746 solve our conflict of interests, in that we don't need to rely on ZK for getting the server name (old host) for the meta region? Such that even if we don't have the ZK, we can move on and not submit the InitMetaProcedure, because the state of the meta region is not
If you confirm the above, I may say that bringing this PR and keeping the zookeeper discussion highlighted was my mistake, and I should have learnt about the master region ahead of this PR. (Then we just need to move to the coming-up version, and we can still restart for the cloud use cases.) [update] I will follow your suggestion and throw an exception if we find the meta table is not empty/partial.
HBASE-24388 is only on a feature branch and we are still discussing whether to go this way... There is another solution to introduce a general root table; then we will have an InitRootProcedure and the state of the root region will still be on zookeeper. Of course I'm a sponsor of HBASE-24388 as it is implemented by me...
Yes, let's do this first to prevent damaging a cluster. And in general, I do not think it is very hard to introduce a tool for your scenario? Just putting a dummy meta state node on zookeeper can solve the problem. And for the regions, as said before, you'd better disable them all before shutting down the old cluster, and after starting the new cluster, enable them. Even if you do not do this, I think you could just scan the meta regions to find out all the recorded region servers and schedule HBCKSCP for them to get the cluster back.
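As a rough illustration of the "scan meta for the recorded region servers" step mentioned here, a hedged sketch using the public client API; it assumes the info:server column carries the last known location and leaves the actual HBCKSCP scheduling out:

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public final class RecordedServersSketch {
  /** Collect region server names recorded in hbase:meta (info:server). */
  public static Set<String> recordedServers(Configuration conf) throws IOException {
    byte[] info = Bytes.toBytes("info");
    byte[] server = Bytes.toBytes("server");
    Set<String> servers = new HashSet<>();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table meta = conn.getTable(TableName.META_TABLE_NAME)) {
      Scan scan = new Scan().addColumn(info, server);
      for (Result r : meta.getScanner(scan)) {
        byte[] value = r.getValue(info, server);
        if (value != null) {
          servers.add(Bytes.toString(value));
        }
      }
    }
    return servers;
  }
}
```

Each server name found this way could then be handed to an HBCK2-style repair that schedules server crash recovery for those hosts, as suggested above.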
if (!isPartial) {
  throw new IOException("Meta table is not partial, please sideline this meta directory "
    + "or run HBCK to fix this meta table, e.g. rebuild the server hostname in ZNode for the "
    + "meta region");
}
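For context, a minimal sketch of how a flag like isPartial above could be derived by looking for data files under the meta table directory; the recursive listing and the name-based filter are simplifying assumptions, not the exact check used in this PR:

```java
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public final class MetaDirCheckSketch {
  /** Meta is treated as "partial" only if its directory holds no data files yet. */
  public static boolean isMetaDirPartial(FileSystem fs, Path metaTableDir) throws IOException {
    if (!fs.exists(metaTableDir)) {
      return true; // nothing bootstrapped yet
    }
    RemoteIterator<LocatedFileStatus> files = fs.listFiles(metaTableDir, true);
    while (files.hasNext()) {
      LocatedFileStatus f = files.next();
      // Ignore bookkeeping entries such as .tableinfo/.regioninfo; anything else
      // under the table directory is assumed to be a store file (HFile).
      if (f.isFile() && !f.getPath().getName().startsWith(".")) {
        return false; // found what looks like data -> not partial
      }
    }
    return true;
  }
}
```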
@Apache9, we're failing the InitMetaProcedure with an IOException, and HMaster will fail the master startup if InitMetaProcedure is FAILED with an exception.
Still, alternatively, we could continue the bootstrap without throwing (but this is not good, as you recommended).
So, do you think this change aligns with your comments?
In general the approach is fine. The only concern is the implementation. I do not think InitMetaProcedure supports rollback, so what will happen if we throw an exception here? You will just see an ERROR log in the output saying that the procedure does not support rollback?
[updated] Yeah, I saw the UnsupportedOperationException and the timeout in the unit test logs, and the master didn't stop, so I added a section in HMaster to catch this exception and fail the master startup with another IOException.
Did I do it wrong? Or any idea how I can fail the procedure right?
// wait meta to be initialized after we start procedure executor
if (initMetaProc != null) {
  initMetaProc.await();
  if (initMetaProc.isFailed() && initMetaProc.hasException()) {
    throw new IOException("Failed to initialize meta table", initMetaProc.getException());
  }
}
I would worry that the ProcedureExecutor is going to resubmit the IMP again. You might have to, instead, let IMP finish without an error and determine via some side-effect (e.g. set a variable in HMaster?) whether we bailed out of IMP, and then fail at that level. Truthfully, I don't remember what happens by default (i.e. whether all procedures which throw an exception are implicitly retried).
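A hypothetical sketch of the "side-effect" approach floated here, letting the procedure finish cleanly and failing at the master level instead; none of these class or field names come from the actual code:

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

// The procedure completes normally (so the executor never tries to roll it back
// or retry it), but records that it bailed out on an existing meta directory;
// the master then checks the flag after awaiting the procedure.
final class MetaBootstrapOutcome {
  final AtomicBoolean bailedOutOnExistingMeta = new AtomicBoolean(false);
}

final class MasterStartupCheckSketch {
  void afterInitMetaCompletes(MetaBootstrapOutcome outcome) throws IOException {
    if (outcome.bailedOutOnExistingMeta.get()) {
      // Fail startup here rather than inside the procedure itself.
      throw new IOException("Meta directory already contains data; sideline it "
        + "or repair the cluster before restarting the master");
    }
  }
}
```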
I have been watching this for a while, and I still don't have good clarity on the following case.
@Apache9 Can you explain how a partial meta table will damage a cluster? I have been looking at the code for InitMetaProcedure and the only thing I can see that is not idempotent is the directory creation for the NS (and the delete of the meta table, but that is being used as the fix for making things idempotent). If InitMetaProcedure is idempotent, there should be no need to delete the directory. If we restart a cluster and schedule an InitMetaProcedure:
After that case, if there is any inconsistency in the meta table, it can be handled via HBCK as usual. A couple of tenets we should always have:
Anyway, it is better than just removing meta with data. Let's do this first.
We can see how to improve this in the future.
I am currently -1 on the implementation pending answers to my questions. I have not been able to find any valid failure modes that will cause data loss or corrupting actions if hbase:meta is onlined. If this is true, we can make InitMetaProcedure idempotent (see above) and we can handle any InitMetaProcedure failure condition. In my investigation I was actually looking at the master branch which contains an additional state for INIT_META_CREATE_NAMESPACES. There are additional bugs associated with this approach that I will open JIRAs for. I will not bring the additional complexity of folding the namespace table into the meta table into this discussion.
so..... how can we get consensus on this PR or this series of idempotency issues (for InitMetaProcedure)? I don't mind breaking them into more tasks, as Zach has created those follow-up bugs (HBASE-24922 and HBASE-24923), but I don't see clear agreement among everyone on whether we should continue the bootstrap or fail hard on the bootstrap when we find the meta table in InitMetaProcedure. How do we move further?
@z-york Is a clumsy operator deleting the meta location znode by mistake a valid failure mode? Throw in a Master restart soon after (Will IMP run? If so, double assign of meta if the rest of the cluster was up at the time, which can make for data loss as the two meta Regions fight over what hfiles make up the meta Region). The data loss will be worse though if we just blanket delete the meta dir if it exists already when IMP runs.
bq. ....but I don't see a clear agreement among everyone from whether we should continue the bootstrap or fail hard on the bootstrap when we find the meta table in InitMetaProcedure.
Dunno. It's called IMP. When it runs, there is supposed to be no meta. If there is, then something is not right: i.e. see above clumsy operator. Shouldn't remove the meta dir though if it exists already? Fail the master startup? HBCK2 time? Could do Zach's idea of making it idempotent, but IMP scope does not cover writing the location in zk so can't have this as a 'step' in IMP. What about adding an extra step before assign where we wait, asking the Master a question about the cluster state, such as whether any of the RSs that are checking in have Regions on them; i.e. if Regions are already assigned, is it an already 'up' cluster? Would that help? You fellows don't want to have to run a script beforehand? ZK is up and just put an empty location up, or ask Master or hbck2 to do it for you? You just want the cluster to start up over hfiles? Thanks.
sorry for the delayed response.
no, this is a special case that we have been supporting, where the HBase cluster freshly restarts on top of only flushed HFiles and does not come with WAL or ZK. And we admit that it's a bit different from the community standpoint that WAL and ZK must both pre-exist when the master and/or RSs start on existing HFiles to resume the states left from any procedures.
having an extra step to check if RSs have any regions assigned may help, but I don't know if we can do that before the server manager finds any region server online.
I think HBCK/HBCK2 performs online repair; there are a few concerns we're having
Personally, I'm fine with throwing an exception as Duo suggested, and on our side we need to find a way to continue if we see this exception. Then we can improve it in the future when we need to completely get rid of the extra hbck step. So, for this PR, if we don't hear any other critical suggestions, maybe I will leave it "closed" as unresolved, do you guys agree?
thanks Duo for your understanding and comments ;) and happy holidays! @z-york your point on idempotence is good, but at this point will you agree that we create a follow-up: first add the exception, and discuss/handle the data loss issues in a different JIRA? Will you reconsider your -1 vote on this particular PR? (btw I can rebase it after we agree on this change)
I think we can agree to disagree on the automatic/idempotent IMP; we can propose a plan for that if I have more time to devote to it in the future. For now we can keep that as a patch. I think that this PR can still go in with throwing an exception instead of deleting meta (I don't see any reason to submit a separate PR, let's keep this discussion). I think we should fail instead of allowing an automatic delete of the meta directory (or at least have an option to fail and not delete) since we are continuing to add more metadata into meta and it will become more and more costly to rebuild. I have seen many cases where operators clear out ZK nodes + restart master to unblock some assignment issues, but admittedly that is on 1.x versions of HBase; I think the recovery options might be better in 2.x. We already have offline meta repair that I believe should be able to solve these issues if an exception is thrown. btw @saintstack for the double-assigned meta scenario, wouldn't the IO fencing/lease on the meta WAL handle that? Or will it try to write to a unique WAL each assignment? Just curious, not blocking this PR.
I have fixed the conflicts; I will probably push it in two days and see if you guys have any comments.
} finally {
  if (!isPartial) {
    throw new IOException("Meta table is not partial, please sideline this meta directory "
      + "or run HBCK to fix this meta table, e.g. rebuild the server hostname in ZNode for the "
Should we mention the HBCK sub command to help fix this? (Is this offline meta repair?) Since we're adding a new failure mode, it should be clear how to fix this case.
Instead of specific commands to run, what about referencing something in the book (which could be updated if "what to do" changes across release versions)?
good point on documentation, I need more time to figure out if there are existing command(s)/option(s) other than manually sidelining the meta directory to a different location. Let's mark it as a blocker/requirement before I merge this PR.
https://issues.apache.org/jira/browse/HBASE-24833?focusedCommentId=17409167&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17409167 is a rough sketch of what I was able to do to fix it.
…it's not partial

Add a check on meta bootstrap to skip removing the meta table directory when entering state INIT_META_WRITE_FS_LAYOUT. Here, if any valid HFiles are found in the meta table directory, it is not partial and an Exception is thrown to fail the Master startup.
+1 this looks reasonable
@joshelser should we wait till we add the documentation in this commit? Or should we have a follow-up PR?
merged, and if we need new documentation we can have a follow-up JIRA
…it's not partial

Add a check on meta bootstrap to skip removing the meta table directory
if ZK data does not exist when the HMaster restarts. Here, the existence
of the clusterID in ZK indicates whether the meta is partial when we hit
INIT_META_WRITE_FS_LAYOUT in InitMetaProcedure.