Skip to content

HBASE-24833: Bootstrap should not delete the META table directory if … #2237

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Oct 12, 2021

Conversation

taklwu
Copy link
Contributor

@taklwu taklwu commented Aug 10, 2020

…it's not partial

Add a check on meta bootstrap to skip removing meta table directory
if ZK data does not exist when hmaster restart. here the existence
of clusterID in ZK indicate if the meta is partial if we hit the
INIT_META_WRITE_FS_LAYOUT in InitMetaProcedure

@taklwu taklwu requested a review from anoopsjohn August 10, 2020 22:20
@Apache-HBase
Copy link

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 42s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
_ branch-2 Compile Tests _
+1 💚 mvninstall 4m 47s branch-2 passed
+1 💚 checkstyle 1m 48s branch-2 passed
+1 💚 spotbugs 2m 48s branch-2 passed
_ Patch Compile Tests _
+1 💚 mvninstall 4m 10s the patch passed
+1 💚 checkstyle 1m 31s the patch passed
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 hadoopcheck 15m 53s Patch does not cause any errors with Hadoop 3.1.2 3.2.1.
+1 💚 spotbugs 2m 56s the patch passed
_ Other Tests _
+1 💚 asflicense 0m 16s The patch does not generate ASF License warnings.
44m 25s
Subsystem Report/Notes
Docker Client=19.03.12 Server=19.03.12 base: https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/1/artifact/yetus-general-check/output/Dockerfile
GITHUB PR #2237
JIRA Issue HBASE-24833
Optional Tests dupname asflicense spotbugs hadoopcheck hbaseanti checkstyle
uname Linux 96268feba611 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 16:55:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2 / 17a0c2a
Max. process+thread count 94 (vs. ulimit of 12500)
modules C: hbase-server U: hbase-server
Console output https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/1/console
versions git=2.17.1 maven=(cecedd343002696d0abb50b32b541b8a6ba2883f) spotbugs=3.1.12
Powered by Apache Yetus 0.11.1 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase
Copy link

💔 -1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 2m 48s Docker mode activated.
-0 ⚠️ yetus 0m 7s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ branch-2 Compile Tests _
+1 💚 mvninstall 5m 11s branch-2 passed
+1 💚 compile 1m 15s branch-2 passed
+1 💚 shadedjars 6m 38s branch has no errors when building our shaded downstream artifacts.
-0 ⚠️ javadoc 0m 41s hbase-server in branch-2 failed.
_ Patch Compile Tests _
+1 💚 mvninstall 4m 33s the patch passed
+1 💚 compile 1m 17s the patch passed
+1 💚 javac 1m 17s the patch passed
+1 💚 shadedjars 6m 49s patch has no errors when building our shaded downstream artifacts.
-0 ⚠️ javadoc 0m 44s hbase-server in the patch failed.
_ Other Tests _
-1 ❌ unit 137m 7s hbase-server in the patch failed.
169m 27s
Subsystem Report/Notes
Docker Client=19.03.12 Server=19.03.12 base: https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/1/artifact/yetus-jdk11-hadoop3-check/output/Dockerfile
GITHUB PR #2237
JIRA Issue HBASE-24833
Optional Tests javac javadoc unit shadedjars compile
uname Linux 79f0b2d535a9 4.15.0-65-generic #74-Ubuntu SMP Tue Sep 17 17:06:04 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2 / 17a0c2a
Default Java 2020-01-14
javadoc https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/1/artifact/yetus-jdk11-hadoop3-check/output/branch-javadoc-hbase-server.txt
javadoc https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/1/artifact/yetus-jdk11-hadoop3-check/output/patch-javadoc-hbase-server.txt
unit https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/1/artifact/yetus-jdk11-hadoop3-check/output/patch-unit-hbase-server.txt
Test Results https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/1/testReport/
Max. process+thread count 3679 (vs. ulimit of 12500)
modules C: hbase-server U: hbase-server
Console output https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/1/console
versions git=2.17.1 maven=(cecedd343002696d0abb50b32b541b8a6ba2883f)
Powered by Apache Yetus 0.11.1 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase
Copy link

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 1m 23s Docker mode activated.
-0 ⚠️ yetus 0m 7s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ branch-2 Compile Tests _
+1 💚 mvninstall 4m 17s branch-2 passed
+1 💚 compile 1m 5s branch-2 passed
+1 💚 shadedjars 5m 54s branch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 43s branch-2 passed
_ Patch Compile Tests _
+1 💚 mvninstall 4m 0s the patch passed
+1 💚 compile 1m 0s the patch passed
+1 💚 javac 1m 0s the patch passed
+1 💚 shadedjars 5m 44s patch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 39s the patch passed
_ Other Tests _
+1 💚 unit 221m 44s hbase-server in the patch passed.
248m 32s
Subsystem Report/Notes
Docker Client=19.03.12 Server=19.03.12 base: https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/1/artifact/yetus-jdk8-hadoop2-check/output/Dockerfile
GITHUB PR #2237
JIRA Issue HBASE-24833
Optional Tests javac javadoc unit shadedjars compile
uname Linux 257932d04b1f 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2 / 17a0c2a
Default Java 1.8.0_232
Test Results https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/1/testReport/
Max. process+thread count 2313 (vs. ulimit of 12500)
modules C: hbase-server U: hbase-server
Console output https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/1/console
versions git=2.17.1 maven=(cecedd343002696d0abb50b32b541b8a6ba2883f)
Powered by Apache Yetus 0.11.1 https://yetus.apache.org

This message was automatically generated.

@taklwu taklwu self-assigned this Aug 11, 2020
@taklwu taklwu added bug and removed bug labels Aug 11, 2020
@@ -915,6 +917,11 @@ private void finishActiveMasterInitialization(MonitoredTask status)
this.tableDescriptors.getAll();
}

// check cluster Id stored in ZNode before, and use it to indicate if a cluster has been
// restarted with an existing Zookeeper quorum.
isClusterRestartWithExistingZNodes =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm... This check alone is not enough I fear.. Say there is a cluster recreate over existing data and so no zk node for clusterId. We will get true here. In next lines, we will create and write the zk node. Assume immediately after that, the HM restarted. Now after restart, when we come here, we have zk data for clusterId, and we will end up deleting the Meta dir !!
After cluster recreate we will come to know this boolean value before creation of zk node. So we need to record that info some how somewhere so that even if there is a restart before we reach to the Meta bootstrap. Even that info should be there in InitMetaProcedure also.. If there is an HM restart after the submit of this proc, the next time we know the boolean status.
Comments?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Anoop, if I get your suggestion correctly, should we add a master procedure (or a state in InitMetaProcedure) which record this boolean? this will have the assumption with the WAL exists when cluster recreates.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya at proc level.. But before that itself one more thing.
Say there is a cluster recreate over existing data and so no zk node for clusterId. We will get true for 'isClusterRestartWithExistingZNodes '. In next lines, we will create and write the zk node for clusterId. Now assume after executing that lines, the HM restarted. So the zk node was created but the InitMetaProc was NOT submitted. Now after restart, when we come here, we have zk data for clusterId. So 'isClusterRestartWithExistingZNodes ' will become false. Now this time the InitMetaProc started and as part of that we will end up deleting the Meta dir.
So this says the need to keep this boolean info somewhere once we find that and even before creating the zk node for ClusterId. Am I making the concern clear this time?

Copy link
Contributor Author

@taklwu taklwu Aug 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this says the need to keep this boolean info somewhere once we find that and even before creating the zk node for ClusterId. Am I making the concern clear this time?

ack and I got the concern.

but before I move to implement the proc level information, let's wait a bit on @Apache9 for the question on META's tableInfo and partial meta. The ideal case is that we may be able to use FSTableDescriptors.getTableDescriptorFromFs() if found and can be readed to indicate meta is not partial.

@Apache9
Copy link
Contributor

Apache9 commented Aug 11, 2020

I still feel a bit strange that we should deal with this in normal code path...

Can we add a new HBCK option for this?

@taklwu
Copy link
Contributor Author

taklwu commented Aug 11, 2020

Thanks @Apache9 , I want to agree with you to have a HBCK option, but one concern I have and keep struggling about making this automated instead of HBCK options. If one HBase cluster has hundred of tables with thousand of regions, how would the operator recovery the cluster? does he/she (offline/online) repair the meta table by scanning the storage on each region ? (instead we can just load the meta without rebuilding it?)

Tbh, I felt bad to bring this meta table issue because normal HBase cluster does not assume Zookeeper (and WAL) could be gone after the cluster starts and restarts.

[updated] for this PR/JIRA, mainly, I'm questioning what a partial meta should be (e.g. it's now relying on the state of InitMetaProcedure instead of the data of meta table), any thoughts ?

@anoopsjohn
Copy link
Contributor

Ya the biggest issue now is the META FS dir delete which happens after such a cluster recreate. An HBCK kind of tool to recreate the META data is too much IMO. Also we many other meta data also in META. So IMHO we should some how avoid this META delete. After cluster recreate this happens because in zk we dont have a META location and also no SCP in MasterProcWal which is targeting an RS which was holding META. So ya the same Q.. What is a partial META we refer? As part of META bootstrap we create the META tableInfo. That is an atomic op. So can be partial here? Sorry if am missing some thing.. Did not check code in detail.

@Apache9
Copy link
Contributor

Apache9 commented Aug 13, 2020

Thanks @Apache9 , I want to agree with you to have a HBCK option, but one concern I have and keep struggling about making this automated instead of HBCK options. If one HBase cluster has hundred of tables with thousand of regions, how would the operator recovery the cluster? does he/she (offline/online) repair the meta table by scanning the storage on each region ? (instead we can just load the meta without rebuilding it?)

I think for the scenario here, we just need to write the cluster id and other things to zookeeper? Just make sure that the current code in HBase will not consider us as a fresh new cluster. We do not need to rebuild meta?

Tbh, I felt bad to bring this meta table issue because normal HBase cluster does not assume Zookeeper (and WAL) could be gone after the cluster starts and restarts.

[updated] for this PR/JIRA, mainly, I'm questioning what a partial meta should be (e.g. it's now relying on the state of InitMetaProcedure instead of the data of meta table), any thoughts ?
After introducing proc-v2, we reply on it to record the state of a multi-step operation. Here, I believe the problem is that, we schedule an InitMetaProcedure when we do have a meta table in place. For the InitMetaProcedure, the assumption is that, if we found that the meta table directory is there, then it means the procedure itself has crashed before finishing the creation of meta table, i.e, the meta table is 'partial'. So it is safe to just remove it and create again. I think this is a very common trick in distributed system for handling failures?

@taklwu
Copy link
Contributor Author

taklwu commented Aug 13, 2020

first of all, thanks Duo again.

I think for the scenario here, we just need to write the cluster id and other things to zookeeper? Just make sure that the current code in HBase will not consider us as a fresh new cluster. We do not need to rebuild meta?

So, let me confirm your suggestion, that means if we add one more field in ZNode, e.g. a boolean completedMetaBoostrap, if we find both clusterId and completedMetaBoostrap in ZK, we will not delete meta directory ?

followup if ZK Znode data is used to determine if this is a fresh new cluster, can we skip the delete meta directory if clusterId and completedMetaBoostrap are never set but we found meta directory? this is the cloud use cases which we don't have ZK to make the decision; such we don't know if the meta is partial, and IMO, we should just leave the meta directory and if anything bad happens, the operator can still run HBCK. (if we do the other way around and always delete the meta, then we're losing the possibility the cluster can heal itself, and we cannot confirm if this is partial, doesn't it?)

For the InitMetaProcedure, the assumption is that, if we found that the meta table directory is there, then it means the procedure itself has crashed before finishing the creation of meta table, i.e, the meta table is 'partial'. So it is safe to just remove it and create again. I think this is a very common trick in distributed system for handling failures?

do you mean idempotent is the trick ? InitMetaProcedure may be idempotent and can make hbase:meta online (as a empty table), but I don't think if the cluster/HM itself is idempotent automatically; and yeah, it can rebuild the data content of the original meta with the help of HBCK, but just if HM continues the flow with some existing data, e.g. the namespace table (sorry for branch-2 we have namespace table) and HM restart with a empty meta, based on the experiment I did, the cluster hangs and HM cannot be initialized.

if we step back to just think on the definition of partial meta, it would be great if the meta table itself can tell if it's partial, because it's still a table in HBase and HFiles are immutable. e.g. can we tell if a user table is partial by looking at its data? I may be wrong, but it seems like we're not able to tell from HFiles itself, and we need ZK and WAL to define it.

So, again, IMO data content in a table is sensitive ([updated] sorry if you guys think data in meta table is not sensitive), I'm proposing not to delete meta directory if possible (it's also like running a hbck to delete and rebuild).

Based on our discussion here, IMO we have two proposal mentioned to define partial meta .

  1. add a boolean in WAL like a proc-level data
  2. write a boolean in ZNode to tell if the bootstrap completes
    *. no matter we choose 1) and 2) above, we have an additional condition, if we don't find any WAL or ZK about this condition, we should not delete the meta table.

seems 2) + *) should be the simplest solution, what do you guys think?

@Apache9
Copy link
Contributor

Apache9 commented Aug 13, 2020

I do not think you fully understand what I proposed from the beginning of the related topic. Honestly I do not know what you are going to fix here, what is the ZNode you want to add a boolean to?

My suggestion is always that, introducing a new HBCK option, to reconstruct the content on zookeeper, so we know that this is not a fresh new cluster, and we will not schedule InitMetaProcedure. So for InitMetaProcedure itself, we do not need to consider whether the meta region is partial or not.

You can see the code in finisheActiveMasterInitialization and also the code in AssignmentManager.start. In AssignmentManager.start, we will try to load the start of meta region from zookeeper, and if there is none, we will create a region node in offline state. And in finishActiveMasterInitialization, if we find that the state of meta region node is offline, we will schedule InitMetaProcedure.
So what you need to do here, is to put the meta region znode to zookeeper, before you restart the hbase cluster. So we will not schedule InitMetaProcedure again.

Is this clear enough?

Thanks.

@anoopsjohn
Copy link
Contributor

Thanks Duo.. I was about to put the flow why we do InitMeta.. In case of the cluster recreate what @taklwu says, the WAL data itself not there as well as zk. So as what u said, if one has to do, then they have to do 2 things

  1. When the initial cluster was dropped, before that the WAL fs (HDFS backed by managed disk in cloud) need to be backedup. Also the zk data (the meta server location) to be backed up
  2. When the cluster recreated from existing data, some tool (hbck or some thing) need to recreate the zk node and put that old location value. Before that it has to recover back the WAL FS data onto the new HDFS cluster.
    The WAL fs data backup and restore make sense. But IMHO the zk data thing looks another hack. Till 2.1.6 this was not needed. Even if the meta location is not there zk, the init meta will kick in and that will create meta region from existing data and assign to some RS. But the meta cleanup make the existing entire data to be deleted. I think that is not good. We are adding more things to META table these days. The NS info itself is in another CF in META table.. There is some other discussion around adding the committed HFiles data into META (This is not concluded but looks like we keep increasing the responsibility of META table). So dropping all these information is not that good.
    So my thinking was to add such cluster recreate as a 1st class feature in HBase itself. This can be used by anyone anywhere. As long as the data is persisted, we can drop the cluster and recreate later. Now for that 2 blockers and the biggest one is this META dir delete as part of InitMetaProc. I agree to the intent of adding that cleanup. But now if we have to have this support also how can we avoid this delete? Recreate entire META using some hbck options should be the very last option to think off IMO. If we can solve other ways why not?

@anoopsjohn
Copy link
Contributor

To ur Q on the status on zk for clusterId @taklwu , The clusterId stored in zk itself is a PBed message. So we can always add the boolean optional field there. And later change also. So no need for new nodes or any.. But lets wait for a conclusion on the direction. Like still we dont have consensus on whether adding this as a top class HBase thing itself? (Bootstrap the cluster over existing data). If top class feature, we dont have to do many tricks to satisfy such needs. IMHO that is worth adding as all can get benefited.

@Apache9
Copy link
Contributor

Apache9 commented Aug 13, 2020

OK, so your point, is to add a feature that, we could start a cluster without no data on zk? I would vote a -1 here as I do not think it is possible. And even we could make it work for now, we can not force all later developpers to keep in mind that, all data on zk are not stable, you should not depend on them... It does not make sense.

And on the InitMetaProcedure, maybe we could avoid deleting the whole meta, if there are HFiles in it. This is a possible improvement, to avoid damage the cluster if there are something wrong with zk. As for InitMetaProcedure, if we are at the first state, we can make sure that only directories are created and no actual data? Anyway, even if we add this check, I think we should fail the procedure and fail the start up, as we know that there are something wrong. Users should use HBCK or other tools to fix the inconsistency before moving forward.

Thanks.

@anoopsjohn
Copy link
Contributor

anoopsjohn commented Aug 13, 2020

The biggest issue now is META dir delete.. In one direction we are trying NOT to keep any persisted data in zk no? Till 2.1.7 even for META table also, we never had an issue even if that is not there.. The bootstrap code was assigning existing META table to new RS. Some way or the other, I dont mind, we need a solution for this. As per the existing code itself, am not sure why we need to delete the entire dir with assumption of partial bootstrap. There is a dir creation steps and then tableInfo create.. The tableInfo file create is atomic too. Every step we check if dir/file exists . Correct? If this META delete was not there things would have been easy. We have supported cluster drop and create with version <2.1.6.. But from 2.1.7 this is very diff because of this META delete.

@Apache9
Copy link
Contributor

Apache9 commented Aug 13, 2020

For 2.2+ we have a completely different way to do master initialization. This is exactly what we I said above, it is not easy to force all the developpers to follow the rule, that the data on zookeeper are not stable. Maybe it can work for now, but maybe it will be broken in the future...

@Apache9
Copy link
Contributor

Apache9 commented Aug 13, 2020

And for the problem here, it is why we introduce the InitMetaProcedure. For InitMetaProcedure itself, the assumption is that it will do the initialization work, it will be very strange that, we schedule a 'InitMeta' procedure, but we already have a meta in place? Why not just do not schedule the InitMetaProcedure at the first place?

@anoopsjohn
Copy link
Contributor

InitMetaProcedure was there before also.. Only in 2.1.7, this delete of the META dir added. Am asking abt that only @Apache9 . That is the major concern. The init proc was before doing FS init if the meta dir is not yet there. And then init the assign of the META in the cluster. Now it will do both what so ever, the cur status of the meta dir. The req is a change for that. Is that some thing u veto?

@Apache9
Copy link
Contributor

Apache9 commented Aug 13, 2020

I do not see why we need to change this. Consider the usage of InitMetaProcedure, if there is a meta directory there, it must be 'partial'. And even it is not partial, we are still safe to delete it, as there should be no data in it.

If this is not the case, we must have inconsistency in our cluster. For safety, I think we could skip the deletion of the meta directory, but we should still fail the procedure and thus fail the initialization of master. Users should fix the inconsistency before starting again.

@Apache9
Copy link
Contributor

Apache9 commented Aug 13, 2020

Anyway, let's get back to the initial goal? I think it is to support restart a cluster with nothing on zookeeper? Let's have an overall design first? As I said above, it is not easy to force all developpers to follow this rule. If we want to do this, we need to find a suitable way to force the later developpers know, they should not depend on the data on zookeeper...

@taklwu
Copy link
Contributor Author

taklwu commented Aug 13, 2020

should not depend on the data on zookeeper.

I agreed with you that we may not be ready to totally skip relying on the data stored on zookeeper, that's definitely a boarder discussion on what HBase currently depends on Zookeeper (branch-2 and master), especially if data on Zookeeper could be ephemeral or removed. (I thought we're in the progress of moving data into ROOT region, aren't we ? e.g. Proc-v2

Also, my initial goal is that the meta data/directory should not be deleted if possible, and we're trying to provide a persisted condition not to always delete meta if it's not partial (protected by the ZK data).

sorry that I may be newbie on the proc-v2 and zk data, should we start a thread on the dev@ list to discuss about the following ? (my goal is to find a consensus how we can move this PR further)

  1. should we delete meta directory when HM starts ?
  2. after 2.2+, should not depend on the data on zookeeper and have more of the info into proc-v2 in the master region?

@Apache9
Copy link
Contributor

Apache9 commented Aug 14, 2020

I think we'd better start from a high level design, on how we could start a cluster with no data on zookeeper. My proposal is always, introducing a new option in HBCK2 to reconstruct the data on zookeeper, but you actually start the cluster. If you guys think it is possible to do this in normal code path, please give an overall design. Open a google doc? So others could comments on it?

For deleting meta, I agree that we should not delete it if it is not 'partial'. But as I said above many times, if there is no inconsitency, scheduling an InitMetaProcedure implies that we do not have a meta table yet, so in InitMetaProcedure it is always safe to delete the meta directory. So even if we add a check in InitMetaProcedure to determine whether the meta table is 'real' partial, the result should be an ERROR log to tell users something crtical happens, please fix the inconsistency before moving forward.

I do not think this is what you guys want here? As you want to start the HMaster and recover from the inconsistency?

@taklwu
Copy link
Contributor Author

taklwu commented Aug 14, 2020

how we could start a cluster with no data on zookeeper?

IMO the title of the google design may be going on the cloud use cases that has been restarting on just HFiles without WAL and without Zookeeper (but all the user tables are flushed and disabled before terminating it). I knew that may not be mentioned in the book tutorial, and it may be a good time to clarify how that cases are actually working and some users has been using in HBase-1.4.x and HBase maybe before 2.1.7. Then we can see what the gaps maybe now in [updated]branch-2.4+ (we don't need that in 2.3.0 and 2.2.5 becase InitMetaProcedure does not have INIT_META_WRITE_FS_LAYOUT) to support it back (basically, that's the intention of having this PR and PR#2113 )

As you want to start the HMaster and recover from the inconsistency

what does inconsistency mean here? I see your point that using InitMetaProcedure#INIT_META_WRITE_FS_LAYOUT to indicate inconsistency, but if we don't delete meta and just starting the cluster, IMO HBCK -detail will show a clean result without any inconsistency? (correct me if I'm wrong) . For this topic, I may just start a email on if InitMetaProcedure should delete meta without checking partial and consistency, please bear with me, and this may be the only thing I want a quick discussion instead of a long design doc on the cloud use cases.

@Apache9
Copy link
Contributor

Apache9 commented Aug 14, 2020

The inconsistency is between the data on filesystem and zookeeper. In general, after the meta table has been initialized, it should have a directory on the file system, and also a znode to indicate its state on zookeeper.

The problem here is that, in your scenario, the data on zookeeper are all gone. This is what I called 'inconsistency'.

And I've also mentioned several times, it works for several versions does not mean it will work for all the versions. They are just different implementation choices, and you can not force the developpers to not use the data on zookeeper to determine what to do next.

So I do not think it worth to ask on the dev list about 'whether' the InitMetaProcedure should delete meta without checking 'partial' or not. The problem here is you delete the state znode for meta region on zookeeper, not a problem for InitMetaProcedure. Either you add it back through HBCK2, or you should have a overall design on how we could fix the problem, without using HBCK2.

@taklwu
Copy link
Contributor Author

taklwu commented Aug 14, 2020

I apologized for the dev@ email, but I was thinking differently overnight about your suggestion (sorry that I reread many times until I found the gap this morning)

You can see the code in finisheActiveMasterInitialization and also the code in AssignmentManager.start. In AssignmentManager.start, we will try to load the start of meta region from zookeeper, and if there is none, we will create a region node in offline state. And in finishActiveMasterInitialization, if we find that the state of meta region node is offline, we will schedule InitMetaProcedure.
So what you need to do here, is to put the meta region znode to zookeeper, before you restart the hbase cluster. So we will not schedule InitMetaProcedure again.

didn't the coming up master region that store the meta location in HBASE-24408 and PR#1746 solve our conflict of interests that we don't need to relaying on ZK for getting the server name (old host) for the meta region ? such even if we don't have the ZK, we can move on and don't submit the InitMetaProcedure because the state of the meta region is not OFFLINE.

if you confirm above, I may say bring this PR and keep highlighting the zookeeper discussion is my mistake and I should have learnt the master region ahead of this PR. (then we just need to move to the coming up version, and we can still restart on the cloud use cases)

[update] I will follow your suggestion and throws a exception if we find the meta table is not empty/partial

@Apache9
Copy link
Contributor

Apache9 commented Aug 15, 2020

didn't the coming up master region that store the meta location in HBASE-24408 and PR#1746 solve our conflict of interests that we don't need to relaying on ZK for getting the server name (old host) for the meta region ? such even if we don't have the ZK, we can move on and don't submit the InitMetaProcedure because the state of the meta region is not OFFLINE.

HBASE-24388 is only on a feature branch and we are still discussing whether to go this way... There is another solution to introduce a general root table, then we will have InitRootProcedure and the state of the root region is still on zookeeper. Of course I'm a sponsor of HBASE-24388 as it is implemented by me...
And I'm a fan of removing zookeeper dependency completely if possible(or at least just use it as an external storage so on cloud we could easily use other storage systems to replace it, such as dynamodb).

[update] I will follow your suggestion and throws a exception if we find the meta table is not empty/partial

Yes let's do this first to prevent damaging a cluster.

And in general, I do not think it is very hard to introduce a tool for your scenario? Just put a dummy meta state node on zookeeper can solve the problem. And for the regions, as said before, you'd better disable them allbefore shutting down the old cluster, and after starting the new cluster, enable them. Even if you do not do this, I think you could just scan the meta regions, to find out all the recorded region servers and schedule HBCKSCP for them to get the cluster back.

@taklwu taklwu force-pushed the HBASE-24833-branch-2 branch from ead2eee to 98119eb Compare August 18, 2020 06:23
@Apache-HBase
Copy link

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 31s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
_ branch-2 Compile Tests _
+1 💚 mvninstall 3m 36s branch-2 passed
+1 💚 checkstyle 1m 10s branch-2 passed
+1 💚 spotbugs 1m 59s branch-2 passed
_ Patch Compile Tests _
+1 💚 mvninstall 3m 10s the patch passed
+1 💚 checkstyle 1m 8s the patch passed
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 hadoopcheck 11m 32s Patch does not cause any errors with Hadoop 3.1.2 3.2.1.
+1 💚 spotbugs 2m 11s the patch passed
_ Other Tests _
+1 💚 asflicense 0m 15s The patch does not generate ASF License warnings.
32m 46s
Subsystem Report/Notes
Docker Client=19.03.12 Server=19.03.12 base: https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/2/artifact/yetus-general-check/output/Dockerfile
GITHUB PR #2237
JIRA Issue HBASE-24833
Optional Tests dupname asflicense spotbugs hadoopcheck hbaseanti checkstyle
uname Linux 7a93156de9ca 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 16:55:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2 / 7335dbc
Max. process+thread count 94 (vs. ulimit of 12500)
modules C: hbase-server U: hbase-server
Console output https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/2/console
versions git=2.17.1 maven=(cecedd343002696d0abb50b32b541b8a6ba2883f) spotbugs=3.1.12
Powered by Apache Yetus 0.11.1 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase
Copy link

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 1m 36s Docker mode activated.
-0 ⚠️ yetus 0m 8s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ branch-2 Compile Tests _
+1 💚 mvninstall 4m 14s branch-2 passed
+1 💚 compile 1m 4s branch-2 passed
+1 💚 shadedjars 5m 46s branch has no errors when building our shaded downstream artifacts.
-0 ⚠️ javadoc 0m 41s hbase-server in branch-2 failed.
_ Patch Compile Tests _
+1 💚 mvninstall 3m 59s the patch passed
+1 💚 compile 1m 5s the patch passed
+1 💚 javac 1m 5s the patch passed
+1 💚 shadedjars 5m 46s patch has no errors when building our shaded downstream artifacts.
-0 ⚠️ javadoc 0m 40s hbase-server in the patch failed.
_ Other Tests _
+1 💚 unit 127m 20s hbase-server in the patch passed.
154m 31s
Subsystem Report/Notes
Docker Client=19.03.12 Server=19.03.12 base: https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/2/artifact/yetus-jdk11-hadoop3-check/output/Dockerfile
GITHUB PR #2237
JIRA Issue HBASE-24833
Optional Tests javac javadoc unit shadedjars compile
uname Linux bd5910b16740 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 16:55:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2 / 7335dbc
Default Java 2020-01-14
javadoc https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/2/artifact/yetus-jdk11-hadoop3-check/output/branch-javadoc-hbase-server.txt
javadoc https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/2/artifact/yetus-jdk11-hadoop3-check/output/patch-javadoc-hbase-server.txt
Test Results https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/2/testReport/
Max. process+thread count 4189 (vs. ulimit of 12500)
modules C: hbase-server U: hbase-server
Console output https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/2/console
versions git=2.17.1 maven=(cecedd343002696d0abb50b32b541b8a6ba2883f)
Powered by Apache Yetus 0.11.1 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase
Copy link

💔 -1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 33s Docker mode activated.
-0 ⚠️ yetus 0m 7s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ branch-2 Compile Tests _
+1 💚 mvninstall 3m 49s branch-2 passed
+1 💚 compile 0m 58s branch-2 passed
+1 💚 shadedjars 5m 12s branch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 39s branch-2 passed
_ Patch Compile Tests _
+1 💚 mvninstall 3m 25s the patch passed
+1 💚 compile 1m 0s the patch passed
+1 💚 javac 1m 0s the patch passed
+1 💚 shadedjars 5m 10s patch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 38s the patch passed
_ Other Tests _
-1 ❌ unit 150m 2s hbase-server in the patch failed.
173m 32s
Subsystem Report/Notes
Docker Client=19.03.12 Server=19.03.12 base: https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/2/artifact/yetus-jdk8-hadoop2-check/output/Dockerfile
GITHUB PR #2237
JIRA Issue HBASE-24833
Optional Tests javac javadoc unit shadedjars compile
uname Linux 51731f86518e 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2 / 7335dbc
Default Java 1.8.0_232
unit https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/2/artifact/yetus-jdk8-hadoop2-check/output/patch-unit-hbase-server.txt
Test Results https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/2/testReport/
Max. process+thread count 3277 (vs. ulimit of 12500)
modules C: hbase-server U: hbase-server
Console output https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/2/console
versions git=2.17.1 maven=(cecedd343002696d0abb50b32b541b8a6ba2883f)
Powered by Apache Yetus 0.11.1 https://yetus.apache.org

This message was automatically generated.

Comment on lines 196 to 199
if (!isPartial) {
throw new IOException("Meta table is not partial, please sideline this meta directory "
+ "or run HBCK to fix this meta table, e.g. rebuild the server hostname in ZNode for the "
+ "meta region");
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Apache9, we're failing the InitMetaProcedure with an IOException, and HMaster will fail the master startup if InitMetaProcedure is FAILED with an exception.

Still, alternatively, we could continue the bootstrap without throwing (but this is not good as you recommended)

so, do you think this change align with your comments?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general the approach is fine. The only concern is the implementation. I do not think InitMetaProcedure support rollback, so what will happen if we throw exception here? You will see a ERROR log in the output to say that the procedure does not support rollback?

Copy link
Contributor Author

@taklwu taklwu Aug 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[updated] yeah, I saw the UnsupportedOperationException and the timeout in the unit test logs, then the master didn't stop, such that I added a section in HMaster to catch this exception and fail the master startup with another IOException.

Did I do it wrong ? or any idea how I can fail the procedure right?

    // wait meta to be initialized after we start procedure executor
    if (initMetaProc != null) {
      initMetaProc.await();
      if (initMetaProc.isFailed() && initMetaProc.hasException()) {
        throw new IOException("Failed to initialize meta table", initMetaProc.getException());
      }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would worry that the ProcedureExecutor is going to resubmit the IMP again. You might have to, instead, let IMP finish without an error and determine via some side-effect (e.g. set a variable in HMaster?) if we bailed out of IMP and then fail at that level. Truthfully, I don't remember what happens by default (if all procedures which throw an exception are implicitly retried.

Comment on lines 196 to 199
if (!isPartial) {
throw new IOException("Meta table is not partial, please sideline this meta directory "
+ "or run HBCK to fix this meta table, e.g. rebuild the server hostname in ZNode for the "
+ "meta region");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general the approach is fine. The only concern is the implementation. I do not think InitMetaProcedure support rollback, so what will happen if we throw exception here? You will see a ERROR log in the output to say that the procedure does not support rollback?

@z-york
Copy link
Contributor

z-york commented Aug 20, 2020

I have been watching this for a while, and I still don't have good clarity on the following case.

[update] I will follow your suggestion and throws a exception if we find the meta table is not empty/partial

Yes let's do this first to prevent damaging a cluster.

@Apache9 Can you explain how a partial meta table will damage a cluster? I have been looking at the code for the InitMetaProcedure and the only thing that I can see that is not idempotent is the directory creation for the NS (and the delete of the meta table, but that is being used as the fix for making things idempotent). If InitMetaProcedure is idempotent, there should be no need to delete the directory.

If we restart a cluster and schedule an InitMetaProcedure:

  1. Don't delete the meta table if it exists
  2. It will then move to Assign (which makes sense, to trigger an IMP it cannot already be assigned on a RS).
  3. Then it will move to create NS (which if I understand the logic correctly):
    a. it can remove the create directory if they already exist
    b. still overwrite the row in the meta table since it should be the same (or check to see if the row exists first if we want to avoid an overwrite).

After that case, if there is any inconsistency in the meta table, that can be handled via HBCK as usual.

A couple tenants we should always have:

  1. Any potential data loss or corrupting action should be taken by HBCK where the operator is making a conscious choice on how to fix a scenario.
  2. As more information is being stored in hbase:meta, it becomes harder (or at least more time consuming) to regenerate the information from the filesystem. hbase:meta should only be deleted/completely regenerated in HBCK.

Copy link
Contributor

@Apache9 Apache9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway, it is better than just removing meta with data. Let's do this first.

We can see how to improve this in the future.

@z-york
Copy link
Contributor

z-york commented Aug 20, 2020

I am currently -1 on the implementation pending answers to my questions. I have not been able to find any valid failure modes that will cause data loss or corrupting actions if hbase:meta is onlined. If this is true, we can make InitMetaProcedure idempotent (see above) and we can handle any InitMetaProcedure failure condition.

In my investigation I was actually looking at the master branch which contains an additional state for INIT_META_CREATE_NAMESPACES. There are additional bugs associated with this approach that I will open JIRAs for. I will not bring the additional complexity of folding the namespace table into the meta table into this discussion.

@taklwu
Copy link
Contributor Author

taklwu commented Aug 24, 2020

so.....how can we get consensus on this PR or this series of idempotent issues (for InitMetaProcedure) ? I don't mind to break them into more tasks as Zach has created those followup bugs (HBASE-24922 and HBASE-24923), but I don't see a clear agreement among everyone from whether we should continue the bootstrap or fail hard on the bootstrap when we find the meta table in InitMetaProcedure.

how do we move further?

@saintstack
Copy link
Contributor

I am currently -1 on the implementation pending answers to my questions. I have not been able to find any valid failure modes that will cause data loss or corrupting actions if hbase:meta is onlined. If this is true, we can make InitMetaProcedure idempotent (see above) and we can handle any InitMetaProcedure failure condition.

@z-york Is clumsy operator deleting the meta location znode by mistake a valid failure mode? Throw in a Master restart soon after (Will IMP run? If so, double assign of meta if the rest of the cluster was up at the time which can make for dataloss as the two meta Regions fight over what hfiles make up the meta Region).

The dataloss will be worse though if we just blanket delete meta dir if it exists already when IMP runs.

bq. ....but I don't see a clear agreement among everyone from whether we should continue the bootstrap or fail hard on the bootstrap when we find the meta table in InitMetaProcedure.

Dunno. Its called IMP. When it runs, there is supposed to be no meta. If there is, then something is not right: i.e. see above clumsy operator. Shouldn't remove the meta dir though if exists already? Fail the master startup? HBCK2 time?

Could do Zach's idea of making it idempotent but IMP scope does not cover writing location in zk so can't have this as a 'step' in IMP. What about adding extra step before assign where we wait asking Master a question about the cluster state such as if any of the RSs that are checking in have Regions on them; i.e. if Regions already assigned, if an already 'up' cluster? Would that help?

You fellows don't want to have to run a script beforehand? ZK is up and just put an empty location up or ask Master or hbck2 to do it for you? You just want the cluster to start up over hfiles? Thanks.

@taklwu
Copy link
Contributor Author

taklwu commented Dec 1, 2020

sorry for the delayed response.

Is clumsy operator deleting the meta location znode by mistake a valid failure mode ?

no this is a special case that we have been supporting, where the HBase cluster freshly restarts on top of only flushed HFiles and does not come with WAL or ZK. and we admitted that it's a bit different from the community stand points that WAL and ZK must be both pre-existed when master or/and RSs start on existing HFiles to resume the states left from any procedures.

What about adding extra step before assign where we wait asking Master a question about the cluster state such as if any of the RSs that are checking in have Regions on them; i.e. if Regions already assigned, if an already 'up' cluster? Would that help?

having extra step to check if RSs has any assigned may help, but I don't know if we can do that before the server manager find any region server is online.

You fellows don't want to have to run a script beforehand? ZK is up and just put an empty location up or ask Master or hbck2 to do it for you?

I think HBCK/HBCK2 is performing online repairing, there are few concerns we're having

  1. if the master is not up and running, then we cannot proceed
  2. even if the master is up, the repairing on hundreds or thousand of regions implies long scanning time, which IMO we can save this time by just reloading it from existing meta.
  3. having an additional steps/scripts to start a HBase cluster in the mentioned cloud use case seem a manual/semi-automated step we don't find a good fit to hold and maintain them.

Personally, it's fine to me with throwing exception as Duo suggested, and on our side we need to find a way to continue if we see this exception. then we improve it in the future when we need to completely getting rid of the extra step on hbck.

So, for this PR, if we don't hear any other critical suggestion, maybe I will leave it "close" as unresolved, do you guys agree ?

@Apache9
Copy link
Contributor

Apache9 commented Dec 2, 2020

sorry for the delayed response.

Is clumsy operator deleting the meta location znode by mistake a valid failure mode ?

no this is a special case that we have been supporting, where the HBase cluster freshly restarts on top of only flushed HFiles and does not come with WAL or ZK. and we admitted that it's a bit different from the community stand points that WAL and ZK must be both pre-existed when master or/and RSs start on existing HFiles to resume the states left from any procedures.
Yes, this is not a typical scenario in the open source version of HBase so I do not think adding the special logic in the open source version is a good idea. In the future new developers who do not know this background may change the code again and cause problems.

What about adding extra step before assign where we wait asking Master a question about the cluster state such as if any of the RSs that are checking in have Regions on them; i.e. if Regions already assigned, if an already 'up' cluster? Would that help?

having extra step to check if RSs has any assigned may help, but I don't know if we can do that before the server manager find any region server is online.

You fellows don't want to have to run a script beforehand? ZK is up and just put an empty location up or ask Master or hbck2 to do it for you?

I think HBCK/HBCK2 is performing online repairing, there are few concerns we're having

  1. if the master is not up and running, then we cannot proceed
  2. even if the master is up, the repairing on hundreds or thousand of regions implies long scanning time, which IMO we can save this time by just reloading it from existing meta.
  3. having an additional steps/scripts to start a HBase cluster in the mentioned cloud use case seem a manual/semi-automated step we don't find a good fit to hold and maintain them.
    I'm fine with adding a new command in HBCK2 to do these fix ups before starting a cluster. Personally I do not think HBCK2 'must' put all the fix logic at master side. Buy anyway, since the repo is called hbase-operator-tools, I think it is free for us to create a new sub module to place new scripts? Though for now it only happens on AWS, I think we could abstract it as a general scenario where we want to start a HBase cluster only on HFiles.

Personally, it's fine to me with throwing exception as Duo suggested, and on our side we need to find a way to continue if we see this exception. then we improve it in the future when we need to completely getting rid of the extra step on hbck.

So, for this PR, if we don't hear any other critical suggestion, maybe I will leave it "close" as unresolved, do you guys agree ?
This is a scenario for HBase on cloud, especially for AWS, so I think if you guys want to close it as unresolved, others will not have any strong opinon to object :) Take it easy.

@taklwu
Copy link
Contributor Author

taklwu commented Dec 28, 2020

thanks Duo for your understanding and comments ;) and happy holidays!

@z-york your point on idempotent is good, but at this point will you agree that we create a follow up on first add exception and discuss/handle this data loss issues in a different JIRA ? will you reconsider to change your -1 vote on this particular PR ? (btw I can rebase it after we agree this change)

@z-york
Copy link
Contributor

z-york commented Jan 14, 2021

I think we can agree to disagree on the automatic/idempotent IMP, we can propose a plan for that if I have more time to devote to it in the future. For now we can keep that as a patch. I think that this PR can still go in with throwing an exception instead of deleting meta (I don't see any reason to submit a separate PR, let's keep this discussion).

I think we should fail instead of allowing an automatic delete of the meta directory (or at least have an option to fail and not delete) since we are continuing to add more metadata into meta and it will become more and more costly to rebuild. I have seen many cases where operators clear out ZK nodes + restart master to unblock some assignment issues, but admittedly that is on 1.x versions of HBase, I think the recovery options might be better in 2.x.

We already have offline meta repair that I believe should be able to solve these issues if an exception is thrown.

btw @saintstack for the double assigned meta scenario wouldn't the IO fencing/lease on the meta WAL handle that? Or will it try to write to a unique WAL each assignment? Just curious, not blocking this PR.

@Apache-HBase
Copy link

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 1m 54s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
_ branch-2 Compile Tests _
+1 💚 mvninstall 3m 49s branch-2 passed
+1 💚 compile 3m 11s branch-2 passed
+1 💚 checkstyle 1m 10s branch-2 passed
+1 💚 spotbugs 2m 0s branch-2 passed
-0 ⚠️ patch 2m 9s Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary.
_ Patch Compile Tests _
+1 💚 mvninstall 3m 25s the patch passed
+1 💚 compile 3m 27s the patch passed
+1 💚 javac 3m 27s the patch passed
+1 💚 checkstyle 1m 14s the patch passed
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 hadoopcheck 12m 35s Patch does not cause any errors with Hadoop 3.1.2 3.2.1.
+1 💚 spotbugs 2m 19s the patch passed
_ Other Tests _
+1 💚 asflicense 0m 15s The patch does not generate ASF License warnings.
43m 19s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/4/artifact/yetus-general-check/output/Dockerfile
GITHUB PR #2237
JIRA Issue HBASE-24833
Optional Tests dupname asflicense javac spotbugs hadoopcheck hbaseanti checkstyle compile
uname Linux 95de4940b43d 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 16:55:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2 / 2bb7beb
Default Java AdoptOpenJDK-1.8.0_232-b09
Max. process+thread count 96 (vs. ulimit of 12500)
modules C: hbase-server U: hbase-server
Console output https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/4/console
versions git=2.17.1 maven=3.6.3 spotbugs=3.1.12
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase
Copy link

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 37s Docker mode activated.
-0 ⚠️ yetus 0m 8s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ branch-2 Compile Tests _
+1 💚 mvninstall 3m 46s branch-2 passed
+1 💚 compile 0m 59s branch-2 passed
+1 💚 shadedjars 6m 17s branch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 40s branch-2 passed
-0 ⚠️ patch 7m 10s Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary.
_ Patch Compile Tests _
+1 💚 mvninstall 3m 14s the patch passed
+1 💚 compile 0m 59s the patch passed
+1 💚 javac 0m 59s the patch passed
+1 💚 shadedjars 6m 1s patch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 36s the patch passed
_ Other Tests _
+1 💚 unit 138m 42s hbase-server in the patch passed.
164m 17s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/4/artifact/yetus-jdk8-hadoop2-check/output/Dockerfile
GITHUB PR #2237
JIRA Issue HBASE-24833
Optional Tests javac javadoc unit shadedjars compile
uname Linux e39ecde1ac07 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2 / 2bb7beb
Default Java AdoptOpenJDK-1.8.0_232-b09
Test Results https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/4/testReport/
Max. process+thread count 4342 (vs. ulimit of 12500)
modules C: hbase-server U: hbase-server
Console output https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/4/console
versions git=2.17.1 maven=3.6.3
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase
Copy link

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 1m 14s Docker mode activated.
-0 ⚠️ yetus 0m 8s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ branch-2 Compile Tests _
+1 💚 mvninstall 4m 46s branch-2 passed
+1 💚 compile 1m 11s branch-2 passed
+1 💚 shadedjars 7m 28s branch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 42s branch-2 passed
-0 ⚠️ patch 8m 21s Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary.
_ Patch Compile Tests _
+1 💚 mvninstall 4m 32s the patch passed
+1 💚 compile 1m 12s the patch passed
+1 💚 javac 1m 12s the patch passed
+1 💚 shadedjars 7m 27s patch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 41s the patch passed
_ Other Tests _
+1 💚 unit 197m 39s hbase-server in the patch passed.
228m 55s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/4/artifact/yetus-jdk11-hadoop3-check/output/Dockerfile
GITHUB PR #2237
JIRA Issue HBASE-24833
Optional Tests javac javadoc unit shadedjars compile
uname Linux 3d27205dc3dd 4.15.0-128-generic #131-Ubuntu SMP Wed Dec 9 06:57:35 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2 / 2bb7beb
Default Java AdoptOpenJDK-11.0.6+10
Test Results https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/4/testReport/
Max. process+thread count 2576 (vs. ulimit of 12500)
modules C: hbase-server U: hbase-server
Console output https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/4/console
versions git=2.17.1 maven=3.6.3
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@taklwu
Contributor Author

taklwu commented Jan 15, 2021

I have fixed the conflicts and will probably push the update in two days; let me know if you have any comments.

} finally {
  if (!isPartial) {
    throw new IOException("Meta table is not partial, please sideline this meta directory "
      + "or run HBCK to fix this meta table, e.g. rebuild the server hostname in ZNode for the "
Contributor

Should we mention the HBCK subcommand that helps fix this? (Is this offline meta repair?) Since we're adding a new failure mode, it should be clear how to fix this case.

Member

Instead of specific commands to run, what about referencing something in the book (which could be updated if "what to do" changes across release versions)?

Contributor Author

Good point on documentation. I need more time to figure out whether there are existing command(s)/option(s) other than manually sidelining the meta directory to a different location. Let's mark it as a blocker/requirement before merging this PR.

Stephen Wu added 2 commits September 7, 2021 10:11
…it's not partial

Add a check on meta bootstrap to skip removing the meta table directory
when entering state INIT_META_WRITE_FS_LAYOUT. Here, if any valid
HFiles are found in the meta table directory, the meta table is not partial
and an exception is thrown to crash the Master startup.
@taklwu taklwu force-pushed the HBASE-24833-branch-2 branch from 082d601 to b5010b2 on September 7, 2021 17:27
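
The check described in the commit message above can be sketched roughly as follows. This is a minimal, hypothetical illustration only: the class and method names (MetaBootstrapCheckSketch, isMetaDirPartial, checkBeforeDeletingMetaDir) are assumptions and do not mirror the actual InitMetaProcedure change; it only shows the idea of refusing to delete a meta table directory that still contains files.

// Hypothetical sketch of the "is the meta table directory partial?" check
// described in the commit message above; names and structure are illustrative only.
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class MetaBootstrapCheckSketch {

  // Returns true if the meta table directory is missing or contains no files,
  // i.e. the bootstrap was only partially written and the directory can be recreated.
  static boolean isMetaDirPartial(FileSystem fs, Path metaTableDir) throws IOException {
    if (!fs.exists(metaTableDir)) {
      return true; // nothing on disk yet, bootstrap can proceed
    }
    // Walk the directories under the meta table dir looking for any file.
    for (FileStatus region : fs.listStatus(metaTableDir)) {
      if (!region.isDirectory()) {
        continue;
      }
      for (FileStatus family : fs.listStatus(region.getPath())) {
        if (family.isDirectory() && fs.listStatus(family.getPath()).length > 0) {
          return false; // found at least one file, so treat meta as not partial
        }
      }
    }
    return true;
  }

  // Mirrors the behaviour described above: refuse to delete a non-partial meta directory.
  static void checkBeforeDeletingMetaDir(FileSystem fs, Path metaTableDir) throws IOException {
    if (!isMetaDirPartial(fs, metaTableDir)) {
      throw new IOException("Meta table is not partial, please sideline this meta directory "
        + "or run HBCK to fix this meta table before restarting the master");
    }
    // Otherwise the bootstrap code is free to delete and recreate the directory.
  }
}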
@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 12m 18s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
_ branch-2 Compile Tests _
+1 💚 mvninstall 5m 21s branch-2 passed
+1 💚 compile 4m 23s branch-2 passed
+1 💚 checkstyle 1m 40s branch-2 passed
+1 💚 spotbugs 3m 11s branch-2 passed
_ Patch Compile Tests _
+1 💚 mvninstall 5m 8s the patch passed
+1 💚 compile 3m 48s the patch passed
+1 💚 javac 3m 48s the patch passed
+1 💚 checkstyle 1m 17s the patch passed
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 hadoopcheck 13m 2s Patch does not cause any errors with Hadoop 3.1.2 3.2.1.
+1 💚 spotbugs 2m 22s the patch passed
_ Other Tests _
+1 💚 asflicense 0m 13s The patch does not generate ASF License warnings.
61m 19s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/5/artifact/yetus-general-check/output/Dockerfile
GITHUB PR #2237
JIRA Issue HBASE-24833
Optional Tests dupname asflicense javac spotbugs hadoopcheck hbaseanti checkstyle compile
uname Linux 1ef067af5fe6 4.15.0-147-generic #151-Ubuntu SMP Fri Jun 18 19:21:19 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2 / 62c58fc
Default Java AdoptOpenJDK-1.8.0_282-b08
Max. process+thread count 86 (vs. ulimit of 12500)
modules C: hbase-server U: hbase-server
Console output https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/5/console
versions git=2.17.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 34s Docker mode activated.
-0 ⚠️ yetus 0m 7s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ branch-2 Compile Tests _
+1 💚 mvninstall 4m 31s branch-2 passed
+1 💚 compile 1m 7s branch-2 passed
+1 💚 shadedjars 7m 10s branch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 44s branch-2 passed
_ Patch Compile Tests _
+1 💚 mvninstall 4m 6s the patch passed
+1 💚 compile 1m 7s the patch passed
+1 💚 javac 1m 7s the patch passed
+1 💚 shadedjars 7m 10s patch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 41s the patch passed
_ Other Tests _
+1 💚 unit 135m 40s hbase-server in the patch passed.
165m 57s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/5/artifact/yetus-jdk11-hadoop3-check/output/Dockerfile
GITHUB PR #2237
JIRA Issue HBASE-24833
Optional Tests javac javadoc unit shadedjars compile
uname Linux b6554c2ebd46 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2 / 62c58fc
Default Java AdoptOpenJDK-11.0.10+9
Test Results https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/5/testReport/
Max. process+thread count 3817 (vs. ulimit of 12500)
modules C: hbase-server U: hbase-server
Console output https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/5/console
versions git=2.17.1 maven=3.6.3
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 4m 13s Docker mode activated.
-0 ⚠️ yetus 0m 6s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ branch-2 Compile Tests _
+1 💚 mvninstall 3m 59s branch-2 passed
+1 💚 compile 1m 2s branch-2 passed
+1 💚 shadedjars 6m 36s branch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 44s branch-2 passed
_ Patch Compile Tests _
+1 💚 mvninstall 3m 29s the patch passed
+1 💚 compile 1m 3s the patch passed
+1 💚 javac 1m 3s the patch passed
+1 💚 shadedjars 6m 45s patch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 37s the patch passed
_ Other Tests _
+1 💚 unit 146m 53s hbase-server in the patch passed.
177m 48s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/5/artifact/yetus-jdk8-hadoop2-check/output/Dockerfile
GITHUB PR #2237
JIRA Issue HBASE-24833
Optional Tests javac javadoc unit shadedjars compile
uname Linux e6c4aa84f5e1 4.15.0-136-generic #140-Ubuntu SMP Thu Jan 28 05:20:47 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2 / 62c58fc
Default Java AdoptOpenJDK-1.8.0_282-b08
Test Results https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/5/testReport/
Max. process+thread count 4237 (vs. ulimit of 12500)
modules C: hbase-server U: hbase-server
Console output https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2237/5/console
versions git=2.17.1 maven=3.6.3
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

Contributor

@z-york z-york left a comment

+1 this looks reasonable

@taklwu
Contributor Author

taklwu commented Sep 9, 2021

@joshelser should we wait until the documentation is added in this commit, or should we have a follow-up PR?

@taklwu taklwu merged commit 3ce9052 into apache:branch-2 Oct 12, 2021
@taklwu
Contributor Author

taklwu commented Oct 12, 2021

Merged. If we need new documentation, we can have a follow-up JIRA.
